
Influence of Coupons on Order Patterns

Data Mining Course Project

Jelena Nadj

Jelena Lazarevic

Skolkovo Institute of Science and Technology

Novaya 100
Moscow, Russia

IN2 d.o.o
Vladimira Popovica 40
11070 Belgrade, Serbia



In this report we describe our approach and results for the Data Mining course project, Influence of Coupons on Order Patterns. The project is taken from a competition organized by Prudsys AG.

Keywords: Data Mining, Coupons, Machine Learning, Data Cleaning

An interesting thing about coupons is how they affect the final value of the basket, and who actually uses them. Investigating dependencies in the couponing data can answer questions like "Who would make the purchase even without a coupon?" or "Who uses coupons?", which can then lead to smarter decisions about awarding customers with coupons and making a bigger profit.
In this project, we use the historical data provided for the competition to try to extract dependencies between variables and to build a model which predicts the probabilities that individual coupons will be used, as well as the final basket value.
In the process of getting from the raw data to the predicted values, we applied various transformations: created dummy variables, derived new variables, removed outliers, performed exploratory analysis to detect obvious relationships between the variables, etc. Finally, we built models using different algorithms and different sets of predictors and compared them.
As all the algorithms we tried had a higher error than the trivial baseline of predicting that none of the coupons will be used and that the basket value equals its mean, we did not detect any linear or non-linear relationship between the given variables and the target attributes.



Coupons are distributed by producers or retailers to be used to claim a lower price when purchasing products. With the accessibility of the internet, new ways of shopping and coupon distribution emerged, so coupons are now widely distributed over email, social media, blogs, applications, etc. The goal of coupons is to bring back customers who seem to have gone elsewhere, for example those who have not logged in for a long time. However, not everyone will spend their time claiming coupons; this is usually true only for price-conscious customers. For that reason, it is of great importance to determine which customers will actually spend (more) money when sent a coupon, and which would make their purchase even without the coupon, so that issuing them a coupon would only result in lower profit.
The idea of this project is to use the historical data of an online company to predict future purchases of new and returning customers. The data contains time stamps of coupon generation and purchase, variables describing the product category and brand, whether coupons were redeemed, and finally the basket value.
In order to answer these questions and make predictions
for the test set data which does not contain information
about coupon redemption or basket value, we went through
all the steps of data preparation, exploratory analysis in
order to recognize obvious dependencies between variables
and decide which variables should be used as predictors,
and finally building models. Different machine learning and
data mining techniques have been used in order to create
a set of models. We used linear models, random forests,
boosted trees, decision trees with different combinations of
predictors to build different models for all target attributes.
These models were compared on the training set. As we saw during the exploratory data analysis, there was no linear relationship between the variables, and even the non-linear algorithms did not manage to detect any pattern in the data. All the algorithms performed worse than the trivial case.
There were many challenges in this task. Some of the variables have missing data, which should be filled in to make better predictions. However, since the amount of missing data was high, we decided not to use those variables. Not all the variables should necessarily be used: high correlation between variables should be detected, while some of the variables should be transformed to make them suitable as predictors. Obviously, not all the variables have a logical connection with all target attributes. This will be discussed in more detail later in this report.

Background
Predictive modeling is one of the most developed areas of data science. The process of building a model from data consists of many steps; understanding the data and preprocessing it accordingly are among the most important ones.
Usually, various plots of the data are made in order to observe existing patterns or outliers [1]. Statistical properties of the variables can be inspected, and variables can be centered or scaled for easier manipulation. In most cases, some of the observations will have missing values, which make prediction harder if the missing values are in variables used as predictors. There are many ways of imputing missing values, from simply inserting the column average (in the case of numerical variables) to essentially making a prediction already at this level, based on the values in the other fields [1, 2]. In the preprocessing stage, as in all others, it is of crucial importance that the training and test sets are treated in the same way.
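As a minimal sketch of the column-average approach (pure Python with hypothetical helper names, not the R code used in the project), note how the fill value is computed on the training set and then reused on the test set, so both sets are treated the same way:

```python
# Impute missing values (None) in a numeric column with the column mean.
# The mean is computed on the training set only and reused on the test set.

def column_mean(values):
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def impute(values, fill):
    return [fill if v is None else v for v in values]

train_col = [2.0, None, 4.0, 6.0]
test_col = [None, 8.0]

fill = column_mean(train_col)           # mean of observed training values: 4.0
train_imputed = impute(train_col, fill)
test_imputed = impute(test_col, fill)   # test set uses the training mean
```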
Another possibility is to reduce the number of predictors, for example using the PCA method, creating new variables as weighted sums of the existing ones. This is usually done when there is high correlation between some variables. The goal is to create a new set of variables which explains as much of the variance in the data as possible. As the correlation between our variables was really low, we did not perform this reduction.
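Although we did not apply it here, such a PCA-style reduction can be sketched as follows (NumPy toy example; `pca_reduce` is an illustrative name, not a function from our pipeline):

```python
import numpy as np

# Toy PCA: project centered data onto the top-k principal directions,
# obtained from the SVD of the centered data matrix.
def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)               # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                  # scores on the first k components

# Two strongly correlated columns collapsed into a single component.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
Z = pca_reduce(X, 1)
```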
Regarding the modelling part, many algorithms exist for prediction. Linear regression is probably the simplest tool, yet it can be very powerful in many situations. [3] Models can also be built using decision trees, which give non-linear models and use interactions between the variables. [4] Random forests [5] are an upgrade on tree modeling: multiple trees are grown on bootstrapped samples of the data, with a random subset of variables considered at each split. The strength of this model is accuracy, but it can lead to overfitting and can be very slow. It is usually one of the top two prediction algorithms on problems similar to ours. Another very accurate approach is boosting [6], in which a lot of (possibly weak) predictors are weighted and added up with the goal of minimizing the error on the training set. Along with random forests, it is usually very accurate in prediction tasks.
The main challenge is to avoid overfitting [7]. Any model can be tuned to fit the training data perfectly, but after a certain complexity is reached, the error on the test set starts to grow, as shown in the literature. For that reason, test data should never be used for training; in our case this is not even possible, since we do not have the solutions. The common practice is to evaluate models on the training set only, using cross-validation, and to apply the winning model to the test set only once.
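This evaluation protocol can be sketched in pure Python (hypothetical helper names; in the project this was done with R packages). The trivial mean-predicting model serves as the baseline that every candidate model must beat:

```python
# Minimal k-fold cross-validation: models are compared on the training
# set only; the winner would be applied to the test set exactly once.
def k_folds(n, k):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cv_error(X, y, fit, predict, k=5):
    folds = k_folds(len(y), k)
    errs = []
    for held_out in folds:
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in held_out]
        errs.append(sum(abs(p - y[i]) for p, i in zip(preds, held_out)) / len(held_out))
    return sum(errs) / len(errs)

# "Trivial" baseline model: always predict the training mean.
fit_mean = lambda X, y: sum(y) / len(y)
predict_mean = lambda m, x: m

err = cv_error(list(range(10)), [2.0] * 10, fit_mean, predict_mean, k=5)
```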



Historical data on coupons generated over a period of several weeks is given for the task. One coupon applies to a single product. In addition, the orders placed in response to the coupon generation are given, including the information as to which coupons were redeemed and the total basket value of each order. Using this data, a model for predicting coupon redemption and the total basket value should be learned. The target attributes coupon1Used, coupon2Used and coupon3Used take the value 0 for coupons not redeemed and the value 1 for coupons redeemed. The remaining target attribute basketValue contains the total basket value of the order as a real number. For a portion of the coupon generations, an analysis of whether the coupons will be redeemed, as well as of the total basket value, should be made. Predictions are given for each order: predictions for coupon1Used, coupon2Used and coupon3Used should be in the interval [0, 1], and the total basket value can be any positive real number. The final goal is to minimize the total error over all orders. An example solution file was provided with the task description.

The solution will be evaluated using the following error function, summing normalized absolute deviations over all orders $i$ (the suffix $P$ denotes the predicted value):

$$
E = \sum_i \frac{|\mathit{coupon1Used}_i - \mathit{coupon1P}_i|}{\sum_j \mathit{coupon1Used}_j}
  + \sum_i \frac{|\mathit{coupon2Used}_i - \mathit{coupon2P}_i|}{\sum_j \mathit{coupon2Used}_j}
  + \sum_i \frac{|\mathit{coupon3Used}_i - \mathit{coupon3P}_i|}{\sum_j \mathit{coupon3Used}_j}
  + \sum_i \frac{|\mathit{basketValue}_i - \mathit{basketValueP}_i|}{\sum_j \mathit{basketValue}_j}
$$
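Assuming the true values and predictions are stored per column, the error function can be computed as follows (illustrative Python, not the organizers' scorer):

```python
# Competition error: for each target column, absolute deviations are
# summed over orders and normalized by the column total of true values.
def total_error(truth, pred):
    # truth / pred: dicts mapping column name -> list of values per order
    err = 0.0
    for col in truth:
        denom = sum(truth[col])
        err += sum(abs(t - p) for t, p in zip(truth[col], pred[col])) / denom
    return err

truth = {"coupon1Used": [0, 1, 1, 0], "basketValue": [100.0, 50.0, 150.0, 100.0]}
pred  = {"coupon1Used": [0, 1, 1, 0], "basketValue": [100.0, 50.0, 150.0, 100.0]}
# A perfect prediction gives zero error.
```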

Both training and test data are provided by the organizers. In order to build the model, we use different functions
implemented in R packages for cross-validation, data preprocessing, training the model, etc. Training data has 32
variables and 6052 observations. Test data has 28 variables
(just without the target variables) and 6690 observations.
Data codebook is submitted as a separate file, where all the
variables are described.



In this section we explain all the steps of preprocessing and analysis we went through in order to obtain the final models of the data. The algorithms used for creating the models are random forests, boosting, decision trees, logistic regression, and linear models.

Data Preprocessing

In the exploratory data analysis we examined different characteristics of the variables separately, and also as functions of other variables, to see whether any pattern could be noticed. First, we checked how often users come back, and whether anything can be inferred for a single user. After extracting the data for the user who visited the shop the highest number of times (call him 2bab1752b217fdd3704199dead8fa372), we did not notice a pattern in his behavior. He received several groups of coupons, several times, but his coupon redemption behavior varied, as did his total basket value. Within the same group of coupons he sometimes used them, sometimes not, and not always the same coupon, while the basket value varied from 211.8 to 755.8, with a mean of 506.9. We even found 8 rows that were completely identical, representing purchases of this one user, but with different values for coupon redemption. As most of the time the user used none of the coupons, or just one of them, we could tell that the coupons were not the reason for his purchases, but there was still no trivial way to predict whether he would use them. Most of the other users had only 1 or 2 purchases in the shop, so for them it was definitely impossible to learn habits from their orders. We decided not to use the userID variable as a predictor, since in our opinion it could only mislead the algorithms.
In the next step we discarded one observation where the total basket value was a strong outlier. Other observations did not have values that extreme, so we decided not to throw out anything else.
In order to see whether there is a correlation between variables, we plotted pairs of variables. In Figure 1 we can see the price of the first item the coupon was given for, for all orders, with redemption of that coupon represented by different colors. No connection was spotted here, as coupons were both redeemed and not redeemed across the whole spectrum of prices. Similar plots were obtained for all three coupons.







Figure 2: Coupon redemption as a function of the
time difference between getting the coupon and
making the purchase.
Figure 3 compares box plots of the total basket value depending on the number of coupons redeemed (0 - 3). We can see that the range of values varies significantly, but the mean basketValue, and even the 25th and 75th percentiles, are really close, so we did not conclude that there is a connection here. We have a similar plot in Figure 4, just for the redemption of coupon1, where similar observations were made and no obvious connection was inferred.
Mean values for the binary target attributes are coupon1Used = 0.24, coupon2Used = 0.19 and coupon3Used = 0.17.



In Figure 2 we plotted the difference, in seconds, between the time when the coupons were obtained and when the purchase was made, to see whether it somehow affects coupon redemption. Here we can also see that both colors are spread all over the plot.









Figure 1: Price of the first item the coupon was
given for and the coupon redemption.

Exploring the correlation matrix showed very low correlation between any two variables; even the highest pairwise correlation was low.
In order to use the variables categoryIDs1, categoryIDs2 and categoryIDs3, we transformed them into binary dummy variables indicating whether the product belongs to a given category or not. Other factor variables were also transformed into binary dummy variables, because some of the algorithms cannot handle factor variables with more than 50 levels.
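A minimal sketch of this dummy encoding (pure Python; `dummy_encode` is a hypothetical helper, with the levels fixed on the training data so that train and test are treated the same way):

```python
# Expand a categorical column into binary dummy columns, one per level.
# Levels are taken from the training data; unseen test levels map to all zeros.
def dummy_encode(values, levels):
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

train_cat = ["books", "toys", "books", "games"]
levels = sorted(set(train_cat))                        # ['books', 'games', 'toys']
train_dummies = dummy_encode(train_cat, levels)
test_dummies = dummy_encode(["toys", "food"], levels)  # "food" is unseen -> all zeros
```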
Three variables, brand1, brand2 and brand3, had about 25% missing values, so we decided not to use them. In [8], the brand of a product is described as an important variable when deciding whether to use a coupon. The easiest way of imputing the values would be to use the mode of the variable. However, imputing the values could introduce a lot of noise, as this would amount to treating the brand as another target attribute; if that prediction were not accurate, it could make the results much worse, especially if there is a relationship between the brand and coupon redemption as suggested in [8]. For that reason, we decided it would be better to completely exclude these variables from the set of predictors.

For training the models we used different algorithms. First, we tried the linear model, although it did not make much sense, since when exploring the data we found no strong linear connection between any of the variables.

As mentioned in the Background section, the algorithms which usually perform best on these kinds of problems are random forests and boosting methods. We built several models using the random forests algorithm, combining different predictors for all 4 target attributes. When using random forests, a function which uses cross-validation and different combinations of predictors was used to detect the best set of predictors for each target attribute. For boosting and decision trees we used 10-fold, 5 times repeated cross-validation.

Figure 3: Boxplots of total basket value for different numbers of coupons redeemed.

Figure 4: Boxplots of total basket value depending on the first coupon redemption.

In order to test the models produced, for every model we split the data into 4/5 for training and 1/5 for testing, in such a way that the target attribute is properly distributed in both subsets.
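Such a stratified 4/5 - 1/5 split can be sketched as follows (pure Python; `stratified_split` is an illustrative name, and in the project the split was done with R functions). Splitting each class separately keeps the share of redeemed coupons roughly equal in both parts:

```python
import random

# Split indices into 4/5 train / 1/5 test while keeping the share of
# each label value roughly equal in both parts (split each class separately).
def stratified_split(labels, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, int(len(idx) * test_frac))
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)

labels = [0] * 80 + [1] * 20            # e.g. a 20% redemption rate
train_idx, test_idx = stratified_split(labels)
```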



After getting the probabilities for coupon redemption, we tried fitting them using logistic regression, with the logic that small probabilities correspond to 0 and high ones to 1, so the error would be reduced by the difference.
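A toy version of this recalibration step (a one-dimensional logistic regression fitted by gradient descent; illustrative only, the project used R's logistic regression). Fitting the observed 0/1 labels against the raw probabilities pushes small probabilities toward 0 and large ones toward 1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit y ~ sigmoid(a*p + b) on raw probabilities p via gradient descent.
def fit_recalibration(p, y, lr=0.5, steps=2000):
    a, b = 1.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for pi, yi in zip(p, y):
            d = sigmoid(a * pi + b) - yi   # residual for this observation
            grad_a += d * pi
            grad_b += d
        a -= lr * grad_a / len(p)
        b -= lr * grad_b / len(p)
    return a, b

p = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]   # raw model probabilities
y = [0, 0, 0, 1, 1, 1]               # observed redemptions
a, b = fit_recalibration(p, y)
sharpened = [sigmoid(a * pi + b) for pi in p]
```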



For coupon redemption prediction, all the algorithms we tried gave results worse than the trivial case of predicting that none of the coupons will be used. The error $\sum_i |\mathit{coupon1Used}_i - \mathit{coupon1Predicted}_i| / n$ for our test subset of the training data, containing 1210 observations, was between 0.34 and 0.36 for different sets of predictors using the boosting method. The random forest algorithm had an error of 0.31 - 0.34 for different sets of predictors. Fitting the predicted probabilities to a binary logistic model reduced the error to 0.27. However, all of these are worse than simply setting all probabilities to 0, which gives an error equal to the mean, 0.23. For the other two coupons, the errors of the predicted results were similar, so the difference between them and the trivial solution was even larger, since those coupons are less often redeemed and the corresponding variables have lower means. For predicting coupon redemption probabilities we tried using only the variables directly related to the coupon, as well as all variables, since the variables of one coupon could indirectly affect the redemption of another. We did not notice a big difference between the different sets of predictors.
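For intuition, the trivial baseline error for a 0/1 target equals the target's mean, as a quick sketch shows (illustrative Python with made-up data at the 0.23 redemption rate of coupon1Used):

```python
# Trivial baseline: predict 0 for every coupon. Its mean absolute error
# equals the redemption rate, i.e. the mean of the 0/1 labels.
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y = [1] * 23 + [0] * 77          # 23% redemption rate
baseline = [0] * len(y)
err = mae(y, baseline)           # equals mean(y) = 0.23
```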
For modeling the basketValue variable we again tried using single variables, as well as different combinations of variables, as predictors. Models were fitted using boosting, random forests and decision trees. In all cases the prediction error was close to, or much higher than, the error of simply using the mean value of the variable. Decision trees simply gave a constant value very close to the mean, while the other algorithms predicted values that had no connection to the real values; sometimes the predicted value was higher, sometimes lower, and we did not manage to detect the reason for this behavior.



As stated in the presentation and in this report, we did not manage to detect any reasonable model that could represent the data and be used for predicting coupon redemption and basket value for new orders in the shop. As we extracted the data for individual users and did not notice a pattern even when all of the variables possibly used as predictors were the same, we do not expect that there is a general pattern. It is more plausible that each person has their own shopping habits and routine than that there is a general pattern for all people. Apart from patterns on the customer level, we also expected patterns on the level of the type of coupon, i.e. that a given coupon would usually be redeemed or not, but no such connection was detected.
On the other hand, we do believe there are ways in which customers can be influenced to use coupons, but the data representing those influences has not been included here. For example, it matters how the marketing is done and how products are promoted together, i.e. what is likely to be bought together. Also, the fact that we did not use three of the given variables could have an effect on the outcome, but we do not expect the effect to be strong, since in the extracted data where those values were present we did not notice a strong correlation with the values of the target attributes.
The fact that the data was really noisy and hard to work with made this task more interesting and educational. We went through the different approaches usually used in prediction tasks, and especially through a lot of preprocessing methods. Unfortunately, problems like this are always open-ended, and it is never known in advance whether any pattern really exists. While showing that a pattern exists when it really does is pretty easy, showing that there is no pattern is a complicated task, because there is always something that has not been tried yet.

References
[1] Andrew Gelman. Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics, 13(4), 2004.
[2] Yang C. Yuan. Multiple imputation for missing data: Concepts and new development (version 9.0). SAS Institute Inc, Rockville, MD, 2010.
[3] John Neter, Michael H. Kutner, Christopher J. Nachtsheim, and William Wasserman. Applied linear statistical models, volume 4. Irwin, Chicago, 1996.
[4] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[5] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[6] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(5):771–780, 1999.
[7] Douglas M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004.
[8] Caroline M. Henderson. Modeling the coupon redemption decision. Advances in Consumer Research, 12(1):138–143, 1985.