Data Preparation
Using
RapidMiner
CE5807
Dr. Ng Hsiao Piau
(ng_h_p@nus.edu.sg)
Objectives
To carry out an ETL process using RapidMiner by extracting data from an Excel file,
transforming and preparing it for the next step (Modeling) in the CRISP‐DM model, and
finally loading (or storing) it into a local RapidMiner repository.
A summary of the steps involved in the following exercises:
1. Import data from files/sources.
2. Identify data quality issues.
3. Carry out data preparation (transformation and cleaning).
4. Store the data in a repository.
Learning Resources: https://docs.rapidminer.com/latest/studio/operators/
Getting Started
1. Install RapidMiner Studio on your workstation.
2. Download the zip file (containing the lab sheets and EXCEL file – Transactions.xlsx).
Understanding the Data
Our business scenario is a company that depends on subscriptions, so it is in the
interest of the business to retain existing customers. Let's carry out Data Understanding,
which is the step before Data Preparation in the CRISP‐DM model.
You should have downloaded the EXCEL file – Transactions.xlsx. Open it in Microsoft Excel.
Transactions Data in Excel View
The Transactions.xlsx file has 8 columns and 1000 rows of data. Each row contains
information about a single customer. There is an Age, a Gender, and a PostalCode column.
These three columns contain basic, mostly unchangeable information on each customer. The
rest of the data contains information about the buying behaviour of the customers and other
information.
It is easy to understand data within Excel by opening and viewing it. But we want to do more
than just look at the data. We need to analyze it in RapidMiner to get a detailed
understanding in order to identify any data quality issues. So first we need to import the data
into RapidMiner.
1.1 Import Data
1. Click the New Process icon in the toolbar and select a new blank process.
Create a new process in RapidMiner
2. Go to the Operators panel and search for the operator Read Excel by typing into the search
field the first few letters of the operator name, as shown below. Then drag and drop it
into the Process panel. You should now see a box corresponding to the Read Excel
operator.
Read Excel Operator
3. When you click the Read Excel operator, the Parameters panel on the right displays
the operator's configuration options. Click the Import Configuration Wizard button
to open Read Excel's Data Import Wizard. The wizard will guide you through the data
import process in several steps.
Data Import Wizard (Read Excel Operator)
4. At the first step of the wizard, select the file that you want to read by browsing to
the location of Transactions.xlsx. Click Next to proceed to step 2.
Data Import Wizard – Step 1
5. At step 2, you can select the cells to import. The complete Excel sheet is read by default,
which is what we want for this exercise. There is no selection needed in this case.
Additional info: You can also select a certain range of cells by clicking and dragging
with the mouse. To select all cells, use the keyboard shortcut Ctrl + A.
Data Import Wizard – Step 2
6. At step 3 of the wizard, you can assign annotations to the data rows. Ensure that the
annotation for the first data row is set to Name. This means that its values are used as
column headers.
7. Complete the remaining steps, then click Finish to end the wizard.
8. Connect the output port of the Read Excel operator to the process result port (res) on
the right, as shown below. In this way, the data will be available in the Results view for
visual inspection. Click the Run button and the Results view will be displayed.
Connect output port of Read Excel operator to process results port
Results View
We need to store this imported data in the RapidMiner repository so that we can transform
the data to be ready for modelling. Without storing the data in the repository, you
would need to rerun the Read Excel process whenever you want to import
or view the data.
9. Proceed to save the imported data in RapidMiner's local repository. In this way, we will
always have access to a static copy of the data, even if the original Excel file is later moved,
deleted or modified. Search for the Store operator and drag and drop it in the Process panel
on top of the connection line between the output port of the Read Excel operator and the
res port.
Store Operator
10. The last step is to specify the repository entry parameter, which indicates the location
where the data will be stored. Ensure that you set the location to the data folder, as shown
below, and name the entry churn data.
Repository Entry Parameter for Store Operator
Store Data in Local Repository
Save a Process
In exercise 1.1, we have learnt how to create a new process, load the data from
the Excel file and store it in the local repository.
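For readers who want to see the same import-and-store idea outside RapidMiner, here is a minimal Python sketch. It is only an analogy, not what RapidMiner does internally: a small in-memory CSV text stands in for the Excel sheet, a JSON file in a temporary folder stands in for the repository entry, and the names SOURCE_TEXT, extract, store and retrieve as well as the sample values are invented for illustration.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the spreadsheet: the first row holds the column names.
SOURCE_TEXT = "RowNumber,Age,Gender\n1,34,male\n2,,female\n"


def extract(text):
    """Read the rows, using the first row as attribute names (like the Name annotation)."""
    reader = csv.DictReader(io.StringIO(text))
    # Empty cells become None, similar to the question marks shown for missing values.
    return [{name: (value if value != "" else None) for name, value in row.items()}
            for row in reader]


def store(rows, path):
    """Save a static copy so that the import does not have to be rerun."""
    path.write_text(json.dumps(rows), encoding="utf-8")


def retrieve(path):
    """Load the stored copy again."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as folder:
    entry = Path(folder) / "churn_data.json"
    store(extract(SOURCE_TEXT), entry)
    restored = retrieve(entry)
```

Once the copy is stored, later steps can start from retrieve instead of the original file, which mirrors why we store the data before preparing it.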
1.2 Data Exploration
Let's open the Results view and take a look at the data.
1. Data View
This view shows the same data that we saw in Excel, but with the following
additional information reported by RapidMiner.
Data View
a) The number of examples (rows) and attributes (columns).
b) There are a lot of question marks (?) in the ChurnDate attribute. A question mark
indicates that RapidMiner does not have a value for this cell; it is missing. In the
case of ChurnDate, that is because there are no values for these cells in the
input data.
Note: Using the Filter box in the upper right, you can choose to show all examples, only
those with missing values, or only those with no missing values.
Additional Note:
If a column is too small, you can easily resize it by dragging the border between the
header fields. You can also rearrange the order of the columns. Please note that this
only changes the visualization; it does not change the data itself.
2. Statistics View
You may have questions like "What is the range of a numeric attribute?", "Which
values does a nominal attribute have?" etc. Such information is easily accessible in the
Statistics tab.
a) Click the list on the left to open the Statistics tab.
Statistics View
The Statistics tab lists all attributes: every column from the Data view is now
summarized in its own row. The left‐most column displays the attribute name, the
next column lists the data type, and the third column indicates the number of
missing values.
The right half of the view displays various statistics (depending on the data type):
- minimum, maximum, average, and standard deviation for numeric attributes
- date range and total duration for dates
- most and least frequent values, as well as the frequencies of the values, for
  nominal attributes
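To make these statistics concrete, the following Python sketch computes the same kind of summary for one numeric and one nominal column. It is only an illustration under stated assumptions: the sample values are invented, None stands for a missing cell, and the function names numeric_summary and nominal_summary are hypothetical, not part of RapidMiner.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample values; None stands for a missing cell ("?").
ages = [34, 52, None, 2, 234, 41]
genders = ["male", "female", "male", None, "weiblich", "male"]


def numeric_summary(values):
    """Missing count, minimum, maximum, average and standard deviation."""
    present = [value for value in values if value is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "average": mean(present),
        "deviation": stdev(present),  # sample standard deviation
    }


def nominal_summary(values):
    """Missing count, most and least frequent value, and all frequencies."""
    present = [value for value in values if value is not None]
    ordered = Counter(present).most_common()  # sorted by descending frequency
    return {
        "missing": len(values) - len(present),
        "most": ordered[0],
        "least": ordered[-1],
        "values": dict(ordered),
    }


age_stats = numeric_summary(ages)
gender_stats = nominal_summary(genders)
```

Even on this tiny sample, the minimum and maximum of the ages and the list of distinct gender values already hint at the kind of quality issues that the Statistics tab helps to reveal.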
To display more information, do the following:
b) Click an attribute to display more detail, for example, the data distribution in the form
of a histogram for Age.
c) Click Details to list all values in the data, together with their frequency, for nominal
attributes (e.g. Gender has 4 different values).
d) Click the Open visualisation link beneath a frequency chart to open the Charts
tab. In this example, we open the frequency chart for the Gender attribute.
e) To return to the Statistics tab, click the corresponding entry on the left.
Data Preparation
Remember that in our business scenario, our customers are subscribers. Based on their
actions, we want to predict who is likely to churn in the near future so that we can prevent
it by giving incentives and taking other marketing actions tailored to this group of customers.
If a customer has churned in the past, the ChurnDate tells you the date on which the
subscription ended. Otherwise, the customer is still an existing customer, and hence the churn
date is missing. Using the appropriate algorithms, you can find relationships between
attributes and the label. But before this can happen, we need to ensure that the input data
is as clean as possible.
2.1 Data Inspection
Let's use the Statistics tab to identify any data anomalies or quality issues.
a) Missing values
Age and Gender contain one missing value each. In this case, it means that we
do not have this information; one possible reason is that the customer
did not provide it during sign‐up.
ChurnDate contains a lot of missing values. In this case, a missing value in
ChurnDate has an important meaning: the respective
customer did not churn.
Note: This shows the power of the Statistics tab as in the Data tab it is virtually
impossible to spot one or two missing values in a data set with thousands of rows.
Basically, there are a few ways of dealing with missing values:
- Remove the attribute if it contains too many missing values.
- Introduce a new category value for missing values in a nominal attribute, e.g.,
  unknown. While this may not add any information for us, it gives RapidMiner a chance to
  create rules for examples having an unknown category. For the computer,
  unknown means something very different from missing.
- Remove the data rows with missing values of the attribute if there are only a few
  missing values, because we would not lose much information by removing
  them.
b) Data formatting issue
There is a formatting issue in the payment method attribute, which results in four different
payment methods instead of three.
c) Data range
Customer age should be somewhere between 16 and 110 years of age. Here we
have customers as young as 2 years and as old as 234 years. Obviously, we have
some wrong data here.
d) Gender
There should be exactly two different genders, but there are actually four. In
addition to male and female, there are gender values of mänlich and weiblich
(the genders in German). This may be due to one of our employees who comes from
Germany. A consistent naming convention should be set in this case.
e) Irrelevant attribute
HashCode attribute does not contain any valuable and/or interpretable
information.
f) Unique identifier
RowNumber uniquely identifies each customer, which can confuse the modeling
algorithms. We need to ignore it during predictive analysis, so
RapidMiner should treat this attribute as an ID.
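The checks in this section can also be expressed as a few small functions. The Python sketch below is an illustration only: the rows, the attribute name PaymentMethod and its values are invented, and None stands for a missing value. It counts missing values per attribute, flags ages outside the expected range of 16 to 110, and lists the distinct values of a nominal attribute so that inconsistent spellings become visible.

```python
# Hypothetical rows; None stands for a missing value.
rows = [
    {"RowNumber": 1, "Age": 34, "Gender": "male", "PaymentMethod": "cash"},
    {"RowNumber": 2, "Age": None, "Gender": "female", "PaymentMethod": " cash"},
    {"RowNumber": 3, "Age": 234, "Gender": "weiblich", "PaymentMethod": "cheque"},
    {"RowNumber": 4, "Age": 2, "Gender": None, "PaymentMethod": "credit card"},
]


def missing_counts(rows):
    """Number of missing values per attribute."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def out_of_range(rows, attribute, low, high):
    """RowNumbers whose value is present but lies outside [low, high]."""
    return [row["RowNumber"] for row in rows
            if row[attribute] is not None and not low <= row[attribute] <= high]


def distinct_values(rows, attribute):
    """Sorted distinct non-missing values of a nominal attribute."""
    return sorted({row[attribute] for row in rows if row[attribute] is not None})


missing = missing_counts(rows)
bad_ages = out_of_range(rows, "Age", 16, 110)
payment_values = distinct_values(rows, "PaymentMethod")
```

Printing payment_values with repr makes a value such as ' cash' stand out next to 'cash', which is how one formatting slip can turn three categories into four.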
2.2 Filter Missing Values
In the following steps we will walk through the process of dealing with each of the data quality
issues that we have just identified. Let’s begin by creating a new process that will handle all
the data preparation steps.
1. Create a new blank process and, in the Repository panel, search for the churn data
that we previously stored in the local repository. Once you've found it, drag
and drop it into the Process panel, and you will see that an operator named
Retrieve churn data is created. This serves the same purpose as the Read Excel operator in
the last exercise, but we will add a lot of extra steps to properly clean the data. Your
new process should look like the one shown below.
Retrieve Churn Data
2. Return to the Design view, search for the Filter Examples operator, and drag and
drop it into the Process panel.
3. Connect the output port of Retrieve churn data to the example set input port (exa)
of Filter Examples by clicking the output port and dragging a line
onto the exa port of Filter Examples, as in the following diagram.
Filter Examples Operator
4. Click the Filter Examples operator in the Process panel. Then go to the Parameters
panel and click Add Filters. Click the Add Entry button to create the filters needed to
clean the data, as shown below. These filters keep the data rows that meet the
stated conditions. In the first column, select an attribute; if nothing appears, type
in the name (note that the attribute name is case sensitive).
Filter Missing Values
Note: By default an example (data row) is matched (and kept) only if it matches all
conditions. You can change the behaviour to keep an example if it fulfils at least one
of the conditions by selecting the Match any option at the bottom of the dialog.
5. Save and apply the filters with the OK button.
6. Run the process and check the results. Your data set should now contain 998 examples
and the Statistics tab should not list any missing values other than ChurnDate.
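The filtering logic of this section, including the difference between matching all conditions and the Match any option, can be sketched in Python as follows. This is an illustration under stated assumptions, not RapidMiner's implementation: the rows are invented, None stands for a missing value, and filter_examples is a hypothetical helper.

```python
# Hypothetical rows; None stands for a missing value.
rows = [
    {"RowNumber": 1, "Age": 34, "Gender": "male"},
    {"RowNumber": 2, "Age": None, "Gender": "female"},
    {"RowNumber": 3, "Age": 51, "Gender": None},
    {"RowNumber": 4, "Age": None, "Gender": None},
    {"RowNumber": 5, "Age": 27, "Gender": "female"},
]

# One condition per filter entry: the attribute must not be missing.
conditions = [
    lambda row: row["Age"] is not None,
    lambda row: row["Gender"] is not None,
]


def filter_examples(rows, conditions, match_any=False):
    """Keep rows that fulfil all conditions, or at least one if match_any is set."""
    combine = any if match_any else all
    return [row for row in rows if combine(check(row) for check in conditions)]


kept_all = filter_examples(rows, conditions)
kept_any = filter_examples(rows, conditions, match_any=True)
```

With the default behaviour, only rows 1 and 5 survive; with match_any, only row 4 (missing both values) is dropped. For removing missing values, the default is the behaviour we want.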
2.3 Generate Attributes
Since a missing churn date means that the customer is still active, the entries that do not have
a churn date should not be removed.
We would want to create a model that simply predicts whether a customer will churn in the
near future or remain loyal. To train this model we need a new Churn attribute with the
following rule:
If the churn date is missing, then loyal; otherwise, churn.
The above rule will result in a nice and neat binominal attribute that contains exactly the
information that we want to predict later on ‐ the perfect label for our supervised learning
problem. In RapidMiner, new attributes can be generated with the Generate Attributes
operator. As input, it expects an example set; the output is an example set that contains the
attributes from the input as well as the newly generated attributes.
1. Return to the Design view, search for the Generate Attributes operator, and drag
and drop it into the Process panel. This operator should come after the Filter Examples
operator, as shown below. We will continue working on the existing process from the
previous step (there is no need to create a new process).
Generate Attributes Operator
2. As in the previous step, connect the ports in a similar manner, with the exa output
port of Generate Attributes connected to the result port of the process.
3. Configure the generation rules by clicking the Edit List button in the Parameters panel.
The function descriptions dialog opens. This is where you specify the generation rules.
In the left column, "attribute name," enter the name of the new attribute; in the right,
under "function expressions", specify the generation rule using RapidMiner’s
expression syntax.
Function Expressions: if(missing(ChurnDate), "Loyal", "Churn")
Edit List of Functions for Generate Attributes Operator
Note: To get editing help with the syntax, click the small calculator button next to
the function expression input field. This opens the Edit Expressions dialog.
Expression Function Editor
An expression function can contain simple mathematical calculations as well as
complex string transformations, if‐then constructs, date arithmetic, and much more.
You can browse the possible values for functions and inputs in the dialog at the bottom.
When you click the Apply button, RapidMiner runs a validation check. If everything is OK,
the expression is applied to the new attribute named in the previous step.
Explanation of the expression for the new Churn attribute
The missing function takes the value of the ChurnDate attribute as its argument and
returns true if that value is missing and false otherwise. The if function checks this
condition for each example and returns “Loyal” if the value is missing and “Churn” if it
is not.
4. Run the process and check the results.
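For comparison, the rule behind the expression if(missing(ChurnDate), "Loyal", "Churn") can be written as a short Python sketch. It is an illustration only: the rows and the date value are invented, None stands for a missing ChurnDate, and generate_churn is a hypothetical name.

```python
# Hypothetical rows; None stands for a missing ChurnDate.
rows = [
    {"RowNumber": 1, "ChurnDate": None},
    {"RowNumber": 2, "ChurnDate": "2015-03-01"},
]


def generate_churn(row):
    """Return "Loyal" if the churn date is missing, otherwise "Churn"."""
    return "Loyal" if row["ChurnDate"] is None else "Churn"


# Like Generate Attributes, keep the input attributes and add the new one.
with_churn = [{**row, "Churn": generate_churn(row)} for row in rows]
```

The output contains all original attributes plus the new Churn attribute, while the input rows are left untouched.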
2.4 Filter by Age
Some values of the Age attribute are out of range. A customer aged 234 is probably the
result of a typing error, and there are other customers as young as 2 years old. These
are invalid values (assuming our business terms and conditions only allow
customers aged 16 or older). We must get rid of the examples with an invalid Age attribute.
1. Extend the Filter Examples operator that we previously used to filter missing
values for Age and Gender. Click the Filter Examples operator and click the Add Filters
button to add entries that keep only examples with Age greater than or equal to 16 and
less than or equal to 110.
Note: You could also append a second Filter Examples operator (instead of extending
the existing one) to the end of the process. The new operator will contain only the age
filters.
Add More Filters to Remove Invalid Ages
2. Run the process and check the results. The data should now contain 996 data rows.
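The two age conditions amount to a simple range check. A minimal Python sketch, with invented rows and the hypothetical names MIN_AGE, MAX_AGE and has_valid_age, could look like this:

```python
MIN_AGE = 16   # lower bound used in the filter
MAX_AGE = 110  # upper bound used in the filter

# Hypothetical rows; None stands for a missing Age.
rows = [
    {"RowNumber": 1, "Age": 2},
    {"RowNumber": 2, "Age": 16},
    {"RowNumber": 3, "Age": 110},
    {"RowNumber": 4, "Age": 234},
    {"RowNumber": 5, "Age": None},
]


def has_valid_age(row):
    """True if Age is present and lies within the allowed range (bounds included)."""
    age = row["Age"]
    return age is not None and MIN_AGE <= age <= MAX_AGE


valid_rows = [row for row in rows if has_valid_age(row)]
```

Both bounds are inclusive, so customers aged exactly 16 or exactly 110 are kept.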
2.5 Replacing Invalid Genders
The Gender attribute also contains the German terms mänlich (for male) and weiblich (for
female). This can seriously confuse the modeling algorithms, as they will assume that there
are four different values for the Gender attribute. So we have to correct this by replacing the
German words with the English gender words.
1. Search for the Map operator and drag and drop it into the current process.
Map Operator
2. Connect the ports of the Map operator.
3. Copy the string from the Statistics tab. Follow the steps to do so:
a) Go to the Statistics tab.
b) Click in the row containing the Gender attribute to expand the listing
c) Scroll to the right and click Details.
d) In the dialog, select the values you want to copy and copy them into the
clipboard by pressing Ctrl + C
View Details of Statistics of an Attribute
4. Select the Map operator in the Process panel. Set the attribute
filter type to single and choose Gender as the attribute to map.
Set Parameters in Map Operator
5. Click on Edit List to provide the values for mapping as shown below.
Edit List for Value Mappings in Map Operator
6. Run the process and check the results.
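Conceptually, the value mapping is a lookup table from old values to new values. The Python sketch below illustrates the idea with the spellings used in this lab sheet; GENDER_MAP and map_gender are hypothetical names, and leaving unlisted values unchanged is an assumption of this sketch.

```python
# Old value -> new value, using the spellings that appear in the data.
GENDER_MAP = {"mänlich": "male", "weiblich": "female"}


def map_gender(value):
    """Replace a listed value; leave every other value as it is."""
    return GENDER_MAP.get(value, value)


# Hypothetical sample column.
genders = ["male", "weiblich", "female", "mänlich"]
mapped = [map_gender(value) for value in genders]
distinct_after = sorted(set(mapped))
```

After the mapping, only two distinct gender values remain, which is what the modeling algorithms should see.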
2.6 Data Type Conversion
To allow learning models to handle date timestamps better, we need to change the data type of
date-time attributes to a numerical type. One way to do this is to extract the week component
of the date time. In the case of the LastTransactionDate attribute, it is useful to extract weeks
relative to the Epoch, where the Epoch is defined as the date '01‐01‐1970 00:00'.
1. Search for the Date to Numerical operator and drag and drop it into the current process.
2. Set the parameters for the operator as shown below:
Parameters for Date to Numerical Operator
3. Run the process and check the results.
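To see what "weeks relative to the Epoch" means numerically, here is a small Python sketch. It assumes that the value is the number of whole weeks elapsed since 01‐01‐1970 00:00 and ignores time zones; RapidMiner's exact counting convention may differ, and weeks_since_epoch is a hypothetical name.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1, 0, 0)  # '01-01-1970 00:00'


def weeks_since_epoch(moment):
    """Whole weeks elapsed between the Epoch and the given date time."""
    return (moment - EPOCH).days // 7


first_week = weeks_since_epoch(datetime(1970, 1, 7, 23, 59))   # still week 0
second_week = weeks_since_epoch(datetime(1970, 1, 8, 0, 0))    # week 1 starts
millennium = weeks_since_epoch(datetime(2000, 1, 1, 0, 0))
```

The result is a plain integer that grows by one every seven days, so more recent transactions get larger values.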
2.7 Remove Redundant Attributes
During supervised learning, an algorithm tries to find relations between the input attributes
and the label attribute (Churn, which is assigned the label role in a later step). HashCode
may interfere with the algorithm, as it contains a seemingly random collection of numbers
and letters. It is derived from the other attributes via a so‐called hash function and is useful
for fast database searching or equality comparisons between several rows, but for supervised
learning it does not have any use. Therefore it should be removed.
We also need to remove the ChurnDate attribute from the data. Because Churn was derived
directly from ChurnDate, keeping it would let the algorithm predict the label trivially. By
removing it, we allow the learning algorithm to find real relations between properties such
as age, gender, etc., and the churn behavior.
1. Search for the Select Attributes operator and drag and drop it into the existing process.
Select Attributes Operator
2. Connect the ports of the Select Attributes operator.
Select the Attributes to Remove
3. Set the attribute filter type to subset. Tick the Invert Selection box to indicate that the
selected attributes are to be removed. Click the Select Attributes button and, in the resulting
dialog, move the attributes that you want to remove (HashCode and ChurnDate) to the right
hand side (from Attributes to Selected Attributes).
4. Run the process and check the results.
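The effect of a subset selection combined with Invert Selection can be sketched in Python as below. This is an illustration only: the row and the HashCode value are invented, and select_attributes with its invert_selection flag is a hypothetical helper.

```python
# Hypothetical row with invented values.
rows = [
    {"RowNumber": 1, "Age": 34, "HashCode": "a1b2c3", "ChurnDate": None, "Churn": "Loyal"},
]


def select_attributes(rows, subset, invert_selection=False):
    """Keep the attributes in subset, or drop exactly those if invert_selection is set."""
    chosen = set(subset)
    return [
        {name: value for name, value in row.items()
         if (name in chosen) != invert_selection}
        for row in rows
    ]


# Inverted selection: remove HashCode and ChurnDate, keep everything else.
cleaned = select_attributes(rows, ["HashCode", "ChurnDate"], invert_selection=True)
```

Inverting the selection is convenient here because naming the two attributes to drop is shorter than naming all the attributes to keep.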
2.8 Set Attribute Role: Define the Function of an Attribute
The rowNumber attribute is useful in that it contains a unique value for each data
row and can therefore identify each customer. We can keep it in the data set, but it should
not be used for model creation.
We would like to let RapidMiner know that we want to learn a rule for Churn and not for
Age, Gender, or another attribute in our data. We need to assign the predefined label role to
the Churn attribute so that RapidMiner knows Churn is our target variable. This means that the
modeling algorithms will try to find rules to guess the customer churn status from the other
regular attributes (this excludes special and label attributes).
1. Search for the Set Role operator and drag and drop it into the existing process.
Set Role Operator
2. In the Parameters panel, set the target role of the Churn attribute to label. Also set an
additional role of id for rowNumber, since it is a unique identifier.
Parameters for Set Role Operator
3. Run the process and check the results.
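What the roles mean for a learner can be illustrated with a short Python sketch: attributes with the id or label role are set aside, and only the remaining regular attributes serve as inputs. The row values are invented, and ROLES and split_by_role are hypothetical names.

```python
# Role assignments as configured in this section.
ROLES = {"rowNumber": "id", "Churn": "label"}

# Hypothetical row with invented values.
row = {"rowNumber": 7, "Age": 34, "Gender": "male", "Churn": "Loyal"}


def split_by_role(row, roles):
    """Separate regular attributes (model inputs) from special ones (id, label)."""
    regular = {name: value for name, value in row.items() if name not in roles}
    special = {roles[name]: value for name, value in row.items() if name in roles}
    return regular, special


inputs, special = split_by_role(row, ROLES)
```

Here inputs holds Age and Gender only, while special carries the identifier and the target value, so the identifier stays available without influencing the model.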
EXERCISES ‐ DO THESE ON YOUR OWN
2.9 Remove Leading Spaces for Payment Method Attribute
Search for the operator that can remove leading spaces. Try this step on your own.
2.10 Process Structure and Documentation
The following steps show how to polish the appearance of your process and operators.
1. Rename the operators to reflect their purposes.
2. Demonstrate one example of grouping operators into subprocesses.
3. Add documentation to the process.
2.11 Submission
Go to File and export the process (.rmp file).
Provide a screenshot of the final process in a Word document with your name.
Upload the .rmp file and WORD document to LumiNUS.