
DATA ANALYTICS 01

MORE MODELING
With the preprocessing operators we have discussed so far, you can blend and prepare most data
sets for building a predictive model. In this lecture, we will use one of the most widely used
machine learning methods, the Decision Tree, to predict who survived the Titanic disaster.
Of course, there is nothing you can do about this now, after the ship has sunk, but you
can still use such a model in similar situations and make predictions then. Should you really buy a
third class ticket when traveling with your family? The model will show!
RETRIEVE THE TITANIC DATA.

1. Drag the Titanic data into your process.


2. Add Set Role, connect it, and configure it as you did in the previous lecture. Change the
role of the attribute Survived to label.

Note: Remember that the attribute with the role label is the one you want to predict. Setting
the label is important because supervised machine learning methods, such as the decision tree
algorithm, use existing data with known label values (a training set) to find hidden patterns.
The resulting model then applies those patterns to predict the label for new data where it is
not known (the testing set).
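
The idea behind the label role can be sketched outside RapidMiner in a few lines of Python. This is only an illustration, not what the Set Role operator does internally: the rows are hypothetical toy values made up for this example (not the Titanic data), and the helper names split_by_label and features_and_label are assumptions introduced here.

```python
# Hypothetical toy rows; values are made up for illustration, NOT Titanic data.
rows = [
    {"Sex": "Female", "Passenger Class": "First", "Survived": "Yes"},
    {"Sex": "Male", "Passenger Class": "Third", "Survived": "No"},
    {"Sex": "Male", "Passenger Class": "Second", "Survived": None},  # label unknown
]

LABEL = "Survived"  # the attribute that plays the role "label"


def split_by_label(rows, label):
    """Separate rows with a known label (training) from rows without one."""
    training = [r for r in rows if r[label] is not None]
    unlabeled = [r for r in rows if r[label] is None]
    return training, unlabeled


def features_and_label(row, label):
    """Return the row without its label, plus the label value itself."""
    features = {name: value for name, value in row.items() if name != label}
    return features, row[label]


training, unlabeled = split_by_label(rows, LABEL)
features, target = features_and_label(training[0], LABEL)
```

A learner only ever sees the features of the unlabeled rows; the label column is what it has to fill in.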
REMOVE UNNECESSARY ATTRIBUTES.

1. Add Select Attributes to the process and connect it.


2. Set attribute filter type to subset and click Select Attributes.
3. In the resulting dialog, select the Survived, Sex, Passenger Class, Passenger Fare, and the
No of... parents, children, siblings, and spouses.

Note: You removed (didn't select) Life Boat because passengers who made it onto a life boat are
likely survivors. Adding this information would lead to a trivial model that depends almost
entirely on this one attribute. The real question is actually: who made it to a life boat in the
first place? Name and ticket number are different kinds of ID, so you left them out as well.
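
As a rough Python analogue of the subset filter, the sketch below keeps only a chosen list of attributes. The row is a hypothetical toy value made up for illustration, the column names "Name", "Ticket Number", and "Life Boat" are assumed spellings, and select_attributes is a helper invented for this example. The family-size attributes are left out of the sketch because the handout abbreviates their names.

```python
# Hypothetical toy row; values are made up for illustration, NOT Titanic data.
row = {
    "Name": "Passenger A",
    "Ticket Number": "0000",
    "Sex": "Female",
    "Passenger Class": "First",
    "Passenger Fare": 100.0,
    "Life Boat": "1",
    "Survived": "Yes",
}

# Attributes to keep; everything else (IDs, Life Boat) is dropped.
KEEP = ["Survived", "Sex", "Passenger Class", "Passenger Fare"]


def select_attributes(rows, keep):
    """Keep only the listed attributes, in the listed order."""
    return [{name: r[name] for name in keep} for r in rows]


selected = select_attributes([row], KEEP)
```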
BUILD A DECISION TREE MODEL.

1. Drag in the Decision Tree operator, connect the input, and connect the "mod" output
port to the results port.

Note that the data connections are blue while the model connections are green. This
makes it easy to find and verify the correct connection ports.

Dr. Stephan Kupsch

2. Run the process.


3. Inspect the decision tree model.

Note: It is interesting to see that for women, family size matters more than passenger class. This
behavioral pattern could not be detected for men. In general, men were less likely to
survive ("women and children first!").
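
To give a feeling for how a decision tree decides which attribute to split on first, here is a minimal Python sketch. It uses Gini impurity, one common split criterion; the criterion RapidMiner's Decision Tree operator is configured with may differ. The rows are hypothetical toy values made up for illustration (not the Titanic data), and all function names are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical toy rows (made up for illustration, NOT the Titanic data).
ROWS = [
    {"Sex": "Female", "Class": "First", "Survived": "Yes"},
    {"Sex": "Female", "Class": "Third", "Survived": "Yes"},
    {"Sex": "Female", "Class": "Third", "Survived": "No"},
    {"Sex": "Male", "Class": "First", "Survived": "No"},
    {"Sex": "Male", "Class": "Third", "Survived": "No"},
    {"Sex": "Male", "Class": "Third", "Survived": "No"},
]


def gini(rows, label):
    """Gini impurity of the label values in rows (0.0 means pure)."""
    counts = Counter(r[label] for r in rows)
    total = len(rows)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def partition(rows, attribute):
    """Group rows by the value they have for one attribute."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r)
    return groups


def weighted_gini(rows, attribute, label):
    """Impurity that remains after splitting rows on attribute."""
    groups = partition(rows, attribute)
    return sum(len(g) / len(rows) * gini(g, label) for g in groups.values())


def best_attribute(rows, attributes, label):
    """Attribute whose split leaves the least impurity behind."""
    return min(attributes, key=lambda a: weighted_gini(rows, a, label))


root_split = best_attribute(ROWS, ["Sex", "Class"], "Survived")
```

In this toy data, splitting on Sex separates the labels more cleanly than splitting on Class, so it would be chosen for the root node.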
By now you have learned how to use the most common data preprocessing operators
and have even built your first predictive model in RapidMiner. This is an exciting moment - celebrate!

TASKS:

• Can you find out how to restrict the depth of the decision tree, i.e. reduce its complexity?
Why could this be a good idea?
• Limit the depth of the decision tree to 4. Use the parameter setting you found above.
• Re-execute the process and look at the reduced tree. It should only have a depth of 4
now. The width of each colored bar in the tree represents how many passengers fall into
this bucket. Can you figure out who was the largest group of survivors and hence has the
highest likelihood to survive?
• What would you say was the rough probability for survival for this group? How does this
compare to the survival probability for men?
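
What a depth limit does, and how a survival probability can be read off a bucket, can be sketched in Python as follows. This is a deliberately simplified illustration, not RapidMiner's algorithm: it splits on the attributes in a fixed, given order instead of choosing the best split, and it counts the root as level 1, which is a convention of this sketch. The rows are hypothetical toy values made up for illustration (not the Titanic data), so the resulting numbers say nothing about the real answers to the tasks above.

```python
from collections import Counter

# Hypothetical toy rows (made up for illustration, NOT the Titanic data).
ROWS = [
    {"Sex": "Female", "Class": "First", "Survived": "Yes"},
    {"Sex": "Female", "Class": "First", "Survived": "Yes"},
    {"Sex": "Female", "Class": "Third", "Survived": "Yes"},
    {"Sex": "Female", "Class": "Third", "Survived": "No"},
    {"Sex": "Male", "Class": "First", "Survived": "No"},
    {"Sex": "Male", "Class": "First", "Survived": "No"},
    {"Sex": "Male", "Class": "Third", "Survived": "No"},
    {"Sex": "Male", "Class": "Third", "Survived": "No"},
]


def partition(rows, attribute):
    """Group rows by the value they have for one attribute."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r)
    return groups


def build_tree(rows, attributes, label, max_depth, depth=1):
    """Grow a tree; a node at level max_depth is always a leaf (root = level 1).

    Simplification: attributes are used in the given order, not chosen by a
    split criterion.
    """
    counts = Counter(r[label] for r in rows)
    if len(counts) == 1 or not attributes or depth >= max_depth:
        return {"counts": dict(counts)}  # leaf: label counts of this bucket
    first, rest = attributes[0], attributes[1:]
    children = {
        value: build_tree(group, rest, label, max_depth, depth + 1)
        for value, group in partition(rows, first).items()
    }
    return {"split": first, "children": children}


def tree_depth(node):
    """Number of node levels, counting the root as level 1."""
    if "children" not in node:
        return 1
    return 1 + max(tree_depth(child) for child in node["children"].values())


def survival_rate(leaf, positive="Yes"):
    """Fraction of a leaf's rows that carry the positive label."""
    total = sum(leaf["counts"].values())
    return leaf["counts"].get(positive, 0) / total


full = build_tree(ROWS, ["Sex", "Class"], "Survived", max_depth=10)
shallow = build_tree(ROWS, ["Sex", "Class"], "Survived", max_depth=2)
```

The shallow tree stops after the first split, so each leaf is a larger, less pure bucket. The share of survivors in a leaf is the rough survival probability the tree assigns to everyone who falls into that bucket.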
