
PRODUCT ANALYSIS USING RAPIDMINER

Student Name
Contents
Input dataset of Product in RapidMiner simulation
Introduction
Knowledge report of the Decision Tree of Product Dataset
Decision tree
Dataset
Decision Tree of Survival rate in Product
Criterion for root node
Prediction Model
Prediction chart
Performance
K-means analysis
Differentiation
Collection of input (Data Table)
Results model
Parameters
K-model Validation
Dividing the data for training using the batch attribute
Random Forest Model Analysis
Input Product data
Parameters
Result of model
Support Vector Machine (SVM) Model
Training set input
Results model
Performance
Input dataset of Product in RapidMiner simulation
Here we select the Auto Model process.
Introduction
Now we select the specific models: decision tree, random forest, k-means, and SVM.

Knowledge report of the Decision Tree of Product Dataset


Decision tree:
A decision tree is a flowchart-like structure consisting of three kinds of nodes: root, intermediate,
and leaf nodes. The root node represents a splitting rule for one specific attribute and leads to
further intermediate nodes on the basis of that decision. Leaf nodes represent decisions or class
labels (what we want to find out), and branches represent conjunctions of features that lead to
those classes. Each internal node in the tree is a decision rule.
Starting from the root node, a feature is evaluated and one of the child nodes is selected. This
procedure is repeated until a final leaf is reached, which normally represents the target.
• Root Node: The node that starts the tree. It evaluates the variable that best splits the data.
• Intermediate Node: A node where variables are evaluated but which is not a final node.
• Leaf Node: A final node of the tree, where the prediction of a category or a numerical value is made.
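As an illustration of these node roles, a minimal decision tree can be trained and printed in Python. scikit-learn is assumed here purely for demonstration; the report itself uses RapidMiner.

```python
# Illustrative sketch (assumes scikit-learn): a tiny decision tree whose
# printed rules show the root split, an intermediate test, and the leaves.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: the class is 1 only when both features are 1 (a logical AND)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Root and intermediate nodes test a feature; leaves carry the class label
print(export_text(tree, feature_names=["f0", "f1"]))
print(tree.predict([[1, 1]]))  # the leaf reached decides the class: [1]
```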
For example, suppose we want to predict whether a person will pass or fail on the basis of a coin
flip. The root node evaluates the coin, the branches are its outcomes (Head or Tail), and the leaf
nodes Pass and Fail represent the result that we want to interpret:

Flip a Coin
  Head → Pass
  Tail → Fail

Dataset:
Here we plot a decision tree on the sample Product dataset from RapidMiner. It consists of 7 attributes
and 916 records. Among the attributes there is one special attribute, Survived, which represents
whether a passenger survived or not, and 6 regular attributes that play an important role in
predicting a passenger's survival rate.
Decision Tree of Survival rate in Product
Criterion for root node
This parameter selects the criterion by which Attributes are chosen for splitting. It can take one of the
following values:

• Information gain: The entropies of all the Attributes are calculated, and the Attribute with
the least entropy is selected for the split. This method has a bias towards selecting Attributes
with a large number of values.
• Gain ratio: A variant of information gain that adjusts the information gain of
each Attribute to allow for the breadth and uniformity of the Attribute values.
• Gini index: A measure of inequality between the distributions of label characteristics.
Splitting on a chosen Attribute results in a reduction of the average Gini index of the
resulting subsets.
• Accuracy: The Attribute that maximizes the accuracy of the
whole tree is selected for splitting.
• Least square: The Attribute that minimizes the squared
distance between the averages of values in the node and the true values is selected for splitting.
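The entropy and Gini calculations behind these criteria can be sketched directly. The functions below use the standard textbook definitions and are not RapidMiner's own code.

```python
# Illustrative sketch of the standard entropy and Gini index formulas
import math

def entropy(labels):
    """Shannon entropy of a list of class labels (lower = purer node)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gini(labels):
    """Gini index of a list of class labels (0 = pure node)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# A 50/50 node is maximally impure; a pure node scores 0 on both measures
print(entropy(["survived", "survived", "died", "died"]))  # 1.0
print(gini(["survived", "survived", "died", "died"]))     # 0.5
print(gini(["survived", "survived"]))                     # 0 (pure node)
```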

Prediction Model
Prediction chart
Performance
K-means analysis

k equal-sized subsets are created from the input Product Dataset. One of the k subsets is kept as the test
data set (i.e., the input of the Testing subprocess). The remaining k - 1 subsets serve as the training
data set (i.e., the input of the Training subprocess). The cross-validation procedure is then carried out k times,
with each of the k subsets used as the test data exactly once. A single estimate is created by
averaging (or otherwise combining) the k outcomes from the k iterations. The number of folds
parameter allows the value of k to be changed.
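The same k-fold procedure can be sketched in Python. scikit-learn and its bundled iris data are assumed here as stand-ins for the RapidMiner operator and the Product Dataset.

```python
# Sketch of k-fold cross-validation (scikit-learn assumed as a stand-in)
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

k = 5  # the "number of folds" parameter
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=k, shuffle=True, random_state=0))

# One outcome per iteration; the k outcomes are averaged into one estimate
print(len(scores), scores.mean())
```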
Differentiation
Split Validation
Similar to the Cross Validation operator, this operator divides the data just once, into a training set and a
test set. As a result, it is comparable to a single cross-validation iteration.
Split Data
With this operator, a Product Dataset is divided into various subsets. You can use it to carry out a
validation manually.
Bootstrapping Validation
This operator is comparable to the Split Validation operator. Instead of dividing its given Product
Dataset into several subsets, the Bootstrapping Validation operator obtains the training set by
bootstrapping sampling, i.e., sampling with replacement.
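Sampling with replacement can be sketched as follows; `resample` from scikit-learn is assumed here as a stand-in for the operator's internal sampling.

```python
# Sketch of bootstrapping: sampling with replacement from a data set
from sklearn.utils import resample

data = list(range(10))

# A bootstrap sample has the original size but may repeat items
boot = resample(data, replace=True, n_samples=len(data), random_state=0)
print(boot)

# Items never drawn ("out of bag") can serve as test data
oob = [d for d in data if d not in boot]
print(oob)
```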
Wrapper Split Validation
This operator is comparable to the Split Validation operator, but it additionally contains a separate
Attribute Weighting subprocess, so that the attribute weighting technique can be evaluated individually.
Collection of input (Data Table)
A Product Dataset is provided to this input port in order to perform the cross validation.

Results model
The prediction model trained on the whole Product Dataset is delivered via this port. Keep in mind that
the model is only built if this port is connected, so connect it if you truly require this model.
Performance (IO Object)
This port can be extended. Any performance vector (the output of a Performance operator) can be
attached to the inner Testing subprocess's output port. The average of the performances across the
fold iterations is delivered by the Cross Validation operator's performance output ports.
Example set (Product Dataset)
The same Product Dataset that was supplied as input is returned by this port.
Test set of results (Product Dataset)
This port delivers the merged test sets from all iterations of the cross validation.
Parameters
Split on batch attribute
If this option is enabled, then instead of splitting the data randomly, the Attribute with the special role
"batch" is used. This allows you to choose precisely which Examples are used to train the model in
each fold. In this case, none of the other split parameters are available.
Leave one out
If this option is enabled, only one Example from the input is used as the test set (i.e., the input of the
Testing subprocess). The remaining Examples make up the training data. This is repeated so that each
Example is used as test data exactly once. Since there are 'n' Examples in the input, the cross
validation is repeated 'n' times in total.
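This leave-one-out scheme can be sketched with scikit-learn's `LeaveOneOut` splitter, an assumed stand-in for the RapidMiner option:

```python
# Sketch of leave-one-out: each Example is the test set exactly once
from sklearn.model_selection import LeaveOneOut

X = [[1], [2], [3], [4]]  # n = 4 Examples

loo = LeaveOneOut()
for train_idx, test_idx in loo.split(X):
    print("train:", train_idx, "test:", test_idx)  # one Example per test set

print(loo.get_n_splits(X))  # n iterations in total: 4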
K-model Validation
The 'Deals' data set from the Samples folder is used in this tutorial process to demonstrate the
basic use of the Cross Validation operator.

The Cross Validation operator creates three subsets of the data. The subsets contain consecutive
Examples, because the sampling type parameter is set to linear sampling (check the ID Attribute).
Within the Cross Validation operator's Training subprocess, a decision tree is trained on two of the
three subsets.
In the Testing subprocess, the remaining subset is then used to calculate the decision tree's performance.
This is performed three times, so that each subset is used as a test set once.
The estimated performances are aggregated across the three iterations and delivered to the Process's
result port. Additionally, the decision tree that was trained using every Example is delivered to the
result port. The final outcome of the process is the merged test sets (the test result set output port of
the Cross Validation operator).
Experiment with the Cross Validation operator's parameters. The number of folds parameter determines
into how many subsets the input Product Dataset is divided; consequently, it is also the
cross-validation's iteration count. The subsets are formed differently depending on the sampling type.

Dividing the data for training using the batch attribute
This Process demonstrates how to use the Cross Validation operator's split on batch attribute parameter.
The Passenger Class Attribute is given the "batch" role, and the Product training data set is retrieved
from the Samples folder. Because the split on batch attribute parameter is set to true, the Cross
Validation operator splits the data set into 3 subsets, so that each subset contains Examples of only
one Passenger Class. Two of the subsets are used to train the decision tree inside the Training
subprocess; the tree is tested on the remaining subset in the Testing subprocess.
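The batch-wise split can be sketched with scikit-learn's `GroupKFold`, using a toy passenger class column as the batch attribute (all names below are illustrative, not the actual RapidMiner data):

```python
# Sketch of splitting on a batch attribute: each fold holds one batch value
from sklearn.model_selection import GroupKFold

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 1, 0, 1, 0, 1]
passenger_class = [1, 1, 2, 2, 3, 3]  # the "batch" attribute

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=passenger_class):
    # Every test fold contains Examples of exactly one passenger class
    print("test classes:", {passenger_class[i] for i in test_idx})
```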
Random Forest Model Analysis

The number of trees parameter specifies the number of random trees that make up a random forest.
These trees are built/trained on bootstrapped subsets of the Product Dataset supplied at the input
port. For each node of a tree, a splitting rule on a specific attribute is selected. Only a random subset
of the Attributes, whose size is governed by the subset ratio parameter, is taken into account when
choosing the splitting rule. According to the chosen criterion parameter, this rule separates the values
in the best possible way: for classification, the rule separates values belonging to different classes,
while for regression it separates them so as to reduce the estimation error. Trees are constructed in
this way until the stopping criteria are fulfilled.
Input Product data
Training set
The Product data used as input to create the random forest model.
Set of Product data
Through this port, the Product data that was provided as input is sent to the output unchanged.
weights (Attribute Weights)
A table of attributes with weight values, where each weight denotes the significance of the
corresponding attribute. The weight of an Attribute is determined by the sum of the improvements
achieved by nodes that split on that Attribute. The degree of improvement depends on the
criterion selected.
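The forest described above (bootstrapped trees, a random attribute subset per split, attribute weights from accumulated improvements) can be sketched with scikit-learn's `RandomForestClassifier`, an assumed stand-in for the RapidMiner operator:

```python
# Sketch of a random forest with the parameters discussed above
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_features="sqrt",  # attribute subset considered at each split
    bootstrap=True,       # train each tree on a bootstrapped subset
    random_state=0,
).fit(X, y)

# Analogue of the weights port: per-attribute importance, summing to 1
print(forest.feature_importances_)
```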
Parameters
• number_of_trees
• criterion (one of: information_gain, gain_ratio, gini_index, accuracy, least_square)
• maximal_depth
• apply_prepruning
• minimal_gain
• minimal_leaf_size
• minimal_size_for_split
• number_of_prepruning_alternatives
• apply_pruning
• confidence
• random_splits
Result of model
Support Vector Machine (SVM) Model

This operator makes use of Stefan Rüping's mySVM implementation in Java. This learning method offers a
fast algorithm and good results for a variety of learning problems, and it may be used for both
classification and regression. mySVM works with quadratic, asymmetric, and even linear loss
functions.
Numerous kernel types, including dot, radial, polynomial, neural, anova, epanechnikov, gaussian
combination, and multiquadric, are supported by this operator. These kernel types are explained in the
parameters section.
Training set input 
A Product Dataset is expected at this input port. This operator cannot handle nominal attributes; it may
be applied only to data sets with numerical attributes. Because of this, applying this operator is
frequently preceded by the Nominal to Numerical operator.
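That preprocessing step can be sketched in Python: one-hot encoding (an assumed stand-in for the Nominal to Numerical operator) converts the nominal attribute before a linear ("dot") kernel SVM is trained; scikit-learn replaces mySVM here.

```python
# Sketch: nominal-to-numerical conversion followed by a linear ("dot") SVM
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Toy data with a single nominal attribute
nominal = np.array([["red"], ["blue"], ["red"], ["green"]])
y = [1, 0, 1, 0]

# SVMs need numeric input, so encode the nominal values first
X = OneHotEncoder().fit_transform(nominal).toarray()

model = SVC(kernel="linear").fit(X, y)
print(model.predict(X))  # separable toy data: [1 0 1 0]
```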
Results model
This output port delivers the SVM model. The model can now be applied to unseen data sets.
Product data
Through this port, the Product data that was provided as input is sent to the output unchanged. This is
often done to examine the Product data in the Results Workspace or to reuse the Product data in
additional operators.
Estimated performance
This port provides an assessment of the SVM model's statistical performance in the form of a
performance vector.
Weights
The attribute weights are delivered via this port. This is possible only with the dot kernel type; other
kernel types cannot deliver attribute weights.
Performance
