
DATA MINING

HEPATOCELLULAR CARCINOMA (HCC)
DATASET MODEL

ELENA BERRAL, DIEGO LÓPEZ AND JORGE GARCÍA


Contents
Background
DataSet Overview
Data Pre-processing
    Data Transformation
    Preliminary Analysis
    Feature Selection
    PCA
    Filter Methods
        Weight by Relief
        Weight by Correlation
        Weight by Information Gain
        Weight by Information Gain Ratio
Strategy for handling Missing Values
    Evolutionary method
Attributes Selected
Predictive Models
    Decision Tree
    Naïve Bayes
    WBayes
    Rule Induction
    Support Vector Machines
    Neural Networks
    Deep Learning
    AutoMLP
Evaluation of the models
Background

Liver cancer is the sixth most frequently diagnosed cancer, with Hepatocellular Carcinoma (HCC)
representing over 90% of all primary liver cancers. A carcinoma is a type of cancer that happens
when an epithelial cell undergoes a malignant transformation. In this case, the epithelial cell
affected is a hepatocyte (liver cell), resulting in HCC.

The tumour may grow as a single expanding mass within the liver, or several cancerous nodules may appear scattered throughout it. The second pattern is mostly seen in patients with cirrhosis, so cirrhosis necessarily has to be one of the attributes used in our model.

DataSet Overview

The dataset used to train and validate our predictive models was downloaded from the UCI Machine Learning Repository and was collected at a university hospital in Portugal [2].

It contains real clinical data from 165 patients diagnosed with HCC, gathering the attributes that expert physicians consider most relevant for choosing the correct treatment and predicting the outcome of patients suffering from HCC. Each instance has 49 attributes, so the ratio of instances to attributes is unfavourable from the start. Ideally, we would like to satisfy the following rule of thumb:

number of instances ≫ (number of attributes)²
Right now the dataset is far from fulfilling this rule of thumb (with 49 attributes it would require well over 2,400 instances, while we only have 165), which limits the quality of the models we can build; in the feature selection section we discuss how we tackled this problem.

This dataset is heterogeneous, with 23 quantitative variables and 26 qualitative variables; of the qualitative ones, 23 are binary (0 or 1, representing yes or no) and the other 3 are ordinal, representing a grade (for example, encephalopathy degree, ranging from 1 to 3).

Attribute | Type/scale | Range | Mean or mode | Missingness (%)
Gender | Qualitative/dichotomous | 0/1 | 1 | 0
Symptoms | Qualitative/dichotomous | 0/1 | 1 | 10.91
Alcohol | Qualitative/dichotomous | 0/1 | 1 | 0
HBsAg | Qualitative/dichotomous | 0/1 | 0 | 10.3
HBeAg | Qualitative/dichotomous | 0/1 | 0 | 23.64
HBcAb | Qualitative/dichotomous | 0/1 | 0 | 14.55
HCVAb | Qualitative/dichotomous | 0/1 | 0 | 5.45
Cirrhosis | Qualitative/dichotomous | 0/1 | 1 | 0
Endemic countries | Qualitative/dichotomous | 0/1 | 0 | 23.64
Smoking | Qualitative/dichotomous | 0/1 | 1 | 24.85
Diabetes | Qualitative/dichotomous | 0/1 | 0 | 1.82
Obesity | Qualitative/dichotomous | 0/1 | 0 | 6.06
Hemochromatosis | Qualitative/dichotomous | 0/1 | 0 | 13.94
AHT | Qualitative/dichotomous | 0/1 | 0 | 1.82
CRI | Qualitative/dichotomous | 0/1 | 0 | 1.21
HIV | Qualitative/dichotomous | 0/1 | 0 | 8.48
NASH | Qualitative/dichotomous | 0/1 | 0 | 13.33
Esophageal varices | Qualitative/dichotomous | 0/1 | 1 | 31.52
Splenomegaly | Qualitative/dichotomous | 0/1 | 1 | 9.09
Portal hypertension | Qualitative/dichotomous | 0/1 | 1 | 6.67
Portal vein thrombosis | Qualitative/dichotomous | 0/1 | 0 | 1.82
Liver metastasis | Qualitative/dichotomous | 0/1 | 0 | 2.42
Radiological hallmark | Qualitative/dichotomous | 0/1 | 1 | 1.21
Age at diagnosis | Quantitative/ratio | 20–93 | 64.69 | 0
Grams/day | Quantitative/ratio | 0–500 | 71.01 | 29.09
Packs/year | Quantitative/ratio | 0–510 | 20.46 | 32.12
Performance status | Qualitative/ordinal | 0, 1, 2, 3, 4 | 0 | 0
Encephalopathy | Qualitative/ordinal | 1, 2, 3 | 1 | 0.61
Ascites | Qualitative/ordinal | 1, 2, 3 | 1 | 1.21
INR | Quantitative/ratio | 0.84–4.82 | 1.42 | 2.42
AFP | Quantitative/ratio | 1.2–1,810,346 | 19299.95 | 4.85
Hemoglobin | Quantitative/ratio | 5–18.7 | 12.88 | 1.82
MCV | Quantitative/ratio | 69.5–119.6 | 95.12 | 1.82
Leukocytes | Quantitative/ratio | 2.2–13,000 | 1473.96 | 1.82
Platelets | Quantitative/ratio | 1.71–459,000 | 113206.44 | 1.82
Albumin | Quantitative/ratio | 1.9–4.9 | 3.45 | 3.64
Total Bilirubin | Quantitative/ratio | 0.3–40.5 | 3.09 | 3.03
ALT | Quantitative/ratio | 11–420 | 67.09 | 2.42
AST | Quantitative/ratio | 17–553 | 69.38 | 1.82
GGT | Quantitative/ratio | 23–1575 | 268.03 | 1.82
ALP | Quantitative/ratio | 1.28–980 | 212.21 | 1.82
TP | Quantitative/ratio | 3.9–102 | 8.96 | 6.67
Creatinine | Quantitative/ratio | 0.2–7.6 | 1.13 | 4.24
Number of nodules | Quantitative/ratio | 0–5 | 2.74 | 1.21
Major dimension | Quantitative/ratio | 1.5–22 | 6.85 | 12.12
Direct bilirubin | Quantitative/ratio | 0.1–29.3 | 1.93 | 26.67
Iron | Quantitative/ratio | 0–224 | 85.6 | 47.88
Sat | Quantitative/ratio | 0–126 | 37.03 | 48.48
Ferritin | Quantitative/ratio | 0–2230 | 439 | 48.48

Table 1. Attributes of the HCC dataset. The mean is reported for quantitative attributes and the mode for qualitative attributes. Note that gender is represented as 0 if the patient is male and 1 if female.

Data Pre-processing

Before we start creating and training predictive models, we must assess the quality of the data,
performing transformations and eliminating rows or columns where necessary.

Data Transformation

RapidMiner was not able to detect the missing values automatically, so we replaced them with question marks and declared that symbol as the missing-value marker so that the software could recognise them.

Secondly, all the data was imported as numerical, so the type of all the non-quantitative attributes had to be changed to ordinal or binominal. Furthermore, all numbers were imported as integers, so we had to change the type of the real-valued attributes from integer to real.

Preliminary Analysis

First, each variable was studied individually; due to the large number of attributes, we only show the analysis of the most important ones.

[Placeholder: insert figures and charts for some of the most important variables]


Feature Selection
As commented earlier, feature selection is needed in order to compensate for the low number of instances relative to the high number of attributes. This is not the only reason for feature selection, as it also increases the quality of our data, by:

Removing attributes with a high percentage of missing values: These attributes hinder the quality of our models because the information obtained from them is very limited. The few instances with recorded values for these attributes are not enough to extract knowledge that generalises to all HCC patients.

Removing attributes that are 0 for all instances: In some cases, all instances shared the same value or it was missing. Such attributes provide no information and will not help us gain knowledge about the problem at hand, so we proceeded to eliminate them.

Removing correlated attributes: Correlated attributes essentially give us the same information; even though their values differ, we gain the same knowledge by building our model with only one of the two variables. The variable considered more relevant was kept. In order to discover and confirm correlated attributes, a correlation matrix was computed. [Placeholder: insert correlation matrix and explain]

Removing attributes that give the same information: Some attributes duplicate the information of others. For example, the attribute Smoking (yes/no) tells us whether a patient smokes, and the attribute Packs/year tells us the same thing (whether the patient smokes or not). As Packs/year yields more information than Smoking, it is the attribute we kept, eliminating Smoking from the database. It is worth noting that it is easier to find data about whether a person smokes than about how many packs a year they smoke, so for future studies with different data it may be sensible to use Smoking instead.

After the elimination of these attributes, we still had many more attributes than would be ideal, so different filter methods were used to assess the quality of the remaining attributes so that we could keep the ones offering the most information.

PCA

Principal Component Analysis (PCA) reduces the number of attributes by projecting them onto a smaller set of principal components, each of which is a linear combination of the original attributes. The variance explained by each successive component is accumulated until the desired percentage of the total variance is reached. In our case we used 75% and obtained a total of [NUMBER] principal components. We are aware that 75% is a very low threshold for a PCA, but we made this decision due to the vast number of attributes compared to instances; it was necessary to find a compromise between the final number of attributes (which still does not fulfil the rule of thumb commented earlier) and a reasonable, though low, variance percentage. In order to perform the PCA, certain attributes had to be converted to numerical form, which was done with the Nominal to Numerical operator in RapidMiner. The reason is that PCA computes the principal components from the variance, which is defined only for numerical data, so nominal attributes are not supported. Missing values are not supported either, so the strategy commented in the “Strategy for handling Missing Values” section was used to replace them.
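As an illustration, a minimal scikit-learn sketch of this step (which we actually performed with RapidMiner operators) could look as follows; the file name and the column lists are assumptions for the example, and missing values are assumed to be read in as NaN:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column lists only; the real dataset has 49 attributes.
numeric_cols = ["Age at diagnosis", "Grams/day", "AFP"]
nominal_cols = ["Cirrhosis", "Symptoms", "Performance status"]

preprocess = ColumnTransformer(
    [
        # quantitative attributes: mean imputation, then standardisation
        ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", StandardScaler())]), numeric_cols),
        # qualitative attributes: mode imputation, then one-hot "numerization"
        ("nom", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), nominal_cols),
    ],
    sparse_threshold=0.0,  # keep the output dense so PCA can consume it
)

# n_components=0.75 keeps as many principal components as are needed
# to explain 75% of the total variance, mirroring the threshold above.
pca_pipeline = Pipeline([("prep", preprocess), ("pca", PCA(n_components=0.75))])

hcc = pd.read_csv("hcc-data.csv", na_values="?")  # hypothetical file; "?" marks missing values
components = pca_pipeline.fit_transform(hcc)
print(components.shape)                           # (165, number of retained components)
```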
Filter Methods

Weight by Relief

This method's strengths are its simplicity and effectiveness. In summary, Relief estimates the relevance of each attribute by sampling examples and comparing the attribute's value in the sampled example with its value in the nearest example of the same class and in the nearest example of a different class. This helps us estimate the quality of each attribute, which is useful when trying to reduce the number of dimensions by removing the attributes of lower quality.

Weight by Correlation

This operator estimates the quality of each attribute by computing its correlation with the label. The higher the weight, the higher the quality of the attribute and the more information it can yield. It is worth noting that, in order to apply weight by correlation, the nominal attributes have to be converted to numerical form, as correlation can only be computed on numbers.

Weight by Information Gain

Information gain weights each attribute by the reduction in the entropy of the label obtained when the attribute is known. A higher weight indicates higher quality and is therefore better.

Weight by Information Gain Ratio

Information Gain Ratio works in a very similar way to information gain, but with one clear difference: it penalises attributes that take a different value for many or all instances, that is, id-like attributes. We need to take into account that many attributes in our dataset are binary (0 or 1, representing yes or no), and gain ratio can give them an unfair advantage. This was taken into account when doing the feature selection, as we used the knowledge gained from all methods to manually select the best attributes for our model.
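A hedged sketch of how two of these filters could be reproduced outside RapidMiner is shown below (Relief and gain ratio have no direct scikit-learn equivalent and are omitted; X and y are assumed to be the imputed, fully numerical attribute matrix and the 1-year survival label):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def attribute_weights(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Per-attribute weights analogous to two of the filters described above."""
    # "Weight by Correlation": absolute correlation of each attribute with the label.
    correlation = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    # "Weight by Information Gain": mutual information between attribute and label.
    info_gain = mutual_info_classif(X, y, random_state=0)
    weights = pd.DataFrame({"correlation": correlation,
                            "information_gain": info_gain}, index=X.columns)
    return weights.sort_values("information_gain", ascending=False)
```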

Strategy for handling Missing Values

Earlier in this project we talked about eliminating whole attributes due to their high number of missing values, but we have not yet discussed the remaining attributes that still contain missing values. The strategy for handling missing values can vary depending on several factors. For example, some models are robust to missing values, and when applying them we used the data as-is, ignoring the missing values; this will be seen when we apply the decision tree model. Other strategies involve eliminating the rows (instances) that contain missing values, but this is not a good solution when no single instance contains many missing values. In our case, it was the attributes that in some cases presented high percentages of missing values, while in general every instance had a low percentage of missing values, so no instances were removed, as doing so could skew the data unnecessarily (missing values can be related to special cases). In most cases, when applying models that do not support missing values, we opted to replace missing values with another value. For ordinal or binominal attributes, the most common practice, and the one that suited our problem, is to substitute missing values with the mode, and that was our approach. For real-valued attributes, we simply computed the average of the attribute and used it to substitute the missing values.
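A minimal pandas sketch of this imputation strategy, assuming the qualitative and quantitative column lists are known, could be:

```python
import pandas as pd

def impute_hcc(df: pd.DataFrame, qualitative: list, quantitative: list) -> pd.DataFrame:
    """Mode imputation for qualitative columns, mean imputation for quantitative ones."""
    out = df.copy()
    for col in qualitative:
        out[col] = out[col].fillna(out[col].mode().iloc[0])   # most frequent value
    for col in quantitative:
        out[col] = out[col].fillna(out[col].mean())           # column average
    return out
```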

Evolutionary method

Attributes Selected

After taking into account all filter methods and removing the attributes as commented earlier, the final attributes with which we will work are the following:

We have removed [NUMBER] attributes and are finally left with [NUMBER] attributes, which will be used for our predictive models. It is worth noting that there is one exception: since decision trees have no problem working with many attributes relative to the number of instances, all attributes were used when creating the decision tree.

Predictive Models
In this part of the project we describe the models used to predict the class of our database; in our case, the class is whether the patient survives 1 year after diagnosis.

The same procedure was applied to all models. We used cross validation for training and testing, adding a Cross Validation operator. This operator divides the data into 10 folds and uses 9 for training and the remaining one for validation of the model; this is repeated 10 times so that every fold is used once for validation, and the accuracy is computed as the mean of these 10 tests.

Inside the Cross Validation operator there is a division between training (left) and testing (right). In the training window we placed the predictive model, and in the testing window we applied the model and computed its performance based on accuracy. The operator “Performance (Classification)” was used to compute the accuracy of the models.
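For reference, the same 10-fold protocol could be sketched in scikit-learn as follows (here with a decision tree as a placeholder model; X and y are assumed to hold the preprocessed attributes and the 1-year survival label, and the stratified folds are an assumption of this sketch):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# X: preprocessed feature matrix, y: 1-year survival label (0/1) -- assumptions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())   # mean accuracy over the 10 folds
```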

Decision Tree

We apply a decision tree using the “Decision Tree” operator in RapidMiner. As said earlier, decision trees are robust to missing values, so we did not have to eliminate them or substitute them with the mean or mode. Decision trees also support both nominal and numerical attributes, so no nominal-to-numerical conversion was necessary.

Decision trees follow a “divide and conquer” strategy. They are built hierarchically, creating branches from the root of the tree, which is selected as the attribute with the highest information gain. This process is repeated, creating more branches, until a final decision is reached. A certain level of impurity is tolerated in each node, as trying to learn the data perfectly would lead to overfitting (the algorithm memorises the data and is not able to predict similar cases correctly simply because they are not identical to the training data).
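As a hedged analogue outside RapidMiner (whose Decision Tree operator handles missing values natively), the configuration could look like this; the pruning values are purely illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" splits on information gain; the pruning parameters are
# illustrative ways of tolerating some impurity per node to limit overfitting.
tree = DecisionTreeClassifier(criterion="entropy",
                              min_samples_leaf=5,
                              max_depth=6,
                              random_state=0)
```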

Naïve Bayes

We applied Naïve Bayes with the “Naive Bayes” operator in RapidMiner. This method applies Bayes' theorem under strong independence assumptions. It is important to be careful with correlated attributes, as they would carry more weight in the decisions than they should, given that independence is assumed for all attributes. For this reason we previously eliminated the correlated attributes, keeping the ones of higher quality.
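A minimal sketch with a Gaussian Naive Bayes classifier (one possible counterpart of the RapidMiner operator; X and y as in the previous sketches):

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# GaussianNB assumes conditional independence of the attributes given the class,
# which is why the correlated attributes were removed beforehand.
nb_scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="accuracy")
print(nb_scores.mean())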

WBayes

We applied a Weka Bayesian network with the “W-BayesNet” operator. Note that this operator appears in white: it is not a native RapidMiner operator, so there were fewer options to configure it than with native RapidMiner operators. Bayesian networks are directed acyclic graphs where nodes represent variables, edges represent the probabilistic relationships between the nodes, and missing edges encode conditional independencies between the variables. They let us reason under uncertainty.

Rule Induction

A rule system is applied by using the “Rule Induction” operator in RapidMiner. Rule systems differ from decision trees in that they are not built hierarchically and do not have to provide a classification for every instance. This means they permit non-exhaustive coverage, so they are a good choice when attributes have many values but not all of them are relevant. They are based on coverage instead of the “divide and conquer” strategy commented earlier: rules are created in order to try to cover all instances of the training data set. Similarly to decision trees, they support nominal and numerical data, so no nominal-to-numerical conversion (or vice versa) was applied.
Support Vector Machines

A support vector machine (SVM) is applied by using the “SVM” operator in RapidMiner. SVMs construct hyperplanes in a high-dimensional space; a good classification is achieved by the hyperplane with the largest distance to the nearest training data point of any class. When the boundaries between classes cannot be defined by hyperplanes, kernels are used, but in our case we decided this was not needed.

In their favour, SVMs do not need as much training data as neural networks, which fits us well since we do not have many instances, but they also act as black boxes. We had to be careful with the “curse of dimensionality”, which is why we applied the model after the feature selection phase, reducing the dimensionality considerably.
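A hedged linear-SVM sketch consistent with these decisions (the scaling step is an extra assumption of the sketch, not something stated above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# kernel="linear": separating hyperplanes in the original feature space,
# matching the decision not to use a kernel.  Scaling is added because
# SVMs are sensitive to attribute ranges.
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
```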

Neural Networks

A neural network is applied by using the “Neural Net” operator. Neural networks try to imitate the behaviour of real neurons in order to perform the predictions. The drawback of this model is that it acts as a black box, not letting us extract as much knowledge as we would like. On the other hand, they are very good universal approximators, so they are suitable for predictive tasks. The weights are the long-term memory of the network, and they are tweaked as the training phase progresses until a good accuracy is reached. In the simplest case, each neuron provides an output value, which can be excitatory or inhibitory. For this purpose an activation function is used, which can be, for example, a step function or a sigmoid function.

The network is trained by minimising the error between the expected and the actual output using gradient descent; this error is calculated by means of least mean squares.

We must take into consideration that neural networks often need large amounts of training data, and our dataset is not particularly large. Different data augmentation techniques could be studied in order to try to improve the performance of our models and to avoid overfitting.
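A small multilayer-perceptron sketch is given below; the layer size, activation and solver are illustrative assumptions, and note that scikit-learn's MLPClassifier minimises log-loss rather than the least-mean-squares error described above:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(
    StandardScaler(),                        # gradient-descent training benefits from scaling
    MLPClassifier(hidden_layer_sizes=(10,),  # one small hidden layer (illustrative)
                  activation="logistic",     # sigmoidal activation, as described above
                  solver="sgd",              # stochastic gradient descent
                  learning_rate_init=0.01,
                  max_iter=2000,
                  random_state=0),
)
```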

Deep Learning

Deep learning was applied by using the “Deep Learning” operator. This method is similar to neural networks but applies a cascade of multiple layers of non-linear processing units, which are normally neurons.

AutoMLP

We applied AutoMLP with the “AutoMLP” operator. This model combines ideas from genetic algorithms and stochastic optimisation: it trains a small ensemble of networks in parallel, automatically adjusting their learning rates and sizes [1].
Evaluation of the models

Models were evaluated with a paired T-Test operator applied to the performance (accuracy) output of the cross validations, in order to see whether there were statistically significant differences between them and which model was best in terms of accuracy.
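A hedged sketch of such a comparison outside RapidMiner, pairing the per-fold accuracies of two of the models over the same folds (X and y as in the earlier sketches):

```python
from scipy.stats import ttest_rel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both models
acc_tree = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
acc_nb = cross_val_score(GaussianNB(), X, y, cv=cv)
t_stat, p_value = ttest_rel(acc_tree, acc_nb)
print(p_value)   # a small p-value suggests a statistically significant difference
```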

Combination of models

We aimed to obtain better results by combining various models instead of using each model
separately.

Bagging

With this technique, bootstrap sampling is used to generate several training sets. Each pair of sets shares a high percentage of the data, so the models learned from them should be similar to each other; however, this is not always the case, and sometimes significantly different models are obtained.
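A minimal bagging sketch with decision trees as base learners (the number of estimators is an assumption):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 25 trees, each trained on a bootstrap sample of the training set;
# predictions are combined by majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                            n_estimators=25, bootstrap=True, random_state=0)
```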

AdaBoost

In boosting, a weight is assigned to each instance of the training set, and the procedure runs over a series of iterations. In each iteration a model is learnt by minimising the sum of the weights of the instances misclassified in the previous iteration. In addition, in every iteration the weights of the incorrectly classified examples are increased, so that the next model focuses on them.
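As a hedged sketch (with scikit-learn's default weak learner, a decision stump):

```python
from sklearn.ensemble import AdaBoostClassifier

# Each iteration reweights the training instances so the next weak learner
# focuses on the previously misclassified examples.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
```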

Vote

The “Vote” operator in RapidMiner uses an approach where the different algorithms each vote for a classification (1 or 0, survives or not) and the observation is assigned to the class that receives the most votes.
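A hedged sketch of such a majority vote; the member models listed are illustrative:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

vote = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("svm", SVC(kernel="linear"))],
    voting="hard",   # each member casts one vote; the majority class wins
)
```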

References
[1] Breuel T, Shafait F. AutoMLP: Simple, effective, fully automated learning rate and size adjustment. The Learning Workshop, Utah, USA; 2010.

[2] Santos et al. A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. Journal of Biomedical Informatics. 2015;58:49-59.

[3] Pérez-Cruz PE, Acevedo F. Escalas de estado funcional (o performance status) en cáncer [Functional status (performance status) scales in cancer]. Gastroenterol Latinoam. 2014;25:219-26. doi: 10.0716/gastrolat2014n300007
