
DOI 10.1007/s10559-023-00569-z
Cybernetics and Systems Analysis, Vol. 59, No. 2, March, 2023

NEW MEANS OF CYBERNETICS, INFORMATICS, COMPUTER ENGINEERING, AND SYSTEMS ANALYSIS

CLASSIFICATION OF PATHOLOGIES ON MEDICAL IMAGES USING THE ALGORITHM OF RANDOM FOREST OF OPTIMAL-COMPLEXITY TREES

V. Babenko,1† Ie. Nastenko,1‡ V. Pavlov,1†† O. Horodetska,1‡‡ I. Dykan,2† B. Tarasiuk,2‡ and V. Lazoryshinets3        UDC 004.048+616-079.4

Abstract. The authors propose an approach to the construction of classifiers in the class of Random
Forest algorithms. A genetic algorithm is used to determine the optimal combination and
composition of ensembles of features in the construction of forest trees. The principles of the group
method of data handling are used to optimize the structure of the trees. Optimization of the tree
voting procedure in the forest is implemented by the analytic hierarchy process. Examples of the use
of the proposed algorithm for the detection of pathologies on medical images are provided, as well
as the classification results in comparison with other known analogs.

Keywords: pathology classification, medical images, Random Forest, genetic algorithm, group
method of data handling, analytic hierarchy process.

INTRODUCTION

The development of information and computing technologies has opened up prospects for solving problems of identification and forecasting of the state of objects of various scale and complexity in a wide range of subject areas. The resources of modern computer systems make it possible to synthesize artificial intelligence models and machine learning algorithms that solve applied problems with a high level of quality [1]. The emergence of super-large data sets has led to a steadily growing demand for these algorithms due to their ability to process data efficiently in intellectual analysis tasks.
Currently, for the synthesis of artificial intelligence models, the following approaches are most effective: ensemble
learning algorithms (boosting [2, 3], Random Forest [4, 5], stacking [6]) and deep learning algorithms (convolutional [7, 8]
and recurrent [9] neural networks). These approaches compete with each other, each having its own advantages and disadvantages.
Ensemble learning is promising for analysis, since the methodologies for creating a set of models and for the convolution of ensembles suggest suitable techniques for interpreting the resulting solutions. In addition, as evidenced by the practice of data forecasting competitions on the Kaggle service [10], ensemble algorithms of the Random Forest family are among the most efficient for solving most problems [11].
The purpose of this work is to improve ensemble-learning approaches in the class of Random Forest algorithms in order to achieve higher accuracy in solving classification problems. This is especially relevant in the
1 National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute,” Kyiv, Ukraine, †vbabenko2191@gmail.com; ‡nastenko.e@gmail.com; ††pavlov.vladimir264@gmail.com; ‡‡o.nosovets@gmail.com. 2 Institute of Nuclear Medicine and Radiation Diagnostics, National Academy of Medical Sciences of Ukraine, Kyiv, Ukraine, †irinadykan@gmail.com; ‡btarasyuk13@gmail.com. 3 Amosov National Institute of Cardiovascular Surgery, National Academy of Medical Sciences of Ukraine, Kyiv, Ukraine, lazorch@ukr.net. Translated from Kibernetyka ta Systemnyi Analiz, No. 2, March–April, 2023, pp. 190–202. Original article submitted July 29, 2022.



problems of pathology recognition on medical images [5, 12, 13], which provide decision support during patient diagnosis. In addition, the development of high-performance classification algorithms is relevant for application in non-invasive approaches to the diagnosis of pathologies [13, 14]. The development of non-invasive diagnostic methods is very important for patients because invasive approaches [14], although considered more accurate, can cause significant harm to the human body.

ANALYSIS OF STUDIES DEVELOPING THE CONCEPT OF RANDOM FOREST ALGORITHMS

According to the principle of ensemble learning, a problem is solved not by a single model but by many models, which are called “weak learners.”
The main hypothesis is that by optimally combining an ensemble of models, more accurate and reliable forecasting results can be obtained. The ensemble learning meta-algorithm contains three main procedures for combining weak learners: bagging, boosting, and stacking.
Random Forest is the most prominent representative of bagging, whose main principle is the use of homogeneous weak learners. They are trained in parallel and independently of each other, and the resulting classifiers are then combined by a deterministic averaging process. This algorithm was introduced by Tin Kam Ho [4] in 1995 and is still relevant in modern applications.
However, the Random Forest has certain disadvantages, which various researchers have attempted to eliminate.
In [15], an improved Random Forest algorithm was proposed in which multidimensional linear threshold functions serving as the split functions of the decision tree are optimized. The authors note that the information gain criterion used to create the trees is discontinuous and difficult to optimize, so they suggest optimizing its continuous upper bound instead. This provides a significant improvement in classification results compared to the base Random Forest.
The study [16] describes a combination of a Random Forest with an attribute estimation method and an instance
filtering method. The proposed approach improved the results of the multi-class classification problem and provided an
accuracy of about one hundred percent. The reliability of the results was confirmed by using the algorithm on five control
data sets, where in each case the authors’ variant had advantages over the classic forest variant.
Individual versions of Random Forest modeling algorithms may require setting of basic parameters: the number of decision trees forming the ensemble and the number of split thresholds in each node. Assuming the existence of subclasses, vectors representing clusters in each class can be specified. The large dimension of the search space of these parameters prompted E. Elyan and M. M. Gaber [17] to use a genetic algorithm to optimize decisions. The success of the approach is confirmed by the improved accuracy of the results for several data sets from different applications.
The authors of this study previously proposed a new improvement in the class of Random Forest algorithms [18, 19]. It is based on applying the principles of the group method of data handling (GMDH) [20] to the training of decision trees, which makes it possible to obtain trees of optimal complexity. The algorithm was applied to recognize pathology on ultrasound images of the liver, where the classification accuracy reached 96 to 99%. However, the proposed modification had several shortcomings, which are overcome in this work.

RANDOM FOREST ALGORITHM OF OPTIMAL COMPLEXITY

A distinctive feature of the Random Forest algorithm is the use of the principle of bootstrap aggregating (bagging):
parallel training of different trees independently of each other with subsequent aggregation of models to obtain the result.
The purpose of bagging is to partially neutralize the variability of the forest model caused by the variability of the input data sample. By using the bootstrap, we can reduce the classification error on the test data. As a result, the model acquires the properties of generalization rather than mere approximation of the results.
The structure of the forest (the number of trees) is limited by the error on the test data sample. In the classic Random Forest implementation, these mechanisms are fundamental in preventing model overfitting. A number of mechanisms for optimizing the parameters, structure, and aggregation of forest trees to improve the results of classification problems are proposed below.

[Figure: a binary tree whose nodes test feature thresholds (e.g., x1 < 0.705, x2 < 1.444) and report the criterion values on the training and test samples (value_train, value_test); the leaf nodes reach value_train = value_test = 1.]

Fig. 1. An example of a tree of optimal complexity.

Construction of a Tree of Optimal Complexity. In standard versions of Random Forest, when modeling trees, each node applies a threshold value of an independent variable that is optimal according to a cost function such as the Gini index [21] or entropy [22].
To determine the thresholds for dividing the ranges of feature values, the authors propose to use the Matthews correlation coefficient (MCC) [23] as the most general form of the classification quality criterion: it takes into account all features of the problem and provides a balanced evaluation metric in the case of asymmetric filling of classes [23].
Determining the structure (variables in the nodes) and parameters (thresholds) of the tree consists in iterative
selection of independent variables and their thresholds that provide the maximum of the function
W = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} ,   (1)
where TP (True Positives) is the number of correct predictions of the first class, TN (True Negatives) is the number of correct predictions of the second class, FP (False Positives) is the number of wrong predictions of the first class (α-error or Type I error), and FN (False Negatives) is the number of incorrect predictions of the second class (β-error or Type II error).
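For clarity, a minimal Python sketch of criterion (1) follows; the function name and the convention of returning 0 when the denominator vanishes are our assumptions, not part of the original algorithm.

```python
import math

def matthews_w(tp: int, tn: int, fp: int, fn: int) -> float:
    """Criterion W from Eq. (1): the Matthews correlation coefficient.
    Returns 0 when any marginal sum is zero (an assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```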
The decision tree synthesis procedure is proposed to be implemented according to the principles of GMDH [20] using two samples: A (training) and B (testing). This inductive approach allows us to obtain models of optimal complexity that maximize the accuracy of the forecast on the test sample B.
In each tree node, the optimal threshold for every independent variable is calculated on sample A, and the selection of the variable is carried out on sample B. Thus, using W (1) as the cost function together with the inductive GMDH approach to tree learning ensures that the trees are of optimal complexity (Fig. 1).
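The node-construction step can be sketched as follows, assuming binary 0/1 labels, a "feature value below threshold predicts class 1" convention, and midpoint candidate thresholds; `matthews_w` is the function from the previous sketch, and all helper names are illustrative rather than the authors' implementation.

```python
import numpy as np

def w_score(pred, y):
    # W (Eq. 1) for boolean predictions `pred` against 0/1 labels `y`
    tp = int(np.sum(pred & (y == 1)))
    tn = int(np.sum(~pred & (y == 0)))
    fp = int(np.sum(pred & (y == 0)))
    fn = int(np.sum(~pred & (y == 1)))
    return matthews_w(tp, tn, fp, fn)

def select_node_split(XA, yA, XB, yB):
    """GMDH-style node: for each variable, the optimal threshold is fitted
    on subsample A, while the variable itself is selected by W on subsample B."""
    best_j, best_t, best_w = None, None, -np.inf
    for j in range(XA.shape[1]):
        vals = np.unique(XA[:, j])
        if vals.size < 2:
            continue
        cands = (vals[:-1] + vals[1:]) / 2            # candidate thresholds on A
        t = max(cands, key=lambda c: w_score(XA[:, j] < c, yA))
        w_b = w_score(XB[:, j] < t, yB)               # external criterion on B
        if w_b > best_w:
            best_j, best_t, best_w = j, t, w_b
    return best_j, best_t, best_w
```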
Figure 1 shows an example of one of the trees of optimal complexity obtained when solving the problem of
pathology recognition based on ultrasound images of the liver [18, 19]. Classification accuracy on the test sample ranged
from 94 to 97%.
Formation of a Forest of Trees of Optimal Complexity. The formation of a forest of trees of optimal complexity uses the principle of bagging (Fig. 2). The idea of the bagging algorithm was proposed by the statistician L. Breiman [24, 25] to improve the quality of forecasting by combining (aggregating) forecasts obtained on randomly generated different training data sets. Breiman also showed [26] that bagging yields better results for unstable procedures, which include neural networks and decision trees.

[Figure: bootstrap samples drawn from the training sample are used to train classifiers 1, ..., n in parallel; their predictions on the test data are aggregated into an ensemble classifier.]

Fig. 2. Illustrative example of bagging at work.

[Figure: the training sample is divided into n pairs of subsamples (A #i, B #i); each pair is used to build optimal-complexity tree #i, and the trees together form the forest.]

Fig. 3. A forest of trees of optimal complexity.

A key component of bagging is the “wisdom of the crowd”: the result is formed as a collective decision, not as the decision of an individual expert. The advantage of this approach is that, in joint prediction, the models compensate for each other’s errors as long as they do not make mistakes on the same objects.
A necessary condition for the effectiveness of this principle is the diversity and specialization of the models, which in the context of decision trees is provided by training on different data samples. Trees are sensitive to variation in the input data, so changes in the sample composition lead to a different tree structure.
To build trees of optimal complexity, bagging forms n training samples, which are divided into subsamples
according to the principle of GMDH: A and B (Fig. 3).
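A sketch of this sampling step is given below; the 50/50 division of each bootstrap draw into A and B is our assumption, since the paper does not state the ratio.

```python
import numpy as np

def make_gmdh_bags(X, y, n_trees, ratio_a=0.5, seed=None):
    """Draw n_trees bootstrap samples and split each into GMDH
    subsamples A and B (the A/B ratio is an assumed parameter)."""
    rng = np.random.default_rng(seed)
    bags, m = [], len(y)
    for _ in range(n_trees):
        idx = rng.integers(0, m, size=m)   # bootstrap: draw m indices with replacement
        cut = int(ratio_a * m)
        a, b = idx[:cut], idx[cut:]        # disjoint parts of the same draw
        bags.append(((X[a], y[a]), (X[b], y[b])))
    return bags
```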

TABLE 1. General View of Pairwise Comparison of Criteria

Criteria    y1       y2       ...    yn
y1          u1/u1    u2/u1    ...    un/u1
y2          u1/u2    u2/u2    ...    un/u2
...         ...      ...      ...    ...
yn          u1/un    u2/un    ...    un/un

Just like the Random Forest method, the proposed algorithm is aimed at forming an ensemble of decision trees, each of which is obtained on subsamples Ai and Bi that (Ai with Aj and Bi with Bj) only partially intersect for i ≠ j. Applying the principles of GMDH with the bagging approach, classification trees are formed, and a Random Forest of trees of optimal complexity (RFTOC) is obtained.
Another feature of the Random Forest is the generation of a random subspace of features for each tree [4], which to some extent increases the efficiency of forecasting. However, the question of the optimality of the built decision tree arises, since repeated runs of the algorithm yield different results. This problem is solved in this paper, similarly to [17], by using a genetic algorithm [27, 28].
The sequence of actions for solving the problem includes the following steps.
1. For given subsamples Ai and Bi, randomly generate k (user-specified) subsets of the independent variables.
2. Obtain k trained trees of optimal complexity.
3. Estimate the trees on the validation sample (which did not participate in training) by the value of the objective function W (1) of the genetic algorithm.
4. Check the stopping condition of the algorithm (the maximum of the objective function has been reached or the specified epoch limit has been exceeded).
4.1. If the stopping condition is met, the optimal subset of features for the ith tree is obtained.
4.2. Otherwise, apply the genetic operators (selection, crossover/mutation), returning a new generation of k subsets of independent features.
5. Repeat steps 2 and 3 until the stopping condition of step 4 is met.
This algorithm is performed for each ith tree; a sketch of the loop is given below. As a result, an RFTOC consisting of n trees with optimal subsets of independent variables is obtained.
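The sketch illustrates steps 1–5 for a single tree. The callables `train_tree` (fits a tree of optimal complexity on a feature mask) and `eval_w` (returns W (1) on the validation sample), as well as the population size, truncation selection, uniform crossover, and mutation rate, are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

def evolve_feature_subset(train_tree, eval_w, n_features, k=20, epochs=50, seed=None):
    """GA search for the optimal feature subset of one RFTOC tree (steps 1-5)."""
    rng = np.random.default_rng(seed)
    pop = rng.random((k, n_features)) < 0.5              # step 1: k random masks
    for _ in range(epochs):                              # step 4: epoch limit
        fitness = np.array([eval_w(train_tree(m)) for m in pop])  # steps 2-3
        order = np.argsort(fitness)[::-1]
        pop, fitness = pop[order], fitness[order]
        if fitness[0] >= 1.0:                            # step 4: W cannot exceed 1
            break
        parents = pop[: k // 2]                          # step 4.2: selection
        children = []
        for _ in range(k - len(parents)):
            p1, p2 = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n_features) < 0.5, p1, p2)  # crossover
            flip = rng.random(n_features) < 0.02                    # mutation
            children.append(np.logical_xor(child, flip))
        pop = np.vstack([parents, children])             # step 5: next generation
    return pop[0]                                        # best feature mask found
```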
Improvement of the Voting Function of the Forest. A natural way to increase the efficiency of the collective
decision-making mechanism is to use optimization procedures to form the voting function. It is proposed to improve the
integration of primary results by means of weighted voting using a multi-criteria decision-making method, namely,
the Saaty analytic hierarchy process [28, 29].
It is proposed to assign a weight to each RFTOC tree, which gives priority to the best models during the voting
procedure. The weighting coefficients are obtained using the mechanism of pairwise comparison (Table 1) of criteria
priorities (in this case, RFTOC trees).
Here, ui is the ordinal number of the ith model in the list of criteria ranked by model quality. After substituting the ui into the table of pairwise comparisons, the geometric mean of each row is calculated; the geometric means are then normalized by dividing each by their sum, which reduces them to the interval from 0 to 1.
The quality of a model can be determined from the value of the function W (1) on the validation sample. Let us give an example. Suppose an RFTOC of 11 trees has been constructed. On the validation sample, the values of W are as follows: {0.821, 0.766, 0.749, 0.747, 0.766, 0.745, 0.575, 0.808, 0.749, 0.821, 0.823}. The ordered series of ranks has the form {2, 4, 5, 6, 4, 7, 8, 3, 5, 2, 1}. After calculating the normalized geometric means by the analytic hierarchy process, we obtain the list of weight coefficients: {0.14, 0.093, 0.07, 0.047, 0.093, 0.047, 0.023, 0.116, 0.07, 0.14, 0.163}. These coefficients serve as the weights of the trees of optimal complexity during voting.
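A sketch of this weighting scheme under a literal reading of Table 1 (entry (i, j) equal to uj/ui, row geometric means, normalization by their sum) is shown below; since the paper's worked numbers may rest on a slightly different comparison scale, the exact output can differ from the list above.

```python
import numpy as np

def ahp_weights(ranks):
    """Tree weights from rank-based pairwise comparisons (Table 1).
    Entry (i, j) of the comparison matrix is u_j / u_i, where u_i is the
    rank of model i (1 = best); row geometric means are normalized to sum to 1."""
    u = np.asarray(ranks, dtype=float)
    m = u[None, :] / u[:, None]              # pairwise comparison matrix
    gm = m.prod(axis=1) ** (1.0 / len(u))    # geometric mean of each row
    return gm / gm.sum()                     # normalize to the interval [0, 1]

ranks = [2, 4, 5, 6, 4, 7, 8, 3, 5, 2, 1]    # the ordered series from the text
print(ahp_weights(ranks).round(3))
```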

The final classification follows from the additive convolution function F_ac:

F_{ac} = w_1 y_1 + w_2 y_2 + \cdots + w_n y_n ,   (2)

where y_i is the classification result (-1 or 1) obtained by the ith tree of the RFTOC.
The value of the function F_ac varies in the interval from -1 to 1. For F_ac < 0, the result of the classification is the first class; otherwise, it is the second class.
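In code, the voting step reduces to a weighted sum of the ±1 tree outputs; a minimal sketch (the class labels 1 and 2 follow the convention above):

```python
import numpy as np

def rftoc_vote(votes, weights):
    """Additive convolution (Eq. 2): `votes` holds the -1/+1 outputs of the
    n trees (one row per object), `weights` the AHP tree weights."""
    f_ac = np.asarray(votes, dtype=float) @ np.asarray(weights, dtype=float)
    return np.where(f_ac < 0, 1, 2)   # class 1 if F_ac < 0, otherwise class 2
```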

RESEARCH RESULTS

The developed RFTOC algorithm (implemented in the Python programming language) was used for classification of medical images in the task of pathology detection. Two image databases were used; they are described in more detail below.
Recognition of Liver Pathology. At the request of the state institution “Institute of Nuclear Medicine and Radiation Diagnostics of the NAMS of Ukraine,” a decision support system was developed for determining the norm–pathology state of a patient’s liver.
Specialists of the Institute provided a database of ultrasound examinations of the liver (Fig. 4) consisting of 163 images of the liver in a normal state and 154 images of pathology.
The pathologies are represented by images with signs of autoimmune hepatitis (53 images), Wilson’s disease (50 images), hepatitis B (four images), hepatitis C (11 images), steatosis (five images), and cirrhosis (15 images); 12 images were not attributed to a specific pathology but were also used in the study. The images were obtained using two different types of sensors:
· convex ultrasound probe — 152 images: 89 normal and 63 pathology (18 images of autoimmune hepatitis, 23 images of Wilson’s disease, five images of hepatitis C, five images of steatosis, three images of cirrhosis, and five images of unknown pathology);
· linear ultrasound probe — 94 images: 50 normal and 44 pathology (26 images of autoimmune hepatitis, seven images of Wilson’s disease, two images of hepatitis C, six images of cirrhosis, and three images of unknown pathology). To improve clarity, the experts additionally acquired 71 ultrasound images with the linear sensor in enhanced mode: 24 normal and 47 pathology (nine images of autoimmune hepatitis, 20 images of Wilson’s disease, four images of hepatitis B, four images of hepatitis C, six images of cirrhosis, and four images of unknown pathology).
Each ultrasound image was manually segmented by the Institute’s experts. The segmented areas (regions of interest) are considered the most informative for diagnosing pathology; they were further used as the objects for study and for training the classification system.
In total:
(1) 304 objects of the convex sensor (197 normal and 107 pathological images);
(2) 154 objects of the standard mode linear sensor (80 normal and 74 pathological images);
(3) 124 objects of the enhanced mode linear sensor (35 normal and 89 pathological images)
were used.
Thus, three data samples were formed, and for each of them the binary classification task (norm–pathology) was solved separately. The existing class imbalance in the samples of the convex sensor and of the linear sensor in enhanced mode should be taken into account in the construction of classification models.
The resulting image segmentation regions were not accompanied by metadata that could be used as features of the liver norm–pathology classes. To form a feature set, the authors applied texture analysis methods to all research objects (a description of the analysis technology can be found in [13, 18, 19]).
The calculated texture features were used in the following classification algorithms:
· logistic regression [30];
· adaptive boosting (AdaBoost) [31];
· Random Forest [4, 5];
· author’s RFTOC algorithm.

Fig. 4. Example of an ultrasound image of the liver (images are depersonalized).

The problem of differing feature scales was solved by bringing all features to a single scale from 0 to 1 using max-min normalization.
The most generalized models for all classification algorithms were obtained by dividing each sample into training (80% of the total sample), validation (10%), and test (10%) parts. The purpose of the validation and test samples is as follows: the optimal hyperparameters of each classifier are found on the validation sample, and the obtained models are then independently evaluated on the test sample.
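A sketch of this preprocessing and evaluation protocol follows; the shuffling and the guard for constant features are our assumptions.

```python
import numpy as np

def minmax_scale(X):
    """Max-min normalization: bring every feature to the scale [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / span

def split_80_10_10(X, y, seed=None):
    """Shuffle and divide a sample into training (80%), validation (10%),
    and test (10%) parts."""
    idx = np.random.default_rng(seed).permutation(len(y))
    a, b = int(0.8 * len(y)), int(0.9 * len(y))
    tr, va, te = idx[:a], idx[a:b], idx[b:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```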
The obtained models were evaluated by the criteria of accuracy (the proportion of correctly classified objects), F-score (the harmonic mean of precision and sensitivity), and MCC (1). The results of binary classification for the research objects of each sensor are presented in Table 2. The results of applying RFTOC with majority voting are given in parentheses.
Recognition of Ischemic Heart Disease. The problem of recognizing coronary heart disease was formulated by the state institution “M. M. Amosov National Institute of Cardiovascular Surgery of the NAMS of Ukraine.” Specialists of the institution provided 154 video recordings of speckle-tracking echocardiography (STE) for 56 patients with suspected coronary heart disease (in 16 of them, no abnormalities of myocardial kinematics were detected during the examination). An STE video (Fig. 5) demonstrates the human cardiac cycle and is used to determine the type of myocardial deformation.
This technology makes it possible to record an echocardiogram of the heart in B-mode in three informative projections (Fig. 5): the 4-chamber position, the 2-chamber position, and the 3-chamber position (longitudinal axis). Each position corresponds to a certain combination of segments of the left ventricle that lie in the basins of the main coronary arteries. STE was performed without a dobutamine test (56 people) and with a dobutamine test together with an echo stress test (38 people) in the cases where no abnormalities were detected at rest. Dobutamine doses were administered under the supervision of an anesthesiologist with the consent of the patients, and the test was immediately stopped at the slightest signs of discomfort and/or disturbances of cardiac activity.
The task is to develop classifiers for each echocardiography projection for the recognition of coronary heart disease from the provided video data streams. For data analysis, each video was divided into frames. Since an echocardiography video shows the complete cardiac cycle, it is logical to assume that the first frame corresponds to cardiac systole and the last to diastole. According to experts, these frames can be the most informative for diagnosing the heart. Based on this, it was decided to use them for the classification task (the total number of such frames was 308). Next, the three heart positions in B-mode were cut separately from each frame.

TABLE 2. Results of the Classification of Areas of Interest

Algorithm             Sample        Accuracy        F-score         MCC

Convex sensor
Logistic Regression   Training      0.835           0.821           0.645
                      Validation    0.733           0.729           0.464
                      Test          0.71            0.659           0.333
AdaBoost              Training      0.996           0.995           0.991
                      Validation    0.667           0.641           0.342
                      Test          0.71            0.635           0.319
Random Forest         Training      1               1               1
                      Validation    0.8             0.792           0.614
                      Test          0.742           0.631           0.441
RFTOC                 Training      1               1               1
                      Validation    1               1               1
                      Test          0.903 (0.867)   0.886 (0.788)   0.795 (0.693)

Standard-mode linear sensor
Logistic Regression   Training      0.894           0.894           0.789
                      Validation    0.667           0.661           0.327
                      Test          0.875           0.873           0.775
AdaBoost              Training      1               1               1
                      Validation    0.867           0.866           0.732
                      Test          0.75            0.75            0.5
Random Forest         Training      1               1               1
                      Validation    0.867           0.866           0.732
                      Test          0.875           0.873           0.775
RFTOC                 Training      1               1               1
                      Validation    1               1               1
                      Test          1 (0.875)       1 (0.873)       1 (0.775)

Enhanced-mode linear sensor
Logistic Regression   Training      1               1               1
                      Validation    0.917           0.874           0.775
                      Test          0.923           0.902           0.822
AdaBoost              Training      1               1               1
                      Validation    0.583           0.556           0.192
                      Test          0.615           0.575           0.158
Random Forest         Training      1               1               1
                      Validation    0.917           0.874           0.775
                      Test          0.615           0.381           -0.192
RFTOC                 Training      1               1               1
                      Validation    1               1               1
                      Test          1 (0.923)       1 (0.902)       1 (0.822)

Thus, three samples were formed, for each of which binary classification problems were solved separately.
The distribution of objects by norm–pathology classes is as follows: 116 normal frames and 192 pathology frames
(4-chamber position), 130 normal frames and 178 pathology frames (2-chamber position), and 92 normal frames and
216 pathology frames (longitudinal axis). The given data are also used in works [5, 32, 33].

Fig. 5. An example of the provided STE.

The following set of algorithms was used to solve the problem: logistic regression, adaptive boosting (AdaBoost), Random Forest, and the authors’ RFTOC algorithm. The results of the calculations are presented in Table 3. The results of applying RFTOC with majority voting are given in parentheses.

DISCUSSION OF RESULTS

The authors’ RFTOC algorithm demonstrated better classification efficiency than the other well-known algorithms in both the first and the second problem.
In the problem of recognizing liver pathology from ultrasound images, the classification accuracy of the RFTOC models varied from 90.3 to 100% on the test sample. Among the analogs, the best result was shown by the logistic regression models, whose accuracy varied from 71 to 92.3%. Table 2 shows that the maximum accuracy was achieved on the samples of the linear sensor in both modes due to weighted voting using the analytic hierarchy process. Achieving maximum accuracy on these samples can be explained by their small sizes. This is confirmed by the fact that on the larger sample of research objects of the convex sensor, the best test result (obtained by the RFTOC algorithm) was 90.3%.
In the problem of recognizing coronary heart disease from echocardiography video data, the classification accuracy of the RFTOC models varied from 83.3 to 90.3% on the test sample. The second most efficient results were shown by the Random Forest models, with accuracy varying from 77.4 to 90.3%. It is logical to assume that the deterioration of the results compared to the previous problem is related to the sample sizes; therefore, it is planned to refine the RFTOC algorithm and adapt the classifiers to an extended database in the future. Table 3 shows that the weighted voting approach was consistently better than majority voting (with the exception of the 4-chamber position echocardiography data sample).
The classification results were obtained with the following RFTOC parameters.
· The size of the feature ensemble used in forming the trees of optimal complexity was equal to the square root of the total number of features. A similar value is recommended for the classic Random Forest algorithm [11].
· The constructed RFTOC models consisted of 11 trees. This number was chosen because it gave the best average accuracy (forests of five, 15, and 21 trees of optimal complexity were also tested). It is recommended to choose an odd number of trees so that there are no tied decisions during voting [4].

TABLE 3. Frame Classification Results

Algorithm             Sample        Accuracy        F-score         MCC

4-chamber position
Logistic Regression   Training      0.733           0.721           0.442
                      Validation    0.733           0.683           0.373
                      Test          0.742           0.735           0.477
AdaBoost              Training      0.984           0.983           0.966
                      Validation    0.8             0.744           0.489
                      Test          0.774           0.765           0.533
Random Forest         Training      0.996           0.996           0.992
                      Validation    0.967           0.959           0.921
                      Test          0.839           0.832           0.667
RFTOC                 Training      1               1               1
                      Validation    1               1               1
                      Test          0.833           0.842           0.667

2-chamber position
Logistic Regression   Training      0.785           0.784           0.574
                      Validation    0.6             0.583           0.172
                      Test          0.742           0.735           0.47
AdaBoost              Training      0.996           0.996           0.992
                      Validation    0.7             0.67            0.342
                      Test          0.71            0.698           0.398
Random Forest         Training      1               1               1
                      Validation    0.8             0.785           0.569
                      Test          0.774           0.765           0.533
RFTOC                 Training      1               1               1
                      Validation    1               1               1
                      Test          0.846 (0.806)   0.889 (0.801)   0.735 (0.603)

Longitudinal axis
Logistic Regression   Training      0.749           0.722           0.453
                      Validation    0.667           0.625           0.313
                      Test          0.71            0.676           0.367
AdaBoost              Training      0.976           0.971           0.943
                      Validation    0.767           0.689           0.38
                      Test          0.774           0.748           0.513
Random Forest         Training      1               1               1
                      Validation    0.867           0.83            0.671
                      Test          0.903           0.868           0.766
RFTOC                 Training      1               1               1
                      Validation    1               1               1
                      Test          0.903 (0.871)   0.886 (0.8)     0.775 (0.71)

Optimization of these parameters is the subject of further improvement of the algorithm aimed at increasing its efficiency.

CONCLUSIONS

Based on the results of the research, a new ensemble learning algorithm, the “Random Forest of Trees of Optimal Complexity” (RFTOC), is proposed; it combines the approaches of several known methods: Random Forest, the group method of data handling, the genetic algorithm, and the analytic hierarchy process. The algorithm was used to solve problems of classification of pathologies on medical images. A comparison of the efficiency of the RFTOC algorithm with known analogs shows that it provides higher classification quality. The algorithm is not specific to image classification and can be used universally in diverse applications.

REFERENCES

1. I. H. Sarker, “Machine learning: Algorithms, real-world applications and research directions,” SN Comput. Sci.,
Vol. 2, Iss. 3, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x.
2. A. Mayr, H. Binder, O. Gefeller, and M. Schmid, “The evolution of boosting algorithms. From machine learning
to statistical modelling,” Methods Inf. Med., Vol. 53, No. 06, 419–427 (2014). https://doi.org/10.3414/
ME13-01-0122.
3. A. H. Osman and H. M. Aljahdali, “An effective of ensemble boosting learning method for breast cancer virtual
screening using neural network model,” IEEE Access, Vol. 8, 39165–39174 (2020). https://doi.org/10.1109/
ACCESS.2020.2976149.
4. T.-K. Ho, “Random decision forests,” in: Proc. 3rd Intern. Conf. on Document Analysis and Recognition
(Montreal, QC, Canada, 14–16 August 1995), Vol. 1, IEEE (1995), pp. 278–282. https://doi.org/10.1109/
ICDAR.1995.598994.
5. Ie. Nastenko, V. Maksymenko, S. Potashev, V. Pavlov, V. Babenko, S. Rysin, O. Matviichuk, and V. Lazoryshinets,
“Random forest algorithm construction for the diagnosis of coronary heart disease based on echocardiography video
data streams,” Innov. Biosyst. Bioeng., Vol. 5, No. 1, 61–69 (2021). https://doi.org/10.20535/ibb.2021.5.1.225794.
6. B. Pavlyshenko, “Using stacking approaches for machine learning models,” in: 2018 IEEE Second Intern. Conf. on Data Stream Mining & Processing (DSMP) (Lviv, Ukraine, August 21–25, 2018), IEEE (2018), pp. 255–258. https://doi.org/10.1109/DSMP.2018.8478522.
7. S. Indolia, A. K. Goswami, S. P. Mishra, and P. Asopa, “Conceptual understanding of convolutional neural
network — a deep learning approach,” Procedia Comput. Sci., Vol. 132, 679–688 (2018). https://doi.org/10.1016/
j.procs.2018.05.069.
8. J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, “Recent
advances in convolutional neural networks,” Pattern Recognition, Vol. 77, 354–377 (2018). https://doi.org/10.1016/
j.patcog.2017.10.013.
9. A. Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)
network,” Physica D: Nonlinear Phenomena, Vol. 404, 132306 (2020). https://doi.org/10.1016/j.physd.2019.132306.
10. C. S. Bojer and J. P. Meldgaard, “Kaggle forecasting competitions: An overlooked learning opportunity,” Int. J.
Forecast., Vol. 37, Iss. 2, 587–603 (2021). https://doi.org/10.1016/j.ijforecast.2020.07.007.
11. T. Gururaj, Y. M. Vishrutha, M. Uma, D. Rajeshwari, and B. K. Ramya, “Prediction of lung cancer risk using random forest algorithm based on Kaggle data set,” Int. J. Recent Technol. Eng., Vol. 8, Iss. 6, 1623–1630 (2020). https://doi.org/10.35940/ijrte.F7879.038620.

12. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, Vol. 42, 60–88 (2017). https://doi.org/10.1016/j.media.2017.07.005.
13. Ie. Nastenko, V. Pavlov, O. Nosovets, V. Kruglyi, M. Honcharuk, A. Karliuk, D. Hrishko, O. Trofimenko, and
V. Babenko, “Texture analysis application in medical images classification task solving,” Biomedical Engineering
and Technology, No. 4, 69–82 (2020). https://doi.org/10.20535/2617-8974.2020.4.221876.
14. Y. Cosgun, A. Yildirim, M. Yucel, A. E. Karakoc, G. Koca, A. Gonultas, G. Gursoy, H. Ustun, and M. Korkmaz, “Evaluation of invasive and noninvasive methods for the diagnosis of helicobacter pylori infection,” Asian Pac. J. Cancer Prev., Vol. 17, No. 12, 5265–5272 (2016). https://doi.org/10.22034/APJCP.2016.17.12.5265.
15. M. Norouzi, M. D. Collins, D. J. Fleet, and P. Kohli, “CO2 Forest: Improved random forest by continuous optimization of oblique splits,” arXiv:1506.06155v2 [cs.LG] (2015). https://doi.org/10.48550/arXiv.1506.06155.
16. A. Chaudhary, S. Kolhe, and R. Kamal, “An improved random forest classifier for multi-class classification,” Inf.
Process. Agric., Vol. 3, Iss. 4, 215–222 (2016). https://doi.org/10.1016/j.inpa.2016.08.002.
17. E. Elyan and M. M. Gaber, “A genetic algorithm approach to optimising random forests applied to class
engineered data,” Inf. Sci., Vol. 384, 220–234 (2017). https://doi.org/10.1016/j.ins.2016.08.007.
18. I. Nastenko, V. Maksymenko, I. Dykan, O. Nosovets, B. Tarasiuk, V. Pavlov, V. Babenko, V. Kruhlyi,
V. Soloduschenko, M. Dyba, and V. Umanets, “Liver pathological states identification in diffuse diseases with
self-organization models based on ultrasound images texture features,” in: 2020 IEEE 15th Intern. Conf. on
Computer Sciences and Information Technologies (CSIT) (Zbarazh, Ukraine, September 23–26, 2020), Vol. 2,
IEEE (2020), pp. 21–25. https://doi.org/10.1109/CSIT49958.2020.9321999.
19. I. Nastenko, V. Maksymenko, A. Galkin, V. Pavlov, O. Nosovets, I. Dykan, B. Tarasiuk, V. Babenko, V. Umanets,
O. Petrunina, and D. Klymenko, “Liver pathological states identification with self-organization models based on
ultrasound images texture features,” in: N. Shakhovska and M. O. Medykovskyy (eds.), Advances in Intelligent
Systems and Computing V, CSIT 2020; Advances in Intelligent Systems and Computing, Vol. 1293, Springer,
Cham (2021), pp. 401–418. https://doi.org/10.1007/978-3-030-63270-0_26.
20. L. Anastasakis and N. Mort, “The development of self-organization techniques in modelling: A review of the
group method of data handling (GMDH),” Research Report No. 813, University of Sheffield, United Kingdom
(2001). URL: https://gmdhsoftware.com/GMDH_%20Anastasakis_and_Mort_2001.pdf.
21. E. Furman, Y. Kye, and J. Su, “Computing the Gini index: A note,” Economics Letters, Vol. 185, 108753 (2019).
https://doi.org/10.1016/j.econlet.2019.108753.
22. X. Dong, M. Qian, and R. Jiang, “Packet classification based on the decision tree with information entropy,”
J. Supercomput., Vol. 76, Iss. 6, 4117–4131 (2020). https://doi.org/10.1007/s11227-017-2227-z.
23. D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in
binary classification evaluation,” BMC Genomics, Vol. 21, No. 1, 6 (2020). https://doi.org/10.1186/s12864-019-6413-7.
24. L. Breiman, “Bagging predictors,” Technical Report No. 421, University of California, Department of Statistics,
Berkeley, California (1994).
25. L. Breiman, “Random forests,” Mach. Learn., Vol. 45, Iss. 1, 5–32 (2001). https://doi.org/10.1023/A:1010933404324.
26. L. Breiman, “Bagging predictors,” Mach. Learn., Vol. 24, Iss. 2, 123–140 (1996). https://doi.org/10.1007/BF00058655.
27. D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley Longman
Publishing Co., Inc., Boston (1989).
28. O. Nosovets, V. Babenko, I. Davydovych, O. Petrunina, O. Averianova, and L. D. Zyonh, “Personalized clinical
treatment selection using genetic algorithm and analytic hierarchy process,” Adv. Sci. Technol. Eng. Syst. J.,
Vol. 6, No. 4, 406–413 (2021). https://doi.org/10.25046/aj060446.
29. T. L. Saaty, Decision Making for Leaders: The Analytic Hierarchy Process for Decisions in a Complex World,
RWS Publications, Pittsburgh (1990).

30. S. Sperandei, “Understanding logistic regression analysis,” Biochem. Med., Vol. 24, Iss. 1, 12–18 (2014). https://doi.org/10.11613/BM.2014.003.
31. J. Žižka, F. Dařena, and A. Svoboda, “Adaboost,” in: Text Mining with Machine Learning, CRC Press, Boca Raton (2019), pp. 201–210. https://doi.org/10.1201/9780429469275-9.
32. O. Petrunina, D. Shevaga, V. Babenko, V. Pavlov, S. Rysin, and I. Nastenko, “Comparative analysis of classification
algorithms in the analysis of medical images from speckle tracking echocardiography video data,” Innov. Biosyst.
Bioeng., Vol. 5, No. 3, 153–166 (2021). https://doi.org/10.20535/ibb.2021.5.3.234990.
33. Ie. Nastenko, V. Maksymenko, S. Potashev, V. Pavlov, V. Babenko, S. Rysin, O. Matviichuk, and V. Lazoryshinets,
“Group method of data handling application in constructing of coronary heart disease diagnosing algorithms,”
Biomedical Engineering and Technology, No. 5, 1–9 (2021). https://doi.org/10.20535/2617-8974.2021.5.227141.
