Abstract—The paper presents the main research results in the area of data mining application to medicine. We
propose a new information technology for data mining of different classes of biomedical images, based on a
methodology for selecting diagnostically relevant information and constructing informative features.
Applying Big Data technology in the proposed medical diagnostic systems improved the quality of the
learning set and reduced the classification error. Based on these results, we conclude that using many
heterogeneous sources of diagnostic information improved the overall quality of diagnostics.
ISSN 1054-6618, Pattern Recognition and Image Analysis, 2018, Vol. 28, No. 1, pp. 114–121. © Pleiades Publishing, Ltd., 2018.
PARTICULAR USE OF BIG DATA 115
[Fig. 1: data and content growth over the coming decade, from 800,000 petabytes in 2009 to 35 zettabytes in 2020; storage growth, total data of healthcare providers (PB), 2010–2015, by category: admin, imaging, EMR, email, file, non-clinical imaging, research.]
5]. The above-mentioned methods may describe data mining with varying degrees of accuracy. Under modern conditions, as the amount of information grows rapidly, its structure becomes more complicated.
One approach that helps solve the tasks of data mining and diagnostic result interpretation is the Data Mining technique. It is used to identify and analyze relationships in arrays of semi-structured information and to build models that describe the behavior of complex systems. Data Mining means the search for and detection, by a “mechanism” (algorithms, artificial intelligence tools), of knowledge in raw data that was previously unknown, is practically useful, and can be interpreted by a human being [5].
2. MEDICINE AS A SOURCE OF BIG DATA
Modern medicine is one of the most high-tech fields of scientific and practical activity, and one of its highest priorities is the development of new, efficient techniques for the early diagnosis of various health problems. Recent decades have been characterized by a significant breakthrough in the technological infrastructure of medicine. Computer image analysis has become a basic tool of medical diagnostic systems and considerably increases the quality of diagnostics.
Medicine is one of the few fields that accumulate data in huge quantities (Fig. 1), at high velocity, and in very diverse formats: numerical data (test results), texts (medical notes), video (e.g., ultrasound examinations), images (tomography, X-ray photography), technical signals from electrical recording equipment (electrocardiography, electroencephalography), etc. According to projections, the data volume will reach 35 zettabytes by 2020, a 44-fold increase over 2009.
Most of the data growth comes from unstructured data (medical imaging, video, texts, speech). About 90% of all medical data is now considered to be unstructured. Public health data are usually held in different departments, in various formats, from several sources and different clinical systems (80% being electronic medical histories and medical images from computed tomography (CT) or magnetic resonance tomography (MRT)).
The Institute for Health Technology Transformation has shown that the human body is an inexhaustible source of big data [6]. Image archive volumes in medicine grow by 20–40% annually: one 3D X-ray computed tomography (3D CT) scan takes about 1 GB; one 3D magnetic resonance tomography (3D MRT) scan, about 150 MB; one X-ray image, about 30 MB; and one mammogram, about 120 MB. There is also a clear trend of rapid growth in the number of wearable devices that are worn on the patient’s body and capture information online. About 500 million such devices are expected to be in use worldwide by 2018. Expanding the body of diagnostic information makes it possible to considerably increase the reliability of diagnosis of human diseases in personalized medicine. The objective is to quantify ever more precisely the capabilities of different approaches to efficient diagnostics.
The main objective of the research currently conducted at the Image Processing Systems Institute, a branch of the Federal Scientific Research Centre “Crystallography and Photonics” of the Russian Academy of Sciences (RAS), under the leadership of RAS Academician V.A. Soifer, is the development of computer techniques for remote high-performance processing, analysis, and interpretation of medical diagnostic images in order to identify cause-and-effect relationships in changes of diagnostic information of different image classes for various types of diseases. The relevance of this research also stems from the importance of early diagnosis, prediction of the course of a disease, and selection of an optimal therapeutic approach. Late diagnosis or late interpretation of changes often significantly reduces the effectiveness of treatment and disease prevention. Currently used methods of recording and formally describing a patient’s status do not always give the aggregate picture of factors required for a proper diagnosis. There is an urgent need to introduce new diagnostic techniques for various diseases into clinical practice [7, 8].
The Image Processing Systems Institute (IPSI) studies the following image classes: images of the human vascular system for early diagnosis of diabetic retinopathy; bone tissue X-ray images (femoral neck fracture imaging) to diagnose osteoporosis; ultrasound images of the renal system to diagnose pyelonephritis; and computed tomography scans of the lungs to diagnose chronic obstructive pulmonary disease (emphysema).
Research on diagnostic images comprises three stages of data processing, shown in Fig. 2: processing of biomedical signals, data analysis, and visualization of results.
Big data mining of the specified image classes, performed at IPSI RAS to identify regular changes in diagnostic information in medical images for various types of diseases, includes the use of new mathematical methods and algorithms for distributed processing and recognition of biomedical images for remote diagnostic systems. A common approach to the analysis of different image classes has been proposed, based on evaluating aggregate geometric and texture parameters of selected regions of interest, which serve as a basic feature set for further diagnostic analysis [9].
As integrated indices of the state of the fundus vessels and coronary vessels, a global set of geometric features is used that characterizes diagnostic images sufficiently completely and allows efficient diagnosis of vascular malformations [10, 11]. As integrated indices of the state of crystallogram images of biological fluids, a set of texture features is proposed that enables efficient diagnosis of inflammatory diseases.
For renal ultrasound, bone X-ray, and lung CT images, polynomial features are suggested that are consistent with the textural properties of these image classes [12].
An efficient feature space technique has been developed to analyze diagnostic images, based on big data mining of unstructured information using methods of statistical analysis [13–18]. The informativeness of features is assessed by discriminative analysis using separability criteria that depend neither on the distribution of objects across classes nor on the classifier used.
Remote high-performance processing, analysis, and interpretation of images to identify the main relationships rely on the Hadoop Big Data technology. Its use is justified by the large arrays of semi-structured information generated by standard software and hardware used for medical diagnostics. Hadoop not only reduces the time of data pre-processing and processing for imaging systems, but also considerably enhances the capabilities for extracting new information from semi-structured or completely unstructured data. To store and process large volumes of unstructured information efficiently when mining large images, we rely on methods of parallel processing and distributed storage, applying the MapReduce software infrastructure for distributed computations. The technique is also used to optimize existing data-processing operations, significantly reduces storage and processing costs, and ensures highly efficient data handling [19].
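The MapReduce pattern mentioned above can be illustrated with a minimal in-process sketch. It is not the authors' implementation and does not use the Hadoop API: the record layout (an image class label paired with one scalar texture feature), the function names, and the sample values are all assumptions made for illustration. The map step emits key-value pairs per image record, the shuffle step groups them by key, and the reduce step aggregates each group into a mean feature value per image class.

```python
from collections import defaultdict

# Hypothetical records: (image class, scalar texture feature). Values are made up.
records = [
    ("lung_ct", 0.5),
    ("lung_ct", 1.5),
    ("bone_xray", 2.0),
    ("bone_xray", 4.0),
    ("bone_xray", 3.0),
]


def map_phase(record):
    """Map step: emit (key, (feature, count)) for one image record."""
    image_class, feature = record
    yield image_class, (feature, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine partial sums and counts into a mean per key."""
    total = sum(feature for feature, _ in values)
    count = sum(n for _, n in values)
    return key, total / count


# In a real cluster the map and reduce calls would run on different nodes;
# here they run sequentially in one process.
pairs = [pair for record in records for pair in map_phase(record)]
means = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(means)
```

Because each map call depends on a single record and each reduce call on a single key, both steps can be distributed across machines, which is the property the text relies on for large image collections.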
[...] recommendations on how to use different groups of features in medical practice.
The information technology of diagnostic image data mining involves the following methods and algorithms:
– a method for automatic selection of regions of interest in diagnostic images using the region-growing segmentation algorithm. It reduces the likelihood of incorrect recognition in lung CT diagnostics by considering only diagnostically important image areas;
– an algorithm for calculating polynomial features that conform to the textural properties of diagnostic grayscale images while imposing natural restrictions on the physical feasibility of calculating quadratic features, which are significant in practice; their use can further reduce incorrect detection in the diagnostics of bone-tissue X-ray images [22, 23];
– a method and algorithm for increasing the informative value of features based on discriminative analysis and optimal sampling for training an expert system for disease diagnosis [14, 17, 18];
– an algorithm for reducing the feature space dimension for medical radiological images. A separate problem is the dimension of the feature space, which is optimized while matching features to the textural properties of halftone training-sample images. To solve the optimization problem effectively for such a large number of parameters, a large number of training images is needed. A fixed sample of hundreds of prepared images is not enough: a constantly updated sample of thousands of images is needed, which cannot be obtained or computerized in a single clinic. Instead, a distributed software environment is built that allows us to store and process large, constantly updated sets of diagnostic images at the same time;
– a method of optimal sampling for training the expert system for diagnosis of vascular malformations, based on excluding discordant observations. Based on hypothesis testing for discordant observations, we have developed an optimal sampling algorithm that improved the accuracy of disease diagnosis. To remove invalid data and obtain standard feature values for each pathology group, cluster analysis tools are used to filter the training sample.
Problem-oriented software systems have been developed for the analysis of medical diagnostic images to detect pathological changes, including software tools for quantitative estimation of the degree of pathology based on expert evaluations and the proposed classification methods: a software system for the analysis of lung computed tomography, and computer systems for the analysis of diagnostic images of the fundus vessels (OphthalmOffice) and the coronary vessels (CardiOffice). The software systems allow the user to control the analytics and decision-making processes, analyze subclinical morphological changes of pathomorphological elements, computerize diagnostic steps, and carry out quantitative monitoring of pathological changes in diagnostic samples. A special feature is the use of elements of expert systems: a database of diagnostic features; correlation, discriminative, and cluster analysis of the feature space; and prognosis of the degree of pathology based on expert assessments.
The classification and diagnostic testing system [24] (Figs. 3, 4) provides tools for correlation and discriminative analysis to form an informative feature space, tools for optimal sampling based on separability criteria for pathology groups, and tools for cluster analysis to filter the training sample in order to remove invalid data and obtain standard feature values for each pathology group.
A data mining subsystem gives the user the estimated degree of pathology, standard feature values for each degree of pathology, and a prognosis of the possible development of the disease, and presents the diagnostic decisions made.
The use of Big Data technology in the developed medical diagnostic systems, by drawing on diverse sources of diagnostic information and larger amounts of data, has made it possible to improve the training sample and reduce classification errors, which raised diagnostic accuracy to as much as 95% (Table 1).
Research on various image classes showed that increasing the number of diverse diagnostic data sources considered, together with concurrent consideration of various aspects of big data usage, allowed us to reveal new trends that influence the diagnosis of diseases. Although the data are different, they share some common characteristics that are typical not so much of medicine as of data mining. Taking these characteristics into account in data handling procedures improved the quality of diagnosis.
Table 1. Changes in separability criteria and diagnostic accuracy for various types of diagnostic data
Data set                        Rise in class separability    Reliability
Bone X-rays                     8%                            0.9
Ultrasonography of kidneys      23%                           0.87
Computed tomography of lungs    17%                           0.95
Blood vessels                   21%                           0.94
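The link between filtering the training sample and the rise in class separability reported in Table 1 can be sketched in a few lines. The paper does not state which separability criterion or which discordance test it uses, so this example makes two assumptions: a Fisher-type ratio of between-class to within-class scatter for a single scalar feature, and a simple rule that drops values lying more than two standard deviations from their class mean. The function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def fisher_ratio(classes):
    """Between-class scatter divided by within-class scatter for one feature.

    classes maps a class label to a list of feature values. This is an
    assumed criterion, chosen because it does not depend on any classifier.
    """
    all_values = [v for values in classes.values() for v in values]
    grand_mean = mean(all_values)
    between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in classes.values()
    )
    within = sum(
        (v - mean(values)) ** 2
        for values in classes.values()
        for v in values
    )
    return between / within


def drop_discordant(values, z_max=2.0):
    """Illustrative outlier rule: keep values within z_max std devs of the mean."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return list(values)
    return [v for v in values if abs(v - m) <= z_max * s]


# Synthetic training sample with one discordant observation per class.
samples = {
    "normal": [1.0, 1.1, 0.9, 1.2, 0.8, 3.0],
    "pathology": [2.0, 2.1, 1.9, 2.2, 1.8, 0.2],
}
filtered = {label: drop_discordant(values) for label, values in samples.items()}

before = fisher_ratio(samples)
after = fisher_ratio(filtered)
rise_percent = 100.0 * (after - before) / before
print(f"separability before: {before:.3f}, after: {after:.3f}")
print(f"relative rise: {rise_percent:.0f}%")
```

Removing the two discordant values tightens each class around its mean, so the within-class scatter falls and the criterion rises. The size of the rise here reflects the synthetic data only and says nothing about the percentages in Table 1.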
[Figure (block diagram). Labels: User; Sample; Discriminative analysis criteria; Filtered sample; Classification result; SAMPLE FILTERING SYSTEM; Modality criteria calculation for sample histograms; DATA ANALYSIS SYSTEM; Features space clusterization; Data mining; Standard values generation; Processing result.]