
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2986259, IEEE Access

SPECIAL SECTION ON MACHINE LEARNING DESIGNS, IMPLEMENTATIONS AND TECHNIQUES

Received January 16, 2020, accepted January 26, 2020, date of publication January 29, 2020, date of current version February 6, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2970178

A Novel Software Engineering Approach Toward Using Machine Learning for Improving the Efficiency of Health Systems
MOHAMMED MOREB¹, TAREQ ABED MOHAMMED², OGUZ BAYAT¹, AND OGUZ ATA¹
¹Graduate School of Science and Engineering, Altinbas University, 34217 Istanbul, Turkey
²Tareq Abed Mohammed, College of Computer Science and Information Technology, University of Kirkuk, Kirkuk 36001, Iraq

Corresponding author: Mohammed Moreb (mahammed.moreb@ogr.altinbas.edu.tr)

ABSTRACT Recently, machine learning has become a hot research topic. Therefore, this study investigates
the interaction between software engineering and machine learning within the context of health systems.
We proposed a novel framework for health informatics: the framework and methodology of software
engineering for machine learning in health informatics (SEMLHI). The SEMLHI framework includes four
modules (software, machine learning, machine learning algorithms, and health informatics data) that organize
the tasks in the framework using a SEMLHI methodology, thereby enabling researchers and developers
to analyze health informatics software from an engineering perspective and providing developers with a new
road map for designing health applications with system functions and software implementations. Our novel
approach sheds light on its features and allows users to study and analyze the user requirements and determine
both the function of objects related to the system and the machine learning algorithms that must be applied to
the dataset. Our dataset used in this research consists of real data and was originally collected from a hospital
run by the Palestine government covering the last three years. The SEMLHI methodology includes seven
phases: designing, implementing, maintaining and defining workflows; structuring information; ensuring
security and privacy; performance testing and evaluation; and releasing the software applications.

INDEX TERMS Health dataset analysis, machine learning, methodology, software development management, software engineering.

The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Aljawarneh.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
The field of health informatics (HI) aims to provide a large-scale linkage among disparate ideas. Normally, a healthcare dataset is found to be incomplete and noisy; as a result, reading data from dataset linkage traditionally fails within the discipline of software engineering. Machine learning (ML) is a rapidly maturing branch of computer science, helped by the ability to store data on a large scale. Many ML tools can be used to analyze data and yield knowledge that can improve the quality of work for both staff and doctors; however, for developers, there is currently no methodology that can be used. Regarding software engineering, there has been a lack of approaches to evaluating which software engineering tasks are better performed by automation and which require human involvement or human-in-the-loop approaches [1]. Big data poses many analysis challenges for real-world data [2], including OLAP mass data, mass data protection, mass data survey and mass data dissemination.
Recently, a set of frameworks has been used to develop data analysis tools such as Win-CASE [3] and SAM [4]. The market has a vast number of data analysis tools that can discover interesting patterns and hidden relationships to support decision makers [5]. BKMR used the R package as a statistical approach on health effects to estimate the multivariable exposure-response function [6].
Augmentor included the Python image library for augmentation [7], while for the visualization of medical treatment plans and patient data, CareVis was used [8], as it was designed for this task. Other applications require a visual interface using COQUITO [9]. For health-care data analytics, the widely known 3P tools [10] were used. Many simple applications, such as WEKA, which provides a GUI for many machine learning algorithms [11], and Apache Spark, which is used as a cluster computing framework [12], are powerful


systems that can be used in various applications for solving problems using big data and machine learning [13]. Table 1 summarizes the main tools used for big data analytics according to the task. Software engineering for machine learning applications (SEMLA) discusses the challenges, new insights, and practical ideas regarding the engineering of ML and artificial intelligence (AI) [14]. NSGA-II proposed algorithms for real-world applications that include more than one objective function for enhancing performance in terms of both diversity and convergence [15]. ML algorithms in clinical genomics generally come in three main forms: supervised, unsupervised and semi-supervised [16]. Interflow system requirement analysis (ISRA) has been used to determine the system requirements [17].

TABLE 1. Big data analytics tools according to the task.

Electronic healthcare (eHealth) frameworks have replaced traditional medical frameworks to improve mobile healthcare (mHealth) and enable patient-to-physician and patient-to-patient interactions to achieve improved healthcare and quality of life (QoL) [18]. Big data and IoT have been used for improving the efficiency of m-health systems by predicting potential life-threatening conditions during the early stages [19]. Intelligent IoT eHealth solutions enable healthcare professionals to monitor health-related data continuously and provide real-time actionable insights used to support decision making [20].
Machine learning is a field of computer science that frequently uses statistical techniques to enable computers to “learn” from stored datasets. Unsupervised learning, or data mining, focuses more on exploratory data analysis and is known as learning supported by data analytics. Patient laboratory test queue management and wait time prediction are challenging and complicated jobs. Because each patient might require different phase operations (tasks), such as a check-up, various tests, e.g., a sugar level test or blood test, X-rays or surgery, each task can involve a different number of medical tests, from 0 to N, for each patient according to their condition.
In this article, based on a grounded theory methodology [21], the researchers proposed a novel methodology, SEMLHI, in developing a framework by defining the research problem and methodology for the developers. The SEMLHI framework includes a theoretical framework to support research and design activities that incorporate existing knowledge. The SEMLHI framework was composed of four components that help developers observe the health application flow from the main module to submodules to run and validate specific tasks. This enables multiple developers to work on different modules of the application simultaneously. The SEMLHI framework supports the methodological approach to conducting research on health informatics. It also supports a structure that presents a common set of ML terminology to use, compare, measure, and design software systems in the area of health. This creates a space whereby SE and ML experts can work on a specific methodological approach to enable health informatics software development teams to integrate the ML model lifecycle. Our methodology was applicable to current systems or in the development of new systems that use the ML module for current systems, which can be used in regular updates to add data to the system, to perform irregular updates and to add new features such as new versions of ICD diagnosis codes, minor model improvements for bug fixes, new functionalities required by the client, and new hardware or architectural constraints.

II. METHODS AND DISCUSSION
The study is based on original data collected from a hospital run by the Palestine government covering the past three years. First, the data were validated, and all outliers were removed. Then, the remaining data were analyzed using the developed framework to compare ML techniques that predict laboratory test results. Three systems engineering methods were compared: Vee, Agile and the proposed SEMLHI. The results were used to implement the prototype system, which requires a machine learning algorithm. After the development phase, a questionnaire was delivered to the developer to indicate the results of using the three methodologies.
The SEMLHI framework was composed of four components: software, machine learning model, machine learning algorithms, and health informatics data. The machine learning algorithm component uses five algorithms to evaluate the accuracy of the machine learning models for various components.
We used the original data as the selected dataset to develop a model that predicts patients' laboratory test results, and the patient was required to perform more than one test. In this article, we focus on helping patients and doctors complete their treatment tasks by using predictable test results based on the International Classification of Diseases [22] (ICD-10) and helping hospitals save time and reduce effort dedicated to medical testing. Using the SEMLHI framework, realistic patient data were analyzed carefully and rigorously based on important parameters such as age, start time, end time, patient treatment, and detailed treatment content for each task. We identified the laboratory tests required for patients based on their conditions and the operations performed during treatment. The patient data included only codified variables, including ICD-10 codes, procedure codes, and medication orders, often reduced to smaller subsets.
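The validation of codified patient records described above could be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the `TreatmentTask` record, its field names, and the plausibility rules (an age between 0 and 120, an end time not earlier than the start time, a non-empty ICD-10 code) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TreatmentTask:
    """One treatment task of a patient (hypothetical schema for illustration)."""
    patient_no: int
    age: int
    icd10_code: str                 # codified diagnosis, e.g. "E11"
    start_time: datetime
    end_time: datetime
    lab_tests: list[str] = field(default_factory=list)  # 0..N tests per task


def is_valid(task: TreatmentTask, max_age: int = 120) -> bool:
    """Return True if the record passes the assumed plausibility checks."""
    if not 0 <= task.age <= max_age:
        return False                # negative or implausibly large age
    if task.end_time < task.start_time:
        return False                # task cannot end before it starts
    return bool(task.icd10_code.strip())  # diagnosis code must be present


def validate(tasks: list[TreatmentTask]) -> list[TreatmentTask]:
    """Keep only the records that pass every check."""
    return [t for t in tasks if is_valid(t)]


tasks = [
    TreatmentTask(1, 34, "E11", datetime(2019, 1, 5, 9, 0),
                  datetime(2019, 1, 5, 9, 40), ["glucose"]),
    # Negative age: rejected.
    TreatmentTask(2, -3, "I10", datetime(2019, 1, 5, 10, 0),
                  datetime(2019, 1, 5, 10, 20)),
    # End time before start time: rejected.
    TreatmentTask(3, 61, "J45", datetime(2019, 1, 6, 8, 0),
                  datetime(2019, 1, 6, 7, 30)),
]
valid_tasks = validate(tasks)
```

Keeping each rule as a separate early return makes it easy to add or remove checks as the hospital's data dictionary changes.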


A. DATASET AND PREPROCESSING
The application delivery of applied ML models in healthcare was often hampered by the existence of isolated product deployments with poorly developed architectures and limited or nonexistent maintenance plans. The “Translating Research into Agile Development” (TRIAD) method presents a five-step method for designing a tailored EHR tool [23].
The SEMLHI models and methodology were developed by including new software systems connected to real datasets and presented knowledge from the data using ML algorithms to improve the efficiency of the required system. The dataset case studies discussed in this article were set within the context of Palestine hospitals and centers. Three hospitals and nine medical centers were used for our dataset. Figure 1 illustrates the summary of bed rates per 1k patients distributed across 12 cities. Furthermore, data collection was conducted over the last three years, and 458k patients were identified with corresponding patient numbers. Overall, for the PMC dataset, 141k patients with 1.63% missing, a mean of 1.08M, a std dev of 554k, a min of 10k, and a max of 1.04M were included. For the age label, 141k patients with 1.63% missing, a mean of 32.24, a std dev of 26.25, a min of 0, a max of 88, and a median of 29 were considered.

FIGURE 1. Bed rate for patients by city.

B. AVAILABLE FEATURES OF PATIENTS
The patient dataset included 457,914 cases and nine tables. Each table had different features, and many techniques could be implemented, such as semantic coordination for intelligent databases [24], feature selection problems using genetic algorithms [25], and new gene-weight mechanisms [26]. Some features were connected with other tables to build datasets describing the main attribute, as mentioned in the next section.
The laboratory test data include 200,000 cases (rows); each case has basic attributes such as the patient number, gender, age, department, diagnosis code, description, and date of the lab test. Figure 2 summarizes the main features used on the sample dataset from all data, along with the distribution, center value and dispersion. The data are represented as $L = \{l_1, l_2, l_3, l_4, \ldots, l_n\}$, where $l$ is the item of the laboratory test reports, and $n$ is the number of items.

FIGURE 2. Main features used on the sample dataset.

III. CONCEPTUAL FRAMEWORK AND DESIGN
A. SEMLHI METHODOLOGY
The SEMLHI methodology is used in software development in the health area. For traditional applications, the development process includes many methodologies, such as the waterfall methodology, spiral methodology, and agile methodology, which can be used to define and develop the software. Table 2 illustrates the results of the comparison between our methodology, the Vee [27] methodology, and the Agile [28] methodology.

TABLE 2. Comparison of the three system engineering methods, Vee, Agile and SEMLHI, proposed in this article.

The SEMLHI framework methodology describes in detail the process that was used when developing health software and the mechanism used to integrate and use ML algorithms with the development software. The SEMLHI methodology provides a developer with a new road map for designing health applications with system functions and software implementation. This framework includes ten stages, starting from


defining the problem until reaching the stage of development and ending with the results, as explained in the next section.

FIGURE 3. SEMLHI framework users.

FIGURE 4. SEMLHI conceptual framework.

To develop the HI system, developers follow a sequence of steps: design (encode data, define outliers and clean the data), implement (verification and validation), maintain defined workflows, structure information, provide security and privacy, test the performance, and then release the software applications. Records in most datasets in HI are weakly structured and non-standardized. To apply ML to the HI system, a set of patterns must be used by the algorithm to predict and visualize the ML algorithm and generate knowledge. The main patterns that were used in our framework were the geographic location, patient records, departments and hospitals, surgical history, obstetric history, family history, habits, immunization, assessment and plan, and test results. The next section describes the details of our framework, which is composed of four components.
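Because HI records are weakly structured and non-standardized, the "structure information" step usually has to map differently named fields onto one schema before any ML algorithm can use them. A minimal Python sketch of that idea follows. The alias table, the target field names, and the `normalize_record` helper are hypothetical and only illustrate the kind of normalization the methodology calls for; they are not taken from the SEMLHI implementation.

```python
# Hypothetical mapping from field names seen in raw sources to one target schema.
FIELD_ALIASES = {
    "patient_no": {"patient_no", "patientno", "pat_id", "patient number"},
    "gender": {"gender", "sex"},
    "icd10_code": {"icd10_code", "icd10", "diagnosis code", "dx"},
}


def normalize_record(raw: dict) -> dict:
    """Map a weakly structured record onto the target schema.

    Keys are matched case-insensitively, string values are stripped, and
    ICD-10 codes are upper-cased. Unknown keys are dropped, and target
    fields that are absent from the input are set to None.
    """
    lookup = {
        alias: target
        for target, aliases in FIELD_ALIASES.items()
        for alias in aliases
    }
    record = {target: None for target in FIELD_ALIASES}
    for key, value in raw.items():
        target = lookup.get(key.strip().lower())
        if target is None:
            continue  # field is not part of the target schema
        if isinstance(value, str):
            value = value.strip()
            if target == "icd10_code":
                value = value.upper()
        record[target] = value
    return record


raw_rows = [
    {"Patient Number": 17, "Sex": " F ", "DX": "e11"},
    {"pat_id": 18, "gender": "M", "room": "B2"},
]
clean_rows = [normalize_record(r) for r in raw_rows]
```

Every output row has the same keys, so later steps such as encoding and outlier detection can rely on a fixed set of columns.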

B. SEMLHI FRAMEWORK
SEMLHI frameworks were specifically geared toward facilitating the development of software applications and include components that facilitate the analysis of a health dataset. Many users, as illustrated in Figure 3, will work directly as developers or system analysts with approach frameworks or indirectly by using the results. Figure 4 summarizes the proposed framework as a conceptual framework, in addition to the mechanism used to interact with the operating system and hardware.
For software engineers, our frameworks interact with operating system components that were used by the framework, and all software manages the device hardware with the main system device used by the framework.
Our framework was composed of four components or modules (software, machine learning model, machine learning algorithms, and health informatics data). Figure 5 shows how each module interacts with all modules to work as a framework.

FIGURE 5. SEMLHI framework components.

1) HEALTH INFORMATICS DATA
In ML, data are essential, and choosing the methods for presenting and visualizing knowledge is the most important step. Our dataset sample contains ten columns with 50k rows (cases). To use a dataset on health informatics


data (HID) algorithms, a transformation into numerical features was required. Other data contain missing, duplicate or null values, or implausible values such as negative ages and extremely large integers, which could negatively affect the performance of our ML algorithm. Figure 6 describes the main roles in detecting the methodologies used in the machine algorithm model, which are classification, clustering, regression and reduction.

FIGURE 6. Machine algorithm model components.

HID uses data sources and a dictionary for translation during label encoding to convert each value in a column to a number to reduce the amount of misinterpreted data used by Bayesian inference. A node identifier was used to analyze data as a common process with patterns determined using patient-specific research identifiers. A dataset usually requires multiple records from the same patient to be identified as being related in the deidentified database. For outlier HID, a set of methods was used in the analysis to find hidden groups to remove outliers, and in an advanced step, the outlier values of the data that appear to be erroneous need to be found and cropped from the dataset. Addressing incomplete data in unsupervised clustering, chi-square and Fisher’s exact tests were performed to determine the patterns that are discriminating between pair clusters [29].
To predict disease, we used ICD-10 with multiple labels, as each patient has an ICD code in their health records, which can affect all regions of the retina. However, there is currently no classification system [30] for distinguishing anterior (peripheral) and posterior (macular) data. We hypothesize that these classifications were characterized by D and refractive features, highlighting the disparity in the types of disease.
Collected electrocardiograph data were used to focus on the D most common diagnosis cases in the laboratory test result database: $D = \{d_1, d_2, d_3, \ldots, d_n\}$, where $d$ is a disease that was applicable to a diagnosis code and $n$ is the number of disease classes using the k-means algorithm with multiple labels. Algorithm 1 presents the pseudocode for the k-nearest neighbor algorithm for multilevel learning.
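As a rough illustration of the HID steps described in this subsection, the following Python sketch drops implausible ages, counts the most common diagnosis codes (the set D), and label-encodes a categorical column through a translation dictionary. It is a sketch under stated assumptions rather than the authors' code: the column names, the age bound of 120, and the helper names are hypothetical, and only the Python standard library is used.

```python
from collections import Counter


def drop_implausible_ages(rows, max_age=120):
    """Remove rows whose age is missing, negative, or extremely large.

    The upper bound is an assumed plausibility limit, not a value from the paper.
    """
    return [
        r for r in rows
        if isinstance(r.get("age"), int) and 0 <= r["age"] <= max_age
    ]


def most_common_diagnoses(rows, d):
    """Return the d most frequent diagnosis codes, i.e. the set D in the text."""
    counts = Counter(r["icd10_code"] for r in rows)
    return [code for code, _ in counts.most_common(d)]


def label_encode(rows, column):
    """Replace each distinct value in `column` with an integer code.

    Returns the encoded rows and the translation dictionary (value -> code).
    Codes follow the sorted order of the values so the mapping is reproducible.
    """
    dictionary = {
        value: code
        for code, value in enumerate(sorted({r[column] for r in rows}))
    }
    encoded = [{**r, column: dictionary[r[column]]} for r in rows]
    return encoded, dictionary


rows = [
    {"age": 34, "gender": "F", "icd10_code": "E11"},
    {"age": -2, "gender": "M", "icd10_code": "I10"},      # negative age
    {"age": 51, "gender": "M", "icd10_code": "E11"},
    {"age": 999999, "gender": "F", "icd10_code": "J45"},  # extremely large
    {"age": 29, "gender": "F", "icd10_code": "I10"},
    {"age": 60, "gender": "M", "icd10_code": "E11"},
]

clean = drop_implausible_ages(rows)
top_codes = most_common_diagnoses(clean, d=2)
encoded, gender_codes = label_encode(clean, "gender")
```

Returning the translation dictionary alongside the encoded rows lets predictions be translated back into the original labels later.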


Algorithm 1 k-Nearest Neighbor Algorithm for Multi-level Learning by Using Correlation, Diagnosis Code, and Label Weight With Frequency
input: heterogeneous data source; number of neighbors k; correlation n; training set; label of x
while condition do
  1. for all heterogeneous datasets, correlate multiple labels: for i = 1 to t, add one label and then join the table
  2. apply ML to one label based on Freq., DG weight, and lowest accuracy
  3. create a new rule based on step 2
  4. apply the new rule to all new schemas; create a new rule: if micro = sen and test = normal, then mml = 3
  5. classify ML based on the DC category
  6. predict the new disease based on the rule created
end

The data module reads the data from data sources, such as CSV files or any other available sources; this module includes a set of algorithms that automatically remove missing values, clean the data to remove noise, and encode some features. Predicting missing values with incomplete data, for classification, normally requires decision trees; for a small amount of sample data or large numbers of genes, feature selection techniques, such as genetic algorithms and particle swarm optimization, are widely used [31]. The DRFLLS tool gave the best estimates of missing values for a dataset that has a small rate of missing values [32].
As we have 750 case categories in our sample test data, represented by a 27-laboratory test, after running this module, a new dataset that includes 18 columns and 750 rows is generated. Figure 7 summarizes the ages with category features clustered by laboratory test results.

FIGURE 7. Age and category feature summary with other variables (test results, gender, and ICD-10).

2) ML ALGORITHMS
Machine learning algorithms (MLAs) are used to compute the parameters that might define a model [14], optimize its network topology and improve the system convergence without losing information. MLAs including submodules are listed in Table 3.
As a supervised learning method, k-nearest neighbors (KNN) [33] can be used for classification and prediction problems. KNN makes decisions based on the dominant categories of k objects rather than a single object category. Figure 8 identifies most of the MLAs used for health classification.
As all the data in our sample of datasets were prepared using the SEMLHI framework, the output method will supervise the “label data” for this KNN algorithm with multiple labels and evaluate our result (KNN was used for supervised learning, while k-means was used for unsupervised learning). k-Means can be used for datasets that include a million labeled data points. Approximate nearest neighbors


(ANNs), which is usually 10x to 100x faster than KNN support vector machines (SVMs), is a good and fast solution for many problems and will almost always outperform KNNs. Figure 9 shows that logistic regression has high accuracy compared with expected and real predictions.

TABLE 3. Machine learning algorithms sub model.

FIGURE 8. Machine learning algorithms used for health classification.

FIGURE 9. High accuracy of logistic regression compared with that of other algorithms.

In supervised learning, the dataset contains n rows (cases); each case is evaluated using a function $f : A \to B$ and compared with label A or label B according to f, by evaluating E and comparing the results to learn from the training set of n cases. f has a set n(d). In unsupervised learning, the data are not labeled. To apply the data in the analysis, dimensional reduction is applied [34]. The training set $T = \{x_i \in A : 1 \le i \le n\}$ contains n objects, each of which can belong to one of the k classes $C_1, \ldots, C_k \subseteq A$; in the evaluation phase, algorithm f is applied to the set and assigns each input x to a class $C_j$, $1 \le j \le k$.
For clustering, we need to calculate the distance d between the two objects x and y by comparing the values of their n features and applying the Minkowski metric, $d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^{p}\right)^{1/p}$, where $p \ge 1$ is the order of the metric ($p = 1$ gives the Manhattan distance and $p = 2$ the Euclidean distance).

3) MACHINE ALGORITHM MODEL
Machine learning helps us extract useful features from a dataset to address or predict health-related events [35]. The machine algorithm model (MAM) component includes five submodules: read the data, prepare the data, train the model, test and evaluate the model, and predict new data. Figure 10 describes the sequence of these stages.

FIGURE 10. Mechanism of the machine algorithm model.

The challenge for this component was to use the right type of algorithm, which can optimally solve the dataset while avoiding high bias or variance. The main component of the MAM was used to analyze the dataset based on the set of conditions. If the dataset includes > 50 labeled samples, then classification algorithms will be used for the selection;


otherwise, cluster algorithms will be applied. If the dataset requires prediction of the quantity, regression algorithms will be used; otherwise, dimensional reduction will be applied.
Based on the original data, five algorithms were used to predict the laboratory test results utilizing the MAM component of the SEMLHI framework. ML approaches and algorithms can achieve better performances than expert-knowledge-based approaches [30]. ML algorithms use two types of techniques: supervised learning and unsupervised learning. For the MLA module, we first determine which techniques to use; then, we select the most suitable algorithms to use based on mathematical selection related to certain criteria. For the different algorithms applied, Table 4 shows their accuracy results (KNN classifier, linear SVC, logistic regression, multinomial NB, and random forest classifier).

TABLE 4. Evaluation of the accuracy of machine learning models.

We compared our approach with previously published systems in terms of performance to evaluate the accuracy of the machine learning models. The accuracy results for different algorithms were obtained after applying them to 750 cases, with linear SVC having values of approximately 0.57, compared with the KNN classifier, logistic regression, multinomial NB, and random forest classifier.

4) SOFTWARE
The software module, which is visualized in Figure 11, includes a subclass that includes reuse, performance, testing, privacy, and security. For software testing, the main point was to verify that the code was running correctly by testing the code under known conditions and checking that the results were as expected [36]. Visual analytic and interactive visualizations offer a higher degree of freedom for users for feature filtering, sorting patterns according to different interestingness measures, templating, and providing details on demand. Various visualization techniques, such as EventExplorer, ActiviTree, MatrixWave and DecisionFlow, can be used. Patterns can be clustered using an SOM or projection method, while plot patterns can use the double-decker method [37].

FIGURE 11. The software module includes subclasses including reuse, performance, testing, privacy, and security.

This class was used to test the memory or CPU resource usage for the application. The performance issues were determined by first measuring them and then profiling the code. Then, the optimization of that code was carried out using the benchmark, which was the best choice for comparing the results to improve the optimization performance. Research on code smells found genetic algorithms, used by 22.22%, to be the most commonly utilized machine learning technique [38].
In multi-label classification, a prediction containing a subset of the actual classes should be considered better than a prediction that contains none of them, i.e., predicting two of the three labels correctly is better than predicting no labels at all. To measure a multi-class classifier, a misclassification analysis using micro- and macro-averaging was carried out [16].
The security module has a significant impact on software development, maintenance, cost, and quality; security processes are implemented by integrating security activities and tools in the software development process, utilizing security requirement management, and providing training for developers.

IV. CONCLUSION
This article addressed an important HI with ML topic in software engineering by proposing an efficient new method approach related to software engineering, identified in prior research studies, using original data sets collected during the last 3 years from a Palestine hospital. This methodology allows developers to analyze and develop software for the HI model and create a space in which software engineering and ML experts can work together on the ML model life-cycle, especially in the health field. This manuscript proposed a framework that included a theoretical framework composed of four modules (software, ML model, ML algorithms, and HI data). Three system engineering methods were compared: Vee, Agile and the new SEMLHI methodology. The results showed the delivery of the new methodology for one-shot delivery. For the MAM component on the SEMLHI framework, laboratory test results were obtained using five algorithms to test the accuracy of the ICD-10 results using equations and to evaluate the accuracy of the ML models with a sample size of 750 patients. The results for MAM showed that the SVC was approximately 0.57.

AVAILABILITY OF DATA AND MATERIALS
Data that support the findings of this research were available from The Palestinian Ministry of Health, but restrictions were applied to the availability of these data, which were used under license for the current study and thus were not publicly available. Data are, however, available from the authors upon


reasonable request and with permission of the corresponding author.

ABBREVIATIONS
SEMLHI: Software Engineering for Machine Learning in Health Informatics
SE: Software Engineering
ML: Machine Learning
HID: Health Informatics Data
HI: Health Informatics
MAM: Machine Algorithm Model

ETHICS APPROVAL AND CONSENT TO PARTICIPATE
The research meets all applicable standards with regard to the ethics of experimentation and research integrity, and the following was certified/declared true. The informed consent of human participants was obtained in written format, and it was approved by the Palestinian Ministry of Health. As expert scientists in the concerned field, the authors have submitted the paper with full responsibility, following due ethical procedure, and there is no duplicate publication, fraud, plagiarism, or concerns about animal or human experimentation.

CONSENT FOR PUBLICATION
Not applicable.

COMPETING INTERESTS
All authors report no conflicts of interest.

FUNDING
Not applicable.

ACKNOWLEDGMENT
The authors would like to thank the Health Minister of the State of Palestine, Dr. J. Awwad, for allowing us to access the Palestinian patient dataset, and all the teams that supported us during the last two years, whose feedback greatly improved this manuscript.

REFERENCES
[1] A. Holzinger, “Interactive machine learning: Experimental evidence for the human in the algorithmic loop,” Appl. Intell., vol. 49, no. 7, pp. 2401–2414, 2019.
[2] T. A. Mohammed, A. Ghareeb, H. Al-Bayaty, and S. Aljawarneh, “Big data challenges and achievements: Applications on smart cities and energy sector,” in Proc. 2nd Int. Conf. Data Sci., E-Learn. Inf. Syst., 2019, p. 26.
[3] B. Cakici, K. Hebing, M. Grünewald, P. Saretok, and A. Hulth, “CASE: A framework for computer supported outbreak detection,” BMC Med. Inform. Decis. Making, vol. 10, no. 1, p. 14, 2010.
[4] A. J. Vickers, T. Salz, E. Basch, M. R. Cooperberg, P. R. Carroll, F. Tighe, J. Eastham, and R. C. Rosen, “Electronic patient self-assessment and management (SAM): A novel framework for cancer survivorship,” BMC Med. Inform. Decis. Making, vol. 10, no. 1, p. 34, 2010.
[5] A. Ismail, A. Shehab, and I. M. El-Henawy, “Healthcare analysis in smart big data analytics: Reviews, challenges and recommendations,” in Security in Smart Cities: Models, Applications, and Challenges, vol. 9, A. E. Hassanien, M. Elhoseny, S. H. Ahmed, and A. K. Singh, Eds. Cham, Switzerland: Springer, Nov. 2019, pp. 27–45.
[6] J. F. Bobb, B. C. Henn, L. Valeri, and B. A. Coull, “Statistical software for analyzing the health effects of multiple concurrent exposures via Bayesian kernel machine regression,” Environ. Health, vol. 17, no. 1, p. 67, 2018.
[7] B. Aribisala and O. Olabanjo, “Medical image processor and repository–MIPAR,” Inform. Med. Unlocked, vol. 12, pp. 75–80, Jul. 2018.
[8] W. Aigner and S. Miksch, “CareVis: Integrated visualization of computerized protocols and temporal patient data,” Artif. Intell. in Med., vol. 37, no. 3, pp. 203–218, Jul. 2006.
[9] J. Krause, A. Perer, and H. Stavropoulos, “Supporting iterative cohort construction with visual temporal queries,” IEEE Trans. Vis. Comput. Graph., vol. 22, no. 1, pp. 91–100, Jan. 2016.
[10] R. K. Pathinarupothi, P. Durga, and E. S. Rangan, “Data to diagnosis in global health: A 3P approach,” BMC Med. Inform. Decis. Making, vol. 18, no. 1, pp. 1–13, 2018.
[11] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques. Amsterdam, The Netherlands: Elsevier, 2016, pp. 438–441.
[12] Q.-C. To, J. Soto, and V. Markl, “A survey of state management in big data processing systems,” VLDB J., vol. 27, no. 6, pp. 847–872, Dec. 2018.
[13] S. R. Salkuti, “A survey of big data and machine learning,” Int. J. Elect. Comput. Eng., to be published. Accessed: Jan. 7, 2020. [Online]. Available: http://ijece.iaescore.com/index.php/IJECE/article/view/19184/pdf
[14] F. Khomh, B. Adams, J. Cheng, M. Fokaefs, and G. Antoniol, “Software engineering for machine-learning applications: The road ahead,” IEEE Softw., vol. 35, no. 5, pp. 81–84, Sep. 2018.
[15] T. A. Mohammed, Y. I. Hamodi, and N. T. Yousir, “Intelligent enhancement of organization work flow and work scheduling using machine learning approach tree algorithm,” Int. J. Comput. Sci. Netw. Secur., vol. 18, no. 6, pp. 87–90, 2018.
[16] J. A. Diao, I. S. Kohane, and A. K. Manrai, “Biomedical informatics and machine learning for clinical genomics,” Hum. Mol. Genet., vol. 27, no. R1, pp. R29–R34, May 2018.
[17] P.-H. Cheng, Y.-P. Chen, and J.-S. Lai, “An interflow system requirement analysis in health informatics field,” in Proc. WRI World Congr. Comput. Sci. Inf. Eng., vol. 1, 2009, pp. 712–716.
[18] C. George, P. Duquenoy, and D. Whitehouse, “eHealth: Legal, ethical and governance challenges,” in eHealth: Legal, Ethical and Governance Challenges, C. George, D. Whitehouse, and P. Duquenoy, Eds. Berlin, Germany: Springer, 2014, pp. 1–398.
[19] K. N. Mishra and C. Chakraborty, “A novel approach towards using big data and IoT for improving the efficiency of m-health systems,” in Advanced Computational Intelligence Techniques for Virtual Reality in Healthcare, vol. 875. Cham, Switzerland: Springer, 2020, pp. 123–139.
[20] B. Farahani, M. Barzegari, F. Shams Aliee, and K. A. Shaik, “Towards collaborative intelligent IoT eHealth: From device to fog, and cloud,” Microprocessors Microsyst., vol. 72, Feb. 2020, Art. no. 102938.
[21] C. Oliver, “Critical realist grounded theory: A new approach for social work research,” Brit. J. Social Work, vol. 42, no. 2, pp. 371–387, Mar. 2012.
[22] J. Disantostefano, “International classification of diseases 10th revision (ICD-10),” J. Nurse Practitioners, vol. 5, no. 1, pp. 56–57, Jan. 2009.
[23] K. D. Clark, T. T. Woodson, R. J. Holden, R. Gunn, and D. J. Cohen, “Translating research into agile development (TRIAD): Development of electronic health record tools for primary care settings,” Methods Inf. Med., vol. 58, no. 1, pp. 1–8, Jun. 2019.
[24] T. A. Mohammed, S. Alhayli, S. Albawi, and A. Deniz Duru, “Intelligent database interface techniques using semantic coordination,” in Proc. 1st Int. Sci. Conf. Eng. Sci.-3rd Sci. Conf. Eng. Sci. (ISCES), Jan. 2018, pp. 13–17.
[25] T. A. Mohammed, O. Bayat, O. N. Uçan, and S. Alhayali, “Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems,” Found. Sci., to be published.
[26] T. A. Mohammed, S. Alhayali, O. Bayat, and O. N. Uçan, “Feature reduction based on hybrid efficient weighted gene genetic algorithms with artificial neural network for machine learning problems in the big data,” Sci. Program., vol. 2018, pp. 1–10, Oct. 2018.
[27] T. Weilkiens, J. G. Lamm, S. Roth, and M. Walker, “B: The V-Model,” in Model-Based System Architecture. Hoboken, NJ, USA: Wiley, 2015, pp. 343–352.
[28] M. Al-Zewairi, M. Biltawi, W. Etaiwi, and A. Shaout, “Agile software development methodologies: Survey of surveys,” J. Comput. Commun., vol. 05, no. 05, pp. 74–97, 2017.
[29] Y. Zhou, “Predictive big data analytics using the UK Biobank data,” Sci. Rep., vol. 9, no. 1, p. 6012, Dec. 2019.

[30] A. J. Steele, S. C. Denaxas, A. D. Shah, H. Hemingway, and N. M. Luscombe, “Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease,” PLoS ONE, vol. 13, no. 8, Aug. 2018, Art. no. e0202344.
[31] W. Pearson, C. T. Tran, M. Zhang, and B. Xue, “Multi-round random subspace feature selection for incomplete gene expression data,” in Proc. IEEE Congr. Evol. Comput. (CEC), Jun. 2019, pp. 2544–2551.
[32] S. Al-Janabi and A. F. Alkaim, “A nifty collaborative analysis to predicting a novel tool (DRFLLS) for missing values estimation,” Soft Comput., vol. 24, no. 1, pp. 555–569, Jan. 2020.
[33] J. Salvador-Meneses, Z. Ruiz-Chavez, and J. Garcia-Rodriguez, “Compressed kNN: K-nearest neighbors with data compression,” Entropy, vol. 21, no. 3, p. 234, Mar. 2019.
[34] D. A. Clifton, J. Gibbons, J. Davies, and L. Tarassenko, “Machine learning and software engineering in health informatics,” in Proc. 1st Int. Workshop Realizing AI Synergies Softw. Eng. (RAISE), Jun. 2012, pp. 37–41.
[35] E. Boonchieng and K. Duangchaemkarn, “Digital disease detection: Application of machine learning in community health informatics,” in Proc. 13th Int. Joint Conf. Comput. Sci. Softw. Eng. (JCSSE), Jul. 2016.
[36] J. Frochte and J. Frochte, “Python, NumPy, SciPy und Matplotlib—In a nutshell,” in Maschinelles Lernen. Munich, Germany: Carl Hanser Verlag GmbH, 2019, pp. 32–67.
[37] W. Jentner and D. A. Keim, “Visualization and visual analytic techniques for patterns,” in High-Utility Pattern Mining. Cham, Switzerland: Springer, 2019, pp. 303–337.
[38] M. I. Azeem, F. Palomba, L. Shi, and Q. Wang, “Machine learning techniques for code smell detection: A systematic literature review and meta-analysis,” Inf. Softw. Technol., vol. 108, pp. 115–138, Apr. 2019.

MOHAMMED MOREB was born in Hebron, Palestine, in 1981. He received the B.Sc. degree in information technology from Palestine Polytechnic University, and the M.Sc. degree in computer science from Al-Quds University. He is currently pursuing the Ph.D. degree in electronic and computer engineering with Altinbas University.
The focus of his research is software engineering in health informatics. He has over twelve years of experience in managing software development projects, including large government IT systems.

TAREQ ABED MOHAMMED received the B.Sc. degree in computer science from the College of Science, Kirkuk University, Kirkuk, Iraq, in 2007, the M.Sc. degree from Cankaya University, Ankara, Turkey, in 2012, and the Ph.D. degree in electronic and computer engineering from Altinbas University, Istanbul, Turkey, in 2019.
In 2019, he started teaching at the College of Computer Science and Information Technology, University of Kirkuk. He has advised many studies for M.Sc. and Ph.D. students at various universities, participated in many international conferences, and contributed to various scientific studies.

OGUZ BAYAT received the B.S. degree from Istanbul Technical University, Istanbul, Turkey, in 2000, the M.S. degree from the University of Hartford, CT, USA, in 2002, and the Ph.D. degree from Northeastern University, Boston, MA, USA, in 2006, all in electrical engineering.
He completed the Executive Certificate Program in Technical Management and Leadership at Massachusetts Institute of Technology, Boston, MA, USA, in 2009. Since 2011, he has been serving as a Professor with the Department of Electrical and Electronics Engineering, Altinbas University. He is also an Advisor to the President and the Director of the Graduate School of Science and Engineering, Altinbas University.

OGUZ ATA received the B.Sc. degree in computer engineering from Sakarya University in 2004, the M.Sc. degree in computer engineering from Beykent University in 2008, and the Ph.D. degree in software engineering from Trakya Üniversitesi in 2012. He has been the Head of the Department of Software Engineering at Altinbas University, where he is also a lecturer. His research interests include software repository mining, software measurement and testing, process improvement, and requirements engineering.
