1. INTRODUCTION
The field of health informatics (HI) aims to provide large-scale linkage among disparate
ideas. Healthcare datasets are typically incomplete and noisy; as a result, reading data
across linked datasets traditionally fails within the discipline of software engineering.
Machine learning (ML) is a rapidly maturing branch of computer science because it can
exploit data stored at large scale. Many ML tools can be used to analyze data and yield
knowledge that can improve the quality of work for both staff and doctors; for developers,
however, there is currently no established methodology that can be used. Within software
engineering, there has been a lack of approaches for evaluating which tasks are better
performed by automation and which require human involvement or human-in-the-loop
approaches [1]. Real-world big data presents many analysis challenges [2], including OLAP
over mass data, mass data protection, mass data surveying, and mass data dissemination.
Recently, a set of frameworks has been used to develop data analysis tools such as Win-
CASE [3] and SAM [4]. The market offers many data analysis tools that can discover
interesting patterns and hidden relationships to support decision makers [5]. BKMR, an R
package, provides a statistical approach for estimating the multivariable exposure-response
function of health effects [6].
Augmentor provides a Python image library for augmentation [7], while CareVis was
designed for, and used in, the visualization of medical treatment plans and patient data [8].
Other applications require a visual interface, such as COQUITO [9]. For health-care data
analytics, the widely known 3P tools [10] were used. Simple applications such as WEKA
provide a GUI for many machine learning algorithms [11], while Apache Spark provides a
cluster computing framework [12]; both are powerful systems that can be used in various
applications for solving problems with big data and machine learning [13]. Table 1
summarizes the main tools used for big data analytics with respect to the task. Software
engineering for machine learning applications (SEMLA) discusses the challenges, new
insights, and practical ideas regarding the engineering of ML and artificial intelligence
(AI) systems [14]. NSGA-II proposed algorithms for real-world applications involving more
than one objective function, enhancing performance in terms of both diversity and
convergence [15]. ML algorithms in clinical
genomics generally come in three main forms: supervised, unsupervised and semi-supervised
[16]. Interflow system requirement analysis (ISRA) has been used to determine the system
requirements [17].
Machine learning is a field of computer science that frequently uses statistical techniques
to enable computers to "learn" from saved datasets. Unsupervised learning, or data mining,
focuses more on exploratory data analysis and is known as learning supported by data
analytics. Patient laboratory-test queue management and wait-time prediction are a
challenging and complicated job: each patient may require different phase operations
(tasks), such as a check-up, various tests (e.g., a sugar-level test or blood test), X-rays,
or surgery, so each patient can involve from 0 to N different medical tests according to
their condition.
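This variable number of tasks per patient can be modelled directly in code. The sketch below is a minimal illustration in Python; the class and method names (`Task`, `Patient`, `total_wait_estimate`) and the duration figures are hypothetical, not part of the described system:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    duration_min: int  # estimated service time for this task

@dataclass
class Patient:
    patient_id: str
    tasks: list = field(default_factory=list)  # 0..N tasks per patient

    def total_wait_estimate(self):
        # Naive estimate: sum of the patient's own task durations.
        # A real wait-time predictor would also model queue lengths
        # and resource contention, as discussed above.
        return sum(t.duration_min for t in self.tasks)

p = Patient("P001", [Task("check-up", 15), Task("blood test", 10)])
print(p.total_wait_estimate())  # 25
```

A patient with no tasks simply yields an estimate of zero, which matches the "0 to N tasks" formulation.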
In this article, based on a grounded theory methodology [21], the researchers propose a
novel methodology, SEMLHI, for developing a framework by defining the research problem
and the methodology for developers. The SEMLHI framework includes a theoretical
framework to support research and design activities that incorporate existing knowledge.
The SEMLHI framework is composed of four components that help developers observe the
health application flow from the main module to submodules to run and validate specific
tasks; this enables multiple developers to work on different modules of the application
simultaneously. The SEMLHI framework supports a methodological approach to
conducting research on health informatics. It also provides a structure that presents a
common set of ML terminology to use, compare, measure, and design software systems in
the area of health. This creates a space in which SE and ML experts can work on a specific
methodological approach that enables health informatics software development teams to
integrate the ML model lifecycle. Our methodology is applicable both to current systems
and to the development of new systems that add an ML module to current systems. It covers
regular updates that add data to the system as well as irregular updates that add new
features, such as new versions of ICD diagnosis codes, minor model improvements and bug
fixes, new functionality required by the client, and new hardware or architectural
constraints.
1.1 Objectives
Healthcare datasets are typically incomplete and noisy; as a result, reading data across
linked datasets traditionally fails within the discipline of software engineering. Machine
learning (ML) is a rapidly maturing branch of computer science because it can exploit data
stored at large scale. Many ML tools can be used to analyse data and yield knowledge that
can improve the quality of work for both staff and doctors; for developers, however, there
is currently no established methodology that can be used. Within software engineering,
there has been a lack of approaches for evaluating which tasks are better performed by
automation and which require human involvement or human-in-the-loop approaches. Real-world
big data presents many analysis challenges, including OLAP over mass data, mass data
protection, mass data surveying, and mass data dissemination.
CHAPTER-2
2. LITERATURE REVIEW
Benefits highlighted in the reviewed literature include instant access and improved
productivity.
CHAPTER-3
3. SYSTEM ANALYSIS
Analysis is a logical process. The objective of this phase is to determine exactly what must
be done to solve the problem. Tools such as class diagrams, sequence diagrams, data flow
diagrams, and a data dictionary are used in developing a logical model of the system.
A software life cycle model (also termed a process model) is a pictorial and diagrammatic
representation of the software life cycle. A life cycle model represents all the activities
required to take a software product through its life cycle stages, and it captures the
order in which these activities are to be undertaken.
In short, a life cycle model maps the various activities performed on a software product
from its inception to retirement.
Different software development life cycle models specify and design the process followed
during the software development phase. These models are also called "software development
process models." Each process model follows a series of phases unique to its type to
ensure success in the steps of software development.
Waterfall Model
RAD Model
Spiral Model
Incremental Model
Iterative Model
Among all these models, the spiral model is one of the best.
Spiral Model
This SDLC model helps the team adopt elements of one or more process models, such as the
waterfall or incremental model. The spiral technique is a combination of rapid prototyping
and concurrency in design and development activities. Each cycle in the spiral begins with
the identification of objectives for that cycle.
All projects are feasible when provided with unlimited resources and infinite time;
unfortunately, the development of a computer-based system or product is more likely to be
plagued by a scarcity of resources and difficult delivery dates. It is both necessary and
prudent to evaluate the feasibility of a project at the earliest possible time. Months or
years of effort, thousands or millions of dollars, and untold professional embarrassment
can be averted if an ill-conceived system is recognized early in the definition phase.
Feasibility and risk analysis are related in many ways: if project risk is great, the
feasibility of producing quality software is reduced. During product engineering, however,
we concentrate our attention on four primary areas of interest.
This application is going to be used in an Internet environment, the World Wide Web (WWW),
so it is necessary to use a technology capable of providing networking facilities to the
application. The application is also able to work in a distributed environment. The GUI is
developed using HTML to capture information from the customer; HTML is used to display
content in the browser, which communicates over the TCP/IP protocol stack, and HTML is
interpreted by the browser rather than compiled.
The economic issues that usually arise during the economic feasibility stage are whether
the system will be used if it is developed and implemented, and whether the financial
benefits equal or exceed the costs. The cost of developing the project includes the cost of
conducting a full system investigation and the cost of hardware and software for the class
of system being considered; the benefits come in the form of reduced costs or fewer costly
errors.
In our application, the front end is developed as a GUI, so it is very easy for the customer
to enter the necessary information. However, the customer must have some knowledge of
using web applications before using our application.
Functional requirements describe what the system should do; here they are described in
terms of the design of input and output below.
The input design is the link between the information system and the user. It comprises
developing the specifications and procedures for data preparation and the steps necessary
to put transaction data into a usable form for processing; this can be achieved either by
instructing the computer to read data from a written or printed document or by having
people key the data directly into the system.
Non-functional requirements are constraints that must be adhered to during development.
They limit the resources that can be used and set bounds on aspects of the software's
quality.
Five members can complete the project in 2–4 months if they work full-time on it.
3.4 Modules
3.4.1 Health Informatics Data
To predict any disease, we need to build machine learning models using datasets, and these
datasets often contain missing, null, and non-numeric data; this kind of data can degrade
ML prediction accuracy. To overcome this problem, the author applies preprocessing to the
health-care data to remove all missing and null values and then converts the non-numeric
data to numeric data using the Python sklearn preprocessing classes. The dataset may also
contain unnecessary columns or attributes; to remove these attributes, the author applies a
dimensionality reduction algorithm called PCA (principal component analysis), which removes
unnecessary attributes from the dataset and keeps only the important attributes necessary
to make correct predictions.
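As a rough illustration of the imputation and encoding steps described above, the following pure-Python sketch replaces missing numeric values with the column mean and label-encodes non-numeric columns. In the actual module, the sklearn preprocessing classes (and `PCA` for dimensionality reduction) would do this at scale; the `preprocess` helper and its toy rows here are illustrative assumptions only:

```python
def preprocess(rows):
    """Impute missing numeric values with the column mean and
    label-encode non-numeric columns (a hand-rolled stand-in for
    sklearn's preprocessing utilities)."""
    cols = list(zip(*rows))
    out_cols = []
    for col in cols:
        numeric = [v for v in col if isinstance(v, (int, float))]
        non_null = sum(1 for v in col if v is not None)
        if numeric and len(numeric) == non_null:
            # numeric column: fill None with the mean of observed values
            mean = sum(numeric) / len(numeric)
            out_cols.append([v if v is not None else mean for v in col])
        else:
            # non-numeric column: map each distinct label to an integer code
            labels = {}
            out_cols.append([labels.setdefault("missing" if v is None else v,
                                               len(labels)) for v in col])
    return [list(r) for r in zip(*out_cols)]

rows = [[1.0, "M"], [None, "F"], [3.0, "M"]]
print(preprocess(rows))  # [[1.0, 0], [2.0, 1], [3.0, 0]]
```

After this step every cell is numeric, which is the precondition the ML algorithms in the next module require.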
3.4.2 ML Algorithms:
In this module, the author uses various machine learning algorithms, such as Linear SVC,
Multinomial Naïve Bayes, Random Forest, Logistic Regression, and KNN. Each algorithm
trains itself on the available datasets to generate a trained model, and this trained model
is then applied to new test data to perform prediction. Using the above algorithms, we can
make the machine learn and perform predictions without any human support.
Once the above models are built, we can apply new test data to them to predict whether a
patient's lab reports are positive or negative.
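The train-then-predict workflow shared by the listed algorithms can be sketched as follows. To keep the example self-contained, a toy nearest-centroid classifier stands in for the sklearn models; it exposes the same `fit`/`predict` interface, and the two-feature "lab result" data is assumed for illustration:

```python
import math

class NearestCentroid:
    """Tiny stand-in classifier: predicts the class whose per-class
    feature centroid is closest to the query point."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = [sum(c) / len(pts) for c in zip(*pts)]
        return self

    def predict(self, X):
        return [min(self.centroids,
                    key=lambda l: math.dist(x, self.centroids[l]))
                for x in X]

def accuracy(model, X_train, y_train, X_test, y_test):
    # fit on the training split, then score predictions on held-out data
    preds = model.fit(X_train, y_train).predict(X_test)
    return sum(p == t for p, t in zip(preds, y_test)) / len(y_test)

# toy data: "positive" reports cluster near (1,1), "negative" near (0,0)
X = [[1.1, 0.9], [0.9, 1.2], [0.1, 0.0], [0.0, 0.2]]
y = ["positive", "positive", "negative", "negative"]
print(accuracy(NearestCentroid(), X, y,
               [[1.0, 1.0], [0.1, 0.1]], ["positive", "negative"]))  # 1.0
```

Any of the real algorithms named above could be dropped into `accuracy` unchanged, since sklearn estimators follow the same `fit`/`predict` convention.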
3.4.4 Software:
This module is used by developers to check the reliability of the above modules by applying
software quality checks, unit testing, and software verification.
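A minimal example of the kind of unit testing this module performs might look like the following, using Python's standard `unittest` module. The `clean_value` helper is a hypothetical stand-in for one preprocessing step, not a function from the system itself:

```python
import unittest

def clean_value(v, mean):
    """Replace a missing lab value with the column mean
    (hypothetical helper standing in for one preprocessing step)."""
    return mean if v is None else v

class TestCleaning(unittest.TestCase):
    def test_missing_replaced(self):
        self.assertEqual(clean_value(None, 4.2), 4.2)

    def test_present_kept(self):
        self.assertEqual(clean_value(7.0, 4.2), 7.0)

if __name__ == "__main__":
    # run the suite explicitly so the script can continue afterwards
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCleaning)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    print("ok" if result.wasSuccessful() else "failed")
```

Each module's behaviour gets its own test case, so a regression in preprocessing or prediction is caught before the modules are combined.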
In this tutorial, we highlight an explanation based on representation bias. The naive Bayes
classifier is a linear classifier, like linear discriminant analysis, logistic regression,
and the linear SVM (support vector machine); the difference lies in the method of
estimating the parameters of the classifier (the learning bias).
While the naive Bayes classifier is widely used in the research world, it is not widespread
among practitioners who want to obtain usable results. On the one hand, researchers find it
very easy to program and implement, its parameters are easy to estimate, learning is very
fast even on very large databases, and its accuracy is reasonably good in comparison with
other approaches. On the other hand, final users do not obtain a model that is easy to
interpret and deploy, and they do not understand the interest of such a technique.
Thus, we introduce a new presentation of the results of the learning process. The
classifier becomes easier to understand, and its deployment is also made easier. In the
first part of this tutorial, we present some theoretical aspects of the naive Bayes
classifier. Then, we implement the approach on a dataset with Tanagra. We compare the
results obtained (the parameters of the model) with those obtained with other linear
approaches such as logistic regression, linear discriminant analysis, and the linear SVM.
We note that the results are highly consistent, which largely explains the good performance
of the method in comparison with the others.
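The linearity claim above can be made concrete: in a multinomial naive Bayes classifier, a document's score for a class is the class's log prior plus a sum of per-word log likelihoods weighted by word counts, i.e. a linear function of the feature counts. The sketch below uses toy symptom data invented for illustration:

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes with Laplace smoothing.
    The resulting score is linear in the word counts, which is
    exactly the representation-bias point made above."""
    vocab = {w for d in docs for w in d}
    priors, loglik = {}, {}
    for c in set(labels):
        cdocs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = math.log(len(cdocs) / len(docs))
        counts = Counter(w for d in cdocs for w in d)
        total = sum(counts.values())
        loglik[c] = {w: math.log((counts[w] + alpha) /
                                 (total + alpha * len(vocab)))
                     for w in vocab}
    return priors, loglik

def predict(doc, priors, loglik):
    # linear score: log prior + sum of per-word log likelihoods
    def score(c):
        return priors[c] + sum(loglik[c].get(w, 0.0) for w in doc)
    return max(priors, key=score)

docs = [["fever", "cough"], ["fever", "rash"], ["healthy", "checkup"]]
labels = ["flu", "flu", "well"]
priors, loglik = train_multinomial_nb(docs, labels)
print(predict(["fever", "cough"], priors, loglik))  # flu
```

Reading off `priors[c]` and `loglik[c]` gives exactly the intercept and weights of a linear decision function, which is the easier-to-interpret presentation the tutorial argues for.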
Example: suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the
KNN algorithm, since it works on a similarity measure: our KNN model finds the features of
the new image that are most similar to those of the cat and dog images and, based on the
most similar features, assigns it to either the cat or the dog category.
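The cat/dog example can be sketched in a few lines. The feature names below (ear pointiness, snout length) are invented purely for illustration; any numeric feature vectors extracted from the images would do:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """k-nearest-neighbours by Euclidean distance over feature vectors."""
    # sort labelled examples by distance to the query, keep the k closest
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # majority vote among the k nearest labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# toy features: (ear_pointiness, snout_length) -- assumed, illustrative
train = [((0.9, 0.2), "cat"), ((0.8, 0.3), "cat"),
         ((0.2, 0.9), "dog"), ((0.3, 0.8), "dog")]
print(knn_predict(train, (0.85, 0.25)))  # cat
```

Because KNN stores the training data and defers all work to prediction time, it needs no training phase at all, which is why it suits this kind of similarity-based identification.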
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[1] using
the random subspace method, which, in Ho's formulation, is a way to implement the
"stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who
registered "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).
The extension combines Breiman's "bagging" idea with the random selection of features,
introduced first by Ho [1] and later independently by Amit and Geman [13], in order to
construct a collection of decision trees with controlled variance.
Random forests are frequently used as "black-box" models in businesses, as they generate
reasonable predictions across a wide range of data while requiring little configuration.
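The two ingredients named above, bagging and random feature selection, can be demonstrated with a deliberately tiny forest of one-level trees (decision stumps). This is a sketch of the idea on invented toy data, not a production random forest:

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, features):
    """One-level tree restricted to a random feature subset
    (the 'random selection of features' ingredient)."""
    best = None
    for f in features:
        for t in {x[f] for x in X}:
            left = [l for x, l in zip(X, y) if x[f] <= t]
            right = [l for x, l in zip(X, y) if x[f] > t]
            if not left or not right:
                continue
            lp, rp = majority(left), majority(right)
            err = sum((lp if x[f] <= t else rp) != l for x, l in zip(X, y))
            if best is None or err < best[0]:
                best = (err, f, t, lp, rp)
    if best is None:                    # degenerate one-class sample
        return ("leaf", majority(y))
    return ("split",) + best[1:]

def stump_predict(stump, x):
    if stump[0] == "leaf":
        return stump[1]
    _, f, t, lp, rp = stump
    return lp if x[f] <= t else rp

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    k = max(1, int(m ** 0.5))           # features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap (bagging)
        feats = rng.sample(range(m), k)
        forest.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx], feats))
    return forest

def forest_predict(forest, x):
    # aggregate by majority vote, which is what controls the variance
    return majority([stump_predict(s, x) for s in forest])

X = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.0), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
y = ["neg", "neg", "neg", "pos", "pos", "pos"]
forest = fit_forest(X, y)
print(forest_predict(forest, (1.0, 0.95)))  # pos
```

Each tree sees a different bootstrap sample and a different feature subset, so individual trees are weak and decorrelated while the vote across them is stable, which is the controlled-variance property described above.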
UML Introduction
The Unified Modelling Language allows the software engineer to express an analysis model
using a modelling notation governed by a set of syntactic, semantic, and pragmatic rules. A
UML system is represented using five different views that describe the system from
distinctly different perspectives.
1) UML analysis modelling, which focuses on the user model and structural model views
of the system.
2) UML design modelling, which focuses on the behavioural modelling, implementation
modelling, and environmental model views.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks,
patterns, and components.
7. Integrate best practices.
System design aspects
Once the analysis stage is completed, the next stage is to determine in broad outline form
how the problem might be solved. During system design, we are beginning to move from the
logical to physical level. System design involves architectural and detailed design of the
system. Architectural design involves identifying software components, decomposing them
into processing modules and conceptual data structures, and specifying the interconnections
among components.
Two kinds of approaches are available:
Top-down approach
Bottom-up approach
Design of the code
Since information systems projects are designed with space, time, and cost savings in mind,
coding methods are used in which conditions, words, and ideas are abbreviated to control
errors and speed the entire process. The purpose of a code is to facilitate the
identification and retrieval of information. A code is an ordered collection of symbols
designed to provide unique identification of an entity or an attribute.
Design of input
Design of input involves the following decisions.
Input data
Input medium
The way data should be arranged or coded
Validation needed to detect errors and the steps to follow when an error occurs.
The input controls provide ways to ensure that only authorized users access the system,
guarantee valid transactions, validate the data for accuracy, and determine whether any
necessary data has been omitted. The primary input medium chosen is the display; screens
have been developed for the input of data using HTML.
Design of output
Design of output involves the following decisions.
Information to present
Output medium
Output layout
Output of this system is given in easily understandable, user-friendly manner, Layout of the
output is decided through the discussions with the different users.
Design of control
The system should offer the means of detecting and handling errors. The input controls
described under the design of input provide ways to do this.
As the strategic value of software increases for many companies, the industry looks for
techniques to automate the production of software and to improve quality and reduce cost and
time-to-market. These techniques include component technology, visual programming,
patterns and frameworks.
Businesses also seek techniques to manage the complexity of systems as they increase in
scope and scale. In particular, they recognize the need to solve recurring architectural
problems, such as physical distribution, concurrency, replication, security, load balancing and
fault tolerance.
The Unified Modelling Language (UML) was designed to respond to these needs. Simply put,
systems design refers to the process of defining the architecture, components, modules,
interfaces, and data for a system to satisfy specified requirements, and this can be
expressed easily through UML diagrams. In this project, four basic UML diagrams are
explained:
Class Diagram
Use Case Diagram
Sequence Diagram
Activity Diagram
3.7.1 Class Diagram
In software engineering, a class diagram in the Unified Modelling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, their attributes, and the relationships between the classes. This is one of the most
important of the diagrams in development. The diagram breaks the class into three layers.
The relationships are drawn between the classes. Developers use the Class Diagram to
develop the classes. Analyses use it to show the details of the system.
In software engineering, a use case diagram in the Unified Modelling Language (UML) is a
type of behavioural diagram defined by and created from a Use-case analysis. Its purpose is
to present a graphical overview of the functionality provided by a system in terms of actors,
their goals (represented as use cases), and any dependencies between those use cases.
The main purpose of a use case diagram is to show what system functions are performed for
which actor. Roles of the actors in the system can be depicted. Use cases are used during
requirements elicitation and analysis to represent the functionality of the system. Use cases
focus on the behaviour of the system from the external point of view.
(Sequence diagram: the user starts the session, interacts with the GrapeCS-ML database, and
ends the session.)
The process flows in the system are captured in the activity diagram. Similar to a state
diagram, an activity diagram also consists of activities, actions, transitions, initial and final
states, and guard conditions.
A collaboration diagram groups together the interactions between different objects. The
interactions are listed as numbered interactions that help to trace the sequence of the
interactions. The collaboration diagram helps to identify all the possible interactions that each
object has with other objects.
Programming : Python
RAM : 4 GB