
Data management in healthcare


JUNE 4, 2021
Dataset name—Road safety
Table of Contents
Abstract
Introduction
Literature review
Summary about Weka
Dataset: Anneal
Methodology
Dataset Pre-processing and Visualization
Clustering
Classification
Summary of the Model
Conclusion/Observations for Anneal Dataset
Dataset 1 (Road Safety)
Dataset Pre-processing and Visualization
All Attributes Bar Graph
Distribution of Several Attributes
1. Road Type
2. Speed Limit
3. Road Surface Condition
4. Accident Severity
Clustering
Results and Analysis
Conclusion
References
Abstract
This study demonstrates data analysis and visualization techniques that may be used in
healthcare, and explores how data mining is used to process medical and biological data.
Beyond the skills acquired through working with the platform, the work is intended to
strengthen individual and team working abilities and report-writing skills. Several analyses
are carried out on the selected data sets.

Introduction
This report examines the role of data analytics and visualization in healthcare and
how they can support it.
The healthcare system is increasingly data-dependent. In this setting, data analytics
can yield insights into systemic waste of resources, track the performance of
individual practitioners, and even monitor the health of populations and identify
individuals at risk of chronic disease.
Visualizing health data is an effective way to share urgent health information
(Chen et al., 2020). Data visualization, the process of analysing large amounts of
data and communicating the results in a visual context, is a commonly used tool in
the current era of big data. Hospitals are now looking to data visualization to
maximize efficiency and present health-related findings.
Literature review
Healthcare is one of the most significant application areas for data analysis. Health
analytics may lower treatment costs, anticipate epidemic outbreaks, prevent avoidable
illnesses, and improve overall quality of life. Data analytics teams have developed web-based
user interfaces that estimate patient loads and plan the allocation of resources, using online
data visualization to improve overall patient care. The most prevalent application of big data
analytics in medicine is the electronic health record. Each patient has their own digital
record, including demographics, medical history, allergies, lab test results and so on.
Records are exchanged through secure information networks between public and private sector
providers. Each record consists of one modifiable file, which means that physicians can make
changes over time without paperwork and without any risk of data duplication.

Many consumers, and hence future patients, are already interested in smart devices that
record their steps, heart rate, sleep habits and so on. All of this information can be
combined with other trackable data to identify potential health risks. For example, chronic
insomnia and an elevated heart rate may indicate a risk of future heart disease. Patients are
directly involved in monitoring their own health, and health insurance incentives can
encourage a healthy lifestyle (e.g. giving money back to people who use smartwatches). Big
data analytics may also help address opioid abuse: Blue Cross Blue Shield's data scientists
have started working with analytics professionals at Fuzzy Logix on the problem. Using years
of insurance and pharmacy data, analysts at Fuzzy Logix have identified 742 risk indicators
that accurately predict whether someone is at risk of opioid abuse.

Big data in healthcare also supports strategic planning, thanks to better insight into
people's motivations. Care managers can compare check-up results across different population
groups and determine what prevents individuals from seeking treatment. The aim of online
business intelligence (BI) in healthcare is to let physicians make data-driven decisions
within seconds and improve patient treatment. This is particularly useful for patients with
multiple conditions and complex medical histories. For example, new BI tools might forecast
who is at risk of diabetes, so that those patients can be encouraged to take up additional
screening or weight management.
Summary about Weka
Weka is an open-source data mining and analysis tool with a graphical user interface
(GUI). The GUI supports a range of data pre-processing and analytical techniques, and
Weka provides both supervised and unsupervised machine learning models. The
supervised models cover classification and regression, using a variety of statistical
and machine learning methods. Unsupervised models such as K-Means clustering and
hierarchical clustering are also available. Weka also provides methods for attribute
selection, i.e. choosing the best attributes out of all those available in the
dataset. Several ranking methods are offered for selecting the variables that provide
the most information about a given target variable. We use these features of Weka
for the data analysis in this report. Weka is easy to use and easy to learn, and the
GUI makes learning the tool easier still. It can simply be downloaded and installed;
we downloaded and installed it on our own operating system. No particular setup or
user creation is needed for the installation.

We will use Weka for both visual analysis (Section A) and data exploration/mining
(Section B).
Section A

Exploration and Visualization

Dataset Name: Anneal

Dataset Pre-processing and Visualization:


The dataset contains 39 variables. The target variable is a classification variable with 6
classes. The total number of samples is 898. The basic description and details of several
attributes of the data are given below.

1. Family
This is a categorical column. It is expected to contain a total of 9 classes, but most of the
data is unlabelled or not properly labelled. There are 67 samples with the “TN” label and
59 samples with the “ZS” label; all others are noisy or unlabelled entries. Here is the
bar graph of frequencies of all categories.

2. Product Type
All samples carry the single class “C”. There are two other possible classes, H and G, but
no samples contain them.

3. Steel
There are 86 missing or noisy values. Apart from those, there are 8 different classes for
this variable.

4. Carbon
This is a numerical attribute. It ranges from 0 (minimum) to 70 (maximum). The mean value
is 3.635 and the standard deviation is 13.717. The majority of values lie in the range 0-5.

5. Hardness
This is a numerical attribute. It ranges from 0 (minimum) to 85 (maximum). The mean value
is 11.776 and the standard deviation is 24.751. This attribute is more spread out than
Carbon, but the majority of values still lie in the first bar of the histogram.
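
The summary figures quoted for the numerical attributes (minimum, maximum, mean, standard deviation) can also be computed outside the Weka GUI. The following is a minimal sketch in Python. The values in `carbon_sample` are hypothetical and are not taken from the Anneal data, and the use of the sample (n - 1) standard deviation is an assumption about how the reported figures were calculated.

```python
import statistics


def describe(values):
    """Return min, max, mean and sample standard deviation of a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # Sample standard deviation (divides by n - 1); assumed, not confirmed by the report.
        "stdev": statistics.stdev(values),
    }


# Hypothetical values for illustration only; not the real Carbon column.
carbon_sample = [0, 0, 0, 3, 0, 45, 0, 70, 0, 4]
summary = describe(carbon_sample)
print(summary)
```

A column dominated by zeros with a few large values, like this one, gives a standard deviation well above the mean, which is the same pattern the report describes for Carbon and Hardness.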

Observations/Result
The picture below contains the basic description figures for all variables. Looking at the
distribution of the variables in the dataset, most of them show very little or no variation.
We therefore pick the top 10 attributes that can best explain or predict the target variable.
The top 10 ranked attributes are:
1. surface-quality
2. thick
3. family
4. steel
5. formability
6. hardness
7. condition
8. width
9. temper rolling
10. Non-ageing

The graph above also shows that these are the variables with significant variation in the
dataset and a significant effect on the target column. We will use these variables along
with the target class for further analysis and drop the others.
The figure below shows the relation between each pair of variables/attributes.
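
The report does not state which ranking method produced the top 10 list. As an illustration only, the sketch below ranks categorical attributes by information gain, one common ranking criterion. The function names and the toy data are hypothetical, and numerical attributes such as thick or width would first need to be discretised, which this sketch does not do.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(columns, labels, top_k=10):
    """Return the top_k (name, score) pairs, highest information gain first."""
    scores = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy data for illustration only; not taken from the Anneal dataset.
columns = {
    "informative": ["a", "a", "b", "b"],  # separates the two classes perfectly
    "constant": ["x", "x", "x", "x"],     # no variation, so no information
}
labels = ["1", "1", "2", "2"]
ranking = rank_attributes(columns, labels)
print(ranking)
```

An attribute with no variation scores zero, which matches the observation that the near-constant attributes in the dataset contribute little to predicting the target.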
SECTION B

In this section we analyse the Anneal dataset from Section A along with the
Road Safety dataset.
The visual analysis and basic exploration of the Anneal dataset were already
done in Section A, so we start with clustering and classification.

Clustering
The clustering algorithm produced 10 clusters.
The clustered instances are given below.
Cluster  Instances
0        86 (10%)
1        58 (6%)
2        28 (3%)
3        67 (7%)
4        23 (3%)
5        160 (18%)
6        105 (12%)
7        114 (13%)
8        138 (15%)
9        119 (13%)
Log likelihood: -12.46102
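
As a quick consistency check on the cluster table, the counts can be summed and converted to percentages. This is a small sketch in Python using only the counts reported above; the variable names are illustrative.

```python
# Instances per cluster, as reported in the clustering output.
cluster_counts = {0: 86, 1: 58, 2: 28, 3: 67, 4: 23,
                  5: 160, 6: 105, 7: 114, 8: 138, 9: 119}

total = sum(cluster_counts.values())

# Share of each cluster, rounded to the nearest whole percent.
shares = {cluster: round(100 * n / total) for cluster, n in cluster_counts.items()}

print(total)
print(shares)
```

The counts add up to the 898 samples of the Anneal dataset, and the rounded shares agree with the percentages in the table.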

Classification

Logistic regression achieved an accuracy of 98.8% when trained on the data using the top
10 variables selected above. The figure below shows the summary of the model.
The Support Vector Classifier performed comparably, with a slightly lower accuracy of 97.2%.
The figure below shows the summary of the model.

Conclusion/Observations for Anneal Dataset


The top 10 variables among the 38 attributes can explain most of the variance in the data and
can be used to predict the target class. Most of the attributes in the data are either missing
or take the same value for most of the samples. The target can be predicted from those
variables with an F1 score of 98.8, even by linear models like logistic regression.
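
The report quotes both an accuracy and an F1 score for the classifier. These are different measures, and the sketch below shows how each can be computed from true and predicted labels. The class-weighted average of per-class F1 is an assumption about which F1 variant is meant, and the labels are hypothetical rather than model output from the Anneal data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / len(y_true)
    return score


# Hypothetical labels for illustration only.
y_true = ["1", "1", "2", "2", "3", "3"]
y_pred = ["1", "1", "2", "3", "3", "3"]

acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
print(acc, f1)
```

On this toy example the two measures differ slightly, so a reported accuracy and a reported F1 score need not be the same number even when they come from the same model.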
Dataset Name: Road Safety

The UK government collects and publishes (usually on an annual basis) detailed information
about traffic accidents across the country. This information includes, but is not limited to,
geographical locations, weather conditions, type of vehicles, number of casualties and vehicle
manoeuvres, making this a very interesting and comprehensive dataset for analysis and
research. The creation of this dataset was inspired by the one previously published by Dave
Fisher-Hickey. However, the current dataset features the following significant improvements
over its predecessor:

• It covers a wider date range of events.
• Most of the coded data variables have been transformed to textual strings using
relevant lookup tables, enabling more efficient and "human-readable" analysis.
• It features detailed information about the vehicles involved in the accidents.

The data come from the Open Data website of the UK government, where they have been
published by the Department for Transport.

The dataset comprises two CSV files:

• AccidentInformation.csv: every line in the file represents a unique traffic accident
(identified by the Accident Index column), featuring various properties related to the
accident as columns. Date range: 2005-2017
• Vehicle_Information.csv: every line in the file represents the involvement of a
unique vehicle in a unique traffic accident, featuring various vehicle and passenger
properties as columns. Date range: 2004-2016

The two files can be linked through the unique traffic accident identifier (Accident Index
column). The dataset will continue to be updated as more data are made available by the
Department for Transport.
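
Linking the two files on the accident identifier amounts to a one-to-many join: one accident row can match several vehicle rows. The sketch below illustrates this in Python with small in-memory CSV snippets. The header `Accident_Index`, the other column names and all the rows are hypothetical stand-ins, not the real file contents.

```python
import csv
import io

# Hypothetical miniature versions of the two files, for illustration only.
accidents_csv = """Accident_Index,Speed_limit
A1,30
A2,60
"""
vehicles_csv = """Accident_Index,Vehicle_Type
A1,Car
A1,Motorcycle
A2,Car
A9,Van
"""


def link_vehicles_to_accidents(accidents_file, vehicles_file, key="Accident_Index"):
    """Attach the accident properties to every vehicle record sharing the same key."""
    accidents = {row[key]: row for row in csv.DictReader(accidents_file)}
    linked = []
    for vehicle in csv.DictReader(vehicles_file):
        accident = accidents.get(vehicle[key])
        if accident is None:
            # Vehicle with no matching accident, e.g. because the date ranges differ.
            continue
        linked.append({**accident, **vehicle})
    return linked


linked = link_vehicles_to_accidents(io.StringIO(accidents_csv), io.StringIO(vehicles_csv))
for row in linked:
    print(row)
```

Because the two files cover different date ranges, some vehicle records may have no matching accident. The sketch drops those rows, which is one possible choice among several.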
Dataset Pre-processing and Visualization:
The dataset contains a total of 26 attributes, and a total of 19 samples are available. The
figure below shows bar graphs for all attributes.

As we can see, only a few variables show little variation, so we can use all of
these variables for analysis and prediction.

Let us discuss the distribution of several attributes that seem important in
affecting the accidents:

1. Road Type
Road type is a major governing factor in the safety of the roads. The minimum and
maximum values of the scale available in the dataset were 2 and 6 respectively.
Most of the accidents were reported on road types 4-6.
2. Speed Limit
Speed limit is one of the major governing factors for road safety. Most of the
accidents in the dataset occurred on roads with a low speed limit. One possible
reason is that such roads are inherently riskier, so anyone driving at high speed
on them is more likely to have an accident.
3. Road Surface Condition
Road surface conditions also strongly affect the probability of accidents. Most of
the accidents were observed on medium-level road surfaces.
4. Accident Severity
The accident severity was high for most of the accidents in this dataset.
Clustering

Two clusters were found with a Log-Likelihood score of -33.48.
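
The report does not say how the log-likelihood score is computed. One common reading, assumed here, is the average per-instance log-likelihood of the data under a fitted mixture model. The sketch below evaluates that quantity for a one-dimensional Gaussian mixture. The component parameters and data points are hypothetical and unrelated to the Road Safety data.

```python
import math


def gaussian_pdf(x, mean, stdev):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2 * math.pi))


def mean_log_likelihood(data, components):
    """Average log-density of the data under a mixture of (weight, mean, stdev) components."""
    total = 0.0
    for x in data:
        density = sum(w * gaussian_pdf(x, m, s) for w, m, s in components)
        total += math.log(density)
    return total / len(data)


# Hypothetical data with two visible groups, and two candidate models.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
two_clusters = [(0.5, 1.0, 0.2), (0.5, 5.0, 0.2)]
one_cluster = [(1.0, 3.0, 2.0)]

ll_two = mean_log_likelihood(data, two_clusters)
ll_one = mean_log_likelihood(data, one_cluster)
print(ll_two, ll_one)
```

A higher (less negative) value means the model fits the data better, so scores are only comparable between models fitted to the same dataset.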

Results and Analysis:


Two clusters were found with a log-likelihood score of -33.48.

Studies have shown that 93 percent of healthcare organizations have experienced a data
breach. The rationale is simple: personal information is very valuable and profitable on
illegal marketplaces, and any breach would have serious implications. To avert security
threats, several organizations have begun using analytics to detect changes in network
traffic or other behaviour that indicates a cyber attack. Big data does, of course, have
inherent security problems, and many believe that adopting it makes organizations more
vulnerable. However, security improvements such as encryption technology, firewalls and
anti-virus software respond to the demand for more security.

Analytics can also help prevent fraud and inaccurate claims in a systematic, repeatable
way. It helps streamline insurance claims processing, allowing patients to obtain better
returns on their claims and caregivers to be reimbursed more quickly. Telemedicine has been
on the market for over 40 years, but it has only been able to flourish recently with the
emergence of online video conferencing, smartphones, Wi-Fi and wearables. The term refers to
the delivery of remote clinical services using technology. It is used for primary and initial
consultations, remote patient monitoring and the education of healthcare professionals.
Telesurgery is a more specific application: physicians can perform operations using robots
and fast real-time data delivery without being physically in the same place as the patient.
Clinicians use telemedicine to deliver tailored treatment plans and to avoid admission or
re-hospitalization. This use of health analytics is related to the predictive analytics
discussed earlier: it helps practitioners anticipate acute medical events in advance and
prevent patients' conditions from deteriorating. Telemedicine can reduce costs and improve
service quality by keeping patients away from hospitals. Patients avoid waiting, and
physicians do not spend their time on unnecessary consultations and paperwork. Telemedicine
also improves the availability of care, as patients' conditions can be monitored, and
patients consulted, anywhere and at any time.

Medical imaging is crucial, and over 600 million imaging procedures are conducted every year
in the United States. Analysing and storing these images manually costs both time and money,
as each image needs to be examined by a radiologist, while hospitals need to keep it for
several years. Medical imaging provider Carestream describes how big data analytics may
change the way images are read: algorithms built to analyse hundreds of thousands of images
may discover specific patterns in the pixels and convert them into a number that helps the
doctor with the diagnosis. They go even further, arguing that radiologists might no longer
need to look at the images themselves, but would instead examine the results of algorithms,
which would inevitably study and remember more images than a radiologist could in a
lifetime. This would undoubtedly affect the role, education and skills of radiologists.
Conclusions/Observation (Road Safety Dataset)
Roads of types 4-6 had more accidents. This may be due to the speed limits and
traffic volume on those roads. Roads with a low speed limit saw more accidents. The
reason could be complex geographical or maintenance conditions that caused the
speed limits to be lowered, so that anyone moving at high speed on those roads
suffered the consequences, although we cannot assert the latter statement due to a
lack of data. Medium-level surface conditions were associated with more accidents
than good or bad conditions. The reason may be that most roads have a medium-quality
surface, or that medium-quality surfaces tempt drivers into high speeds while
handling becomes difficult at turns or when a quick reaction is needed to avoid a
collision.

References
Chen, P. T., Lin, C. L., & Wu, W. N. (2020). Big data management in healthcare: Adoption
challenges and implications. International Journal of Information Management, 53, 102078.
Benhlima, L. (2018). Big data management for healthcare systems: architecture,
requirements, and implementation. Advances in bioinformatics, 2018.
Shakil, K. A., Zareen, F. J., Alam, M., & Jabin, S. (2020). BAMHealthCloud: A biometric
authentication and data management system for healthcare data in cloud. Journal of King
Saud University-Computer and Information Sciences, 32(1), 57-64.
Nazir, S., Khan, S., Khan, H. U., Ali, S., García-Magariño, I., Atan, R. B., & Nawaz, M. (2020). A
comprehensive analysis of healthcare big data management, analytics and scientific
programming. IEEE Access, 8, 95714-95733.
Parast, M. M., & Golmohammadi, D. (2019). Quality management in healthcare
organizations: Empirical evidence from the Baldrige data. International Journal of
Production Economics, 216, 133-144.
Source: https://www.cs.waikato.ac.nz/ml/weka/
