Improving Healthcare Services Using Source Anonymous Scheme With Privacy Preserving Distributed Healthcare Data Collection and Mining

Article in Computing · January 2021

DOI: 10.1007/s00607-020-00847-0
2 authors:
Nikunj Domadiya Udai Pratap Rao

Sardar Vallabhbhai National Institute of Technology Sardar Vallabhbhai National Institute of Technology Surat (Gujarat) India

Computing
https://doi.org/10.1007/s00607-020-00847-0
REGULAR PAPER
Improving healthcare services using source anonymous scheme with privacy preserving distributed healthcare data collection and mining
Nikunj Domadiya1 · Udai Pratap Rao1
Received: 12 April 2020 / Accepted: 28 September 2020

© Springer-Verlag GmbH Austria, part of Springer Nature 2020
Abstract
The use of data mining on healthcare data to improve medical services has grown
because electronic healthcare record (EHR) systems collect a massive amount of data
on a daily basis. Currently, each hospital maintains its own EHR system and stores
detailed information about its patients. Data mining for healthcare improvement
requires the data from all EHR systems, located at different sites, to be stored at
a central data mining server. Collecting healthcare data at an untrusted central data
mining server raises privacy threats: healthcare data contains patients’ private
information, and sharing this information for data mining creates privacy issues.
Most previous research focused either on the k-anonymity technique, which causes
information loss and decreases data mining accuracy, or on privacy preserving data
mining, which targets only a specific data mining technique. In this paper, we adopt
the source anonymous technique as the privacy preserving scheme and present a novel
scheme for healthcare data collection and mining. Our scheme collects data from all
EHR systems without any information loss and stores it at a single central data
mining server while preserving privacy. The central data mining server can analyze
the collected data with different data mining techniques (association rule mining,
classification, clustering, etc.) without the involvement of the EHR systems. Our
scheme is resilient against collusion between the central data mining server and EHR
systems. Theoretical and experimental analysis shows the efficiency of our scheme in
terms of computation and communication cost. Experimental results on the Heart
disease dataset show that EHR systems using the proposed approach gain in disease
prediction accuracy.
Nikunj Domadiya
domadiyanikunj002@gmail.com
Udai Pratap Rao
upr@coed.svnit.ac.in
1 Department of Computer Engineering, Sardar Vallabhbhai National Institute of Technology, Surat, India
Keywords Healthcare · Data Mining · Privacy · Source Anonymous · Privacy Preserving Data Mining · Healthcare Improvement
Mathematics Subject Classification 68P20 · 68P27 · 92C50
1 Introduction
In the digital world, data analysis has become an important research area in major
domains such as healthcare, banking, education and business, where it is used to
improve domain-specific services. Data mining techniques (e.g., association rule
mining, classification, clustering) are widely used to analyze data and extract
hidden patterns. Private and government organizations from the same or different
domains often collaborate to perform data mining on the combined data of all
collaborating organizations (or participants) and extract useful knowledge for mutual
benefit. Because the data is distributed among the collaborating participants, each
participant must share its individual data for data mining. While the resulting
patterns provide useful knowledge to a specific domain, the requirement to share
private or individual data raises privacy issues. Hence, privacy preserving
distributed data mining has become a prominent research area, and we propose a scheme
for distributed data mining that preserves privacy. In this paper, we explore the
healthcare domain as an application of the proposed scheme, but it can also be
applied in other domains.
The increasing adoption of electronic healthcare record (EHR) systems in major
healthcare centres generates a tremendous amount of information about patients [1].
The efficiency and quality of healthcare can be improved by mining this patient data.
Healthcare areas such as medicine prescription, patient relationship management,
treatment effectiveness and fraud identification can be enhanced with data mining
[2–6,13]. Data mining helps to discover the most effective action for certain
diseases by mining patient data such as symptoms, causes and treatments stored in EHR
systems [7,8]. This can help to reduce healthcare expenses and suggest a standard
treatment for particular diseases [9,10]. Hence, healthcare treatment and diagnosis
become faster and more efficient. Data mining also helps to discover outlier cases of
fraudulent medical claims by physicians, clinics, labs, etc. [11]. In 2015, $29
billion (9.6%) of Texas Medicaid expenditure was classified as improper. Using data
mining, the Texas Medicaid Fraud and Abuse Detection System flagged suspects for
investigation and recovered millions in stolen funds [12].
Fig. 1 Example of horizontally and vertically partitioned healthcare data [13]

Table 1 Healthcare privacy laws in some countries [20]
Law                                                                      Country
DISHA (The Digital Information Security in Healthcare Act)              India
HIPAA (Health Insurance Portability and Accountability Act)             USA
PIPEDA (Personal Information Protection and Electronic Documents Act)   Canada
DPA (Data Protection Act)                                               UK
HIPC (Health Information Privacy Code)                                  New Zealand

Currently, physicians in a hospital can access records only from their own EHR
system, and the healthcare data available for data mining is likewise limited to a
single EHR system. An individual EHR system might not be capable of handling the data
mining task and holds data on a limited number of patients. For better data mining
accuracy, data must be aggregated from different EHR systems and stored at a central
data mining server or cloud with a high-end configuration. Data among organizations
is either horizontally or vertically partitioned, as shown in Fig. 1. In horizontally
partitioned data, the schema of the different organizations (participants) is the
same, but each organization stores information about separate entities. In vertically
partitioned data, each organization has a different schema but stores information
about the same entities.
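The two partitioning styles can be illustrated with a small sketch. The records, attribute names and helper functions below are invented for illustration and are not taken from the paper or from Fig. 1.

```python
# Toy patient records; attribute names and values are invented for illustration.
hospital_a = [
    {"patient_id": 1, "age": 54, "chest_pain": "typical", "diagnosis": 1},
    {"patient_id": 2, "age": 61, "chest_pain": "atypical", "diagnosis": 0},
]
hospital_b = [
    {"patient_id": 3, "age": 47, "chest_pain": "none", "diagnosis": 0},
]

def merge_horizontal(*parts):
    """Same schema, different entities: the union of rows rebuilds the dataset."""
    schema = set(parts[0][0])
    assert all(set(row) == schema for part in parts for row in part)
    return [row for part in parts for row in part]

# Vertical partitioning: same entities, different attributes per organization.
clinic = [{"patient_id": 1, "age": 54}, {"patient_id": 2, "age": 61}]
lab = [{"patient_id": 1, "cholesterol": 239}, {"patient_id": 2, "cholesterol": 180}]

def merge_vertical(left, right, key="patient_id"):
    """Different schema, same entities: join rows on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

horizontal = merge_horizontal(hospital_a, hospital_b)
vertical = merge_vertical(clinic, lab)
```

In the horizontal case the aggregated dataset is simply the concatenation of the participants' rows, which is the setting assumed for EHR systems in the rest of this paper.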
Advanced data mining on aggregated healthcare data can help to improve patient care.
This requires all EHR systems (or participant systems) interested in the data mining
task to share their patients’ data [14,15]. Sharing patients’ data, however, violates
healthcare privacy acts [16,17]. The privacy of patient information stored in local
EHR systems must be maintained, as disclosure of personal health information may
cause personal, social or economic harm to patients [18,19]. Many countries now treat
this as a serious issue and have created laws and policies for healthcare data
privacy. The healthcare data privacy protection laws of some countries are shown in
Table 1 [20]. Hence, privacy issues must be addressed while mining distributed
healthcare data.
Advancement in healthcare through data mining requires aggregating data from
different EHR systems while ensuring privacy preservation [21]. Therefore, healthcare
research is focused on privacy preserving distributed data collection and mining.
1.1 Literature review
Existing solutions to this problem fall into two major categories: Privacy Preserving
Data Publishing (PPDP) using k-anonymity, and Privacy Preserving Data Mining (PPDM)
without data publishing. The k-anonymity technique modifies the original data and
converts it into a k-anonymous form [17] by generalizing the values of selected
attributes to preserve privacy. Existing solutions using k-anonymity are discussed in
[17,22–29]. The generalized data can be made publicly available for data mining
without violating privacy. However, the accuracy of data mining decreases because the
algorithms run on modified (generalized) data. Healthcare decisions for critical
diseases (heart attack, cancer, etc.) require high accuracy, so the original data
must be used for data mining. Therefore, the k-anonymity technique cannot be applied
to aggregated data for mining critical healthcare data.
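The information loss caused by generalization can be seen in a minimal sketch. The records, the choice of quasi-identifiers and the generalization rules (decade ranges for age, a four-digit prefix for the postal code) are assumptions made for this example, not the method of any cited work.

```python
from collections import Counter

# Invented records: (age, zip_code, diagnosis). Age and zip are quasi-identifiers.
records = [
    (34, "395007", "heart disease"),
    (36, "395001", "healthy"),
    (52, "395010", "heart disease"),
    (58, "395017", "healthy"),
]

def generalize(record):
    """Coarsen quasi-identifiers: age to a decade range, zip to a 4-digit prefix."""
    age, zip_code, diagnosis = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:4] + "**", diagnosis)

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter((age, zip_code) for age, zip_code, _ in rows)
    return all(count >= k for count in groups.values())

generalized = [generalize(r) for r in records]
```

The generalized rows satisfy 2-anonymity, but a classifier trained on them only sees an age range instead of the exact age, which is the accuracy loss discussed above.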
In the other category, Privacy Preserving Data Mining (PPDM) works without publishing
data, using cryptographic techniques [30–32].
Murat et al. [33] proposed Privacy Preserving Distributed Association Rule Mining
(PPDARM) on horizontally partitioned data. This design has been analyzed for
different privacy vulnerabilities, and techniques other than random number addition
have been proposed. Any external attacker or collaborating participant can trace the
information received and sent by a targeted participant. By computing the difference
of these values, the attacker can learn the private value of the targeted
participant. Hence, the private information of any participant might be disclosed,
which in turn violates the privacy requirements.
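The difference attack described above can be sketched as follows. This is a simplified ring-based sum masked by a single random number, written only to illustrate the vulnerability; it is not the exact protocol of [33], and the modulus and function names are assumptions.

```python
import secrets

MODULUS = 2**32  # illustrative bound on the running sum

def ring_secure_sum(local_counts):
    """Return the global sum and the messages sent along the ring."""
    mask = secrets.randbelow(MODULUS)           # initiator's random number
    running = (mask + local_counts[0]) % MODULUS
    sent = [running]                            # sent[i] = message leaving site i
    for count in local_counts[1:]:
        running = (running + count) % MODULUS
        sent.append(running)
    total = (running - mask) % MODULUS          # initiator removes its mask
    return total, sent

def traced_difference(sent, i):
    """What an eavesdropper learns about site i (i >= 1) from two traced messages."""
    return (sent[i] - sent[i - 1]) % MODULUS

counts = [120, 45, 300]
total, sent = ring_secure_sum(counts)
```

The global sum is computed correctly, yet anyone who observes the message a site receives and the message it sends recovers that site's local count, regardless of the random mask.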
M. Hussein et al. [34] proposed an approach for PPDARM on horizontally partitioned
data with a different design for communication among the participants. They used
public-key cryptography instead of the random number addition scheme. However,
collusion between the initiator and the combiner causes privacy vulnerabilities.
Nirali et al. [35] presented a PPDARM approach on horizontally partitioned data using
Shamir’s secret sharing technique [36]. They compute the global count of all
candidate itemsets, which causes higher communication and computation cost. A
Paillier homomorphic encryption-based algorithm for privacy preserving association
rule mining is discussed in [37]. However, the computation cost of these algorithms
is very high.
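How a global count can be obtained from secret shares is sketched below. It is a generic (n, n) Shamir-style additive aggregation over a prime field, given under our own assumptions (field size, evaluation points 1..n, function names); it is not the algorithm of [35].

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime; field size chosen for illustration

def make_shares(secret, n):
    """Split `secret` into n shares with an (n, n) threshold (degree n-1 polynomial)."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(n - 1)]

    def poly(x):
        return sum(c * pow(x, e, PRIME) for e, c in enumerate(coeffs)) % PRIME

    return [poly(x) for x in range(1, n + 1)]   # share for participant x

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the points x = 1..n."""
    n = len(shares)
    secret = 0
    for i in range(1, n + 1):
        num, den = 1, 1
        for j in range(1, n + 1):
            if j != i:
                num = num * (-j) % PRIME
                den = den * (i - j) % PRIME
        secret = (secret + shares[i - 1] * num * pow(den, -1, PRIME)) % PRIME
    return secret

def global_count(local_counts):
    """Each site shares its count; sites add the shares they hold; the sum is rebuilt."""
    n = len(local_counts)
    all_shares = [make_shares(c, n) for c in local_counts]
    summed = [sum(s[x] for s in all_shares) % PRIME for x in range(n)]
    return reconstruct(summed)
```

Every site must send one share to every other site for each candidate itemset, which is where the high communication cost noted above comes from.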
Chahar et al. [38] proposed a scheme for PPDARM on horizontally partitioned data in
an insecure environment. They used efficient communication sequences among the
participants to reduce the communication overhead. In this scheme, one of the
participants, called the miner, can trace all communication among the participants
during the execution of the algorithm because the communication environment among the
collaborating participants is insecure. Hence, the miner can extract the original
data (the local support count of all itemsets) of any participant, which violates
privacy. Therefore, this approach fails to preserve privacy in an insecure
communication environment among the collaborating participants.
Y. Jin et al. [39] proposed an efficient and secure association rule mining scheme
based on the FP-Tree in a distributed environment. Their solution merges the FP-Trees
of all collaborating participants at a host computer. They used Paillier’s
homomorphic encryption and the DES encryption scheme to preserve the privacy of
itemset counts, which requires both symmetric and asymmetric key distribution among
the participants. They assumed that the host computer, which combines the FP-Trees
from all participants, is trusted and does not collude with other participants.
Because of the symmetric DES encryption scheme, the FP-Trees of all participants are
disclosed to the host computer. The host computer merges the FP-Trees using a
depth-first search to get transactions from one FP-Tree and then insert them into
another FP-Tree. Hence, this approach requires a trusted host computer (Trusted Third
Party), and it is not secure against collusion between the collaborating participants
and the host computer.
Vaidya et al. [40] proposed a privacy preserving approach for association rule mining
on vertically partitioned data. The large number of random vectors that must be
generated and the size of the matrices incur high computation and communication cost.
Jung et al. [41] proposed privacy preserving computation of the dot product without a
secure channel. Participants are organized in a circle and exchange encrypted data
with the next and previous participants in the circle. This approach requires at
least three participants. It is also vulnerable to some collusion attacks presented
in [42]. The solution presented for these types of attacks requires that the number
of exchanges be higher than the number of colluding participants. This solution does
not work in a real scenario, as a participant never gets to know the number of
colluding users. Privacy preserving association rule mining on vertically partitioned
data is also discussed in [40,43–48].
Yi Huang [49] presented the concept of mining association rules on medical
examination data and outpatient medical records to discover the correlation between
diseases and abnormal test results in the medical data of a regional Taiwan hospital.
The issue of privacy during the integration of medical examination data and
outpatient medical records was not considered in this approach. The privacy of
patients must be preserved during association rule mining because of laws and
regulations. We focus on designing an efficient approach for PPARM on vertically
partitioned data. Our proposed approach preserves the privacy of both the
participants and gives accurate results.
In most existing PPDM approaches, all participant systems must take part in the data
mining task. In most cases, the approach is designed for a single data mining
technique (association rule mining (ARM), classification, clustering, etc.) with
specific attributes (MST and MCT values in ARM) [49]. Hence, all collaborating
participant systems must have data mining servers. Any new analysis with a change in
attribute values or techniques forces re-execution of the algorithm with all local
EHR systems or requires modification of the existing algorithm. These observations
show the need for a new approach that collects data from local EHR systems at a
central data mining server or cloud while preserving privacy, and that works without
data mining servers at the local EHR systems. The central data mining server holds
the data of all local EHR systems, which allows medical researchers and physicians to
apply any data mining technique.
Table 2 illustrates the strengths and weaknesses of some existing PPARM
approaches for distributed data.
In this paper, we focus on designing a novel scheme that helps medical researchers
and physicians to analyze healthcare patterns, predict diseases and improve
healthcare services for better patient care. We propose a source anonymous privacy
preserving distributed healthcare data collection and mining scheme. It preserves the
privacy of each EHR system (participant) while aggregating the transaction data at a
central data mining server (or cloud) for advanced healthcare data mining. We
consider the data of all EHR systems to be horizontally partitioned, as most EHR
systems have the same schema. The improvement in healthcare services using the
proposed scheme is discussed with the heart disease dataset (accessible from the UCI
repository). It shows that the accuracy of disease prediction improves with the
proposed scheme compared to the accuracy achieved by each EHR system individually.

Table 2 Review of existing PPARM techniques on distributed data

Murat K. et al. [33]
  Strengths: This approach preserves privacy without a trusted third party server. It
  works in the semi-honest model. The support count of a candidate itemset is
  calculated using simple random number addition by the initiator site to preserve
  privacy. As a result, a lower computation cost is incurred at each site.
  Weaknesses: Any external attacker or collaborating participant can trace the
  information received and sent by a targeted participant. By computing the
  difference of these values, the attacker can learn the private value of the
  targeted participant. Hence, the private information of any participant might be
  disclosed, which in turn violates the privacy requirements.

Nirali et al. [35]
  Strengths: The authors proposed a scheme based on Shamir’s secret sharing technique
  in the semi-honest model. This scheme preserves privacy in the presence of
  collusion among the participants using the (n,n) threshold Shamir’s secret scheme.
  Weaknesses: This scheme has a higher communication cost as it computes the global
  support counts of all candidate itemsets. It also requires a secure communication
  environment among all the collaborating participants.

M. Hussein et al. [34]
  Strengths: They used the Paillier homomorphic encryption scheme instead of the
  random number addition scheme of Murat K. et al. [33] in order to preserve privacy
  against external and internal attackers.
  Weaknesses: This scheme fails to preserve privacy in an insecure communication
  environment, as the initiator site can trace all communication between the combiner
  and normal participants. The initiator can decrypt the traced data to find the
  support count of a targeted site, which fails to preserve privacy.

H. Chahar et al. [38]
  Strengths: This scheme preserves privacy in an insecure communication environment
  using an elliptic-curve-based Paillier cryptosystem.
  Weaknesses: The initiator site can trace all communication between the combiner and
  normal participants. The initiator can decrypt the traced data to find the support
  count of a targeted site, which fails to preserve privacy.

Y. Jin et al. [39]
  Strengths: Their solution merges the FP-Trees of all collaborating participants at
  the host computer. They used Paillier’s homomorphic encryption and the DES
  encryption scheme for privacy preservation of itemset counts. It requires both
  symmetric and asymmetric key distribution among the participants.
  Weaknesses: It requires a trusted host computer. Because of the symmetric DES
  encryption scheme, the FP-Trees of all participants are disclosed to the host
  computer. The host computer merges all FP-Trees using a depth-first search, which
  causes a higher computation cost.

Yi Huang [49]
  Strengths: This scheme explores ARM on VPHD using medical examination data of
  patients and outpatient medical records of regional Taiwan hospitals.
  Weaknesses: The privacy issue during ARM on VPHD is not considered.

Vaidya et al. [40]
  Strengths: Preserves privacy with ARM on vertically partitioned data.
  Weaknesses: The large number of random vectors that must be generated and the size
  of the matrices incur higher computation and communication cost.
1.2 Our contribution
Our contributions in this research are as follows:
1. We investigate the demand for data mining in healthcare and analyze why accurate
data mining results are necessary for the prediction of critical diseases and the
improvement of patient care.
2. We propose a source anonymous privacy preserving distributed healthcare data
collection and mining scheme, which provides an accurate healthcare data mining
platform to medical researchers and physicians.
3. Theoretical and practical analysis shows that the proposed scheme is source
anonymous, efficient in terms of computation, and preserves the privacy of EHR
systems.
4. The improvement for EHR systems using the proposed scheme is discussed with the
heart disease dataset (accessible from the UCI repository). It shows that the
accuracy of healthcare predictions improves with the proposed scheme compared to the
accuracy of each EHR system.
The structure of this paper is as follows: Section 2 presents the system and security
models. Section 3 discusses the preliminaries and the proposed scheme. Theoretical
and practical analysis of the proposed scheme is discussed in Sect. 4. Section 5
shows the advantage to EHR systems of using the proposed scheme. Finally, Sect. 6
concludes the paper.
2 System and security models
2.1 System model
Our system model involves three entities: (1) n participants for collaborative data
mining, (2) a trusted authority (TA) and (3) a central data mining server or cloud.
The n participants in our model are interested in sharing their private data with the
central data mining server. In return, they can access the data mining patterns
discovered from the aggregated data and use them to improve their services. The
central data mining server collects data from the n participants and performs data
mining techniques, including association rule mining, clustering and classification,
for research in a specific domain. Figure 2 shows the system model of our scheme. In
this figure, we use the healthcare domain as an example application of our proposed
approach, with EHR systems as participants.
Fig. 2 System model of our scheme
Details of each entity are as follows:
– Participants: In most cases, participants are systems holding private data of a
specific domain, for example healthcare providers with EHR systems, insurance
providers, banks, or retail shop branches with customer transaction data. They take
part in data mining to discover results for mutual benefit. All participants are
interested in sharing their data with the central data mining server and accessing
the data mining patterns of the aggregated data. With n participants in the system,
in the case of healthcare providers with EHR systems, the participants are named
$EHR_1, EHR_2, EHR_3, \ldots, EHR_n$. They collaborate by sharing a single record in
each time period, for instance every t seconds. Communication among participants is
not required.
– Trusted Authority (TA): The TA is responsible for initializing the system, which
includes registering the participants and the central data mining server. It
generates and distributes the keys, and reveals and revokes malicious participants.
The TA goes offline once the initialization phase is completed and comes back online
only in case of abnormal participant behaviour.
– Central Data Mining Server (or Data Mining Cloud): It stores the data received
periodically from participants. It supports researchers and data mining experts by
performing data mining tasks on the collected aggregated data to discover new
patterns for mutual benefit. It shares the data mining results with all participants
so that they can improve their services.
2.2 Security model
Our main focus is on the participants’ privacy, as the data shared with the central
data mining server may contain sensitive information. No adversary should be able to
link data with its owner, so that the privacy of the participants and the central
data mining server is protected. If malicious data is detected, the TA reveals the
real identity of the malicious participant and revokes that participant from the
system.
– Trusted Authority (TA): We assume that the TA is fully trusted and cannot be
compromised. The communication environment between participants and the TA is assumed
to be secure or can be protected by cryptographic tools.
– Participant: A participant is considered honest but curious: it follows the
protocol but is curious to collect other participants’ data. Some malicious
participants may collude with each other to retrieve the secret key or data of
another participant, which in turn violates the privacy of that participant. Another
issue is malicious participants providing incorrect data to the central data mining
server, which leads to wrong data mining patterns. Our scheme helps to trace and
identify the malicious participant when misleading data is found at the central data
mining server.
– Central Data Mining Server: The central data mining server is considered
semi-honest: it honestly follows the protocol and collects the data from
participants, but it may try to collude with some participants to identify the secret
key of another participant and link data with that participant’s system.
2.3 Design goals
Our main goal is to preserve the privacy of all participants. We also consider
collusion attacks by internal participants or other malicious users. Our scheme aims
to achieve the following goals.
1. Protecting participant’s privacy: During the collection of sensitive data at the
central data mining server, the scheme protects the original data against tampering
and eavesdropping. No participant’s ciphertext can be decrypted by other systems. The
central data mining server detects tampered data and sends a re-transmission request
to the particular participant.
2. Protecting privacy of collected data: Anyone, including participants and malicious
users outside the system, can trace all data packets exchanged among participants. If
they could recover the original data, it would create serious privacy issues. Hence,
our scheme also focuses on preventing the recovery of original data by any
participant or outside malicious user.
3. Computation accuracy and efficiency: The central data mining server collects the
original data and performs different data mining techniques accurately. Participants
can efficiently encrypt the data and compute its signature. The central data mining
server can efficiently verify all participants’ signatures and retrieve the original
data accurately.
In the rest of the paper, we focus on the healthcare domain as an application of the
proposed approach. Hence, we consider healthcare EHR systems as the participants in
the system and security models discussed above.
Fig. 3 An example of the k-source anonymous scheme with 3 users’ original data = {12, 13, 9} and sequence = {2, 1, 3}
3 Proposed source anonymous scheme for privacy preserving distributed healthcare data collection and mining
3.1 Preliminary
3.1.1 k-source anonymous
Consider a group of k participants, each of which shares its data with a central data
mining server. The central data mining server receives all the data, but it cannot
link any data item to its owner, because every participant’s data appears in a
dataset of k elements in random order. In this case, the participants’ data achieves
k-source anonymity [50]. If all the data collected at the central data mining server
achieves k-source anonymity, we say that our scheme is k-source anonymous.
A data collection scheme is k-source anonymous if
$$P(\ldots, u_i(d_i), \ldots, u_j(d_j)) \overset{c}{\equiv} P(\ldots, u_i(d_j), \ldots, u_j(d_i))$$
is satisfied, where $k$ is the number of participants in a group $U$, $u_i$ and $u_j$
are two participants in $U$, and $\{d_1, \ldots, d_k\} \in \{M\}^k$ is a data
collection sample. $M$ denotes the message space, $P(\ldots, u_i(x_i), \ldots)$ is
the central data mining server’s view with $x_i$ ($x_i \in \{d_1, \ldots, d_k\}$) as
$u_i$’s input ($i = 1, 2, 3, \ldots, k$), and $\overset{c}{\equiv}$ denotes
computational indistinguishability of two sets of random variables.
Figure 3 illustrates the aggregation of original data using the source anonymous
technique with 3 participants. In this example, the bit string that each participant
generates and the central data mining server receives has three parts, one for each
participant. In general, a bit string consists of n parts when n participants take
part in the protocol. Each participant fills one of the three parts with its original
value, according to the sequence number it received, and fills the other two parts
with dummy data. For example, participant-1, with sequence number 2, places its
original data in the second part of the bit string and fills the remaining two parts
with dummy strings whose bits are all 0. The central data mining server receives
three messages in encrypted form from the respective participants. After decryption,
the central data mining server XORs the messages and breaks the resulting bit string
into three equal parts to obtain the original data. As shown in Fig. 3, the central
data mining server computes the bit string 1101 1100 1001 and learns the original
data 1101, 1100 and 1001. However, the central data mining server cannot link the
original data to the 3 participants. Hence, this scheme is known as k-source
anonymous with k participants.
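The slot-based aggregation of Fig. 3 can be reproduced with a short sketch. It covers only the plaintext step (slot filling, XOR and splitting) and leaves out encryption and signatures; the function names and the 4-bit slot width are assumptions for this example.

```python
from functools import reduce

def build_slot_string(value, seq, k, width):
    """Place `value` in slot `seq` (1-based) of a k-slot bit string; other slots are 0."""
    assert 0 <= value < 2**width and 1 <= seq <= k
    return value << (width * (k - seq))

def aggregate(slot_strings, k, width):
    """XOR all strings, then cut the result into k equal slots (left to right)."""
    combined = reduce(lambda a, b: a ^ b, slot_strings, 0)
    mask = (1 << width) - 1
    return [(combined >> (width * (k - s))) & mask for s in range(1, k + 1)]

data = [12, 13, 9]      # original data of participants 1..3 (Fig. 3)
sequence = [2, 1, 3]    # slot assigned to each participant (Fig. 3)
strings = [build_slot_string(d, s, k=3, width=4) for d, s in zip(data, sequence)]
recovered = aggregate(strings, k=3, width=4)
```

With the data and sequence of Fig. 3, the recovered slots are 1101, 1100 and 1001 (13, 12 and 9), ordered by sequence number rather than by participant, so the combined string alone does not reveal which participant supplied which value.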
In this paper, we design a novel scheme for data collection and mining with an
untrusted central data mining server, which collects horizontally partitioned data
from different EHR systems in a k-source anonymous manner.
3.1.2 Aggregate signature
Let $G_1$ and $G_2$ be two cyclic groups of prime order $q$ with generators $g_1$ and
$g_2$, respectively, and let $G_T$ be a group such that $|G_T| = |G_1| = |G_2|$. A
bilinear map $\hat{e} : G_1 \times G_2 \to G_T$ has the following properties.
– Bilinear: For all $u \in G_1$, $v \in G_2$, and $a, b \in \mathbb{Z}$,
$\hat{e}(u^a, v^b) = \hat{e}(u, v)^{ab}$.
– Non-degenerate: $\hat{e}(g_1, g_2) \neq 1$.
The aggregate signature scheme [51] uses a full-domain hash function
$H : \{0,1\}^* \to G_1$ and consists of the following five phases.
Key Generation: A participant’s public and private keys are $v \in G_2$ and
$x \in \mathbb{Z}_q$, respectively, where $x$ is selected at random by the
participant as $x \xleftarrow{R} \mathbb{Z}_q$. The public key is computed as
$v \leftarrow g_2^x$.
Signing: A participant with private key $x$ and a message $M \in \{0,1\}^*$ computes
$h \leftarrow H(M)$, where $h \in G_1$, and $\sigma \leftarrow h^x$. The
participant’s signature is $\sigma \in G_1$.
Verification: Given a participant’s public key $v$, a message $M$ and a signature
$\sigma$, compute $h \leftarrow H(M)$. The signature is accepted only if
$\hat{e}(\sigma, g_2) = \hat{e}(h, v)$.
Aggregate Signature: Each participant has an index $i$, where $i \in [1, k]$ and
$k = |U|$. Each participant $u_i \in U$ provides a signature $\sigma_i \in G_1$ on a
message $M_i \in \{0,1\}^*$. The aggregate signature $\sigma \in G_1$ is computed as
$\sigma \leftarrow \prod_{i=1}^{k} \sigma_i$.
Aggregate Signature Verification: The aggregate signature is valid only if
$\hat{e}(\sigma, g_2) = \prod_{i=1}^{k} \hat{e}(h_i, v_i)$, where
$h_i \leftarrow H(M_i)$ for $1 \le i \le k$ and $v_i$ is the public key of
participant $u_i$.
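The two verification conditions follow directly from bilinearity. The short derivation below is added for clarity; it writes $x_i$ for the private key of participant $u_i$, which is our notation and is not introduced in the text above.

```latex
\begin{align*}
\hat{e}(\sigma, g_2)
  &= \hat{e}(h^{x}, g_2) = \hat{e}(h, g_2)^{x} = \hat{e}(h, g_2^{x}) = \hat{e}(h, v), \\
\hat{e}\Big(\prod_{i=1}^{k} \sigma_i,\; g_2\Big)
  &= \prod_{i=1}^{k} \hat{e}(h_i^{x_i}, g_2)
   = \prod_{i=1}^{k} \hat{e}(h_i, g_2^{x_i})
   = \prod_{i=1}^{k} \hat{e}(h_i, v_i).
\end{align*}
```

The first line is the single-signature check; the second uses the fact that a bilinear map is multiplicative in its first argument, so one pairing on the aggregate $\sigma$ replaces $k$ pairings on the individual signatures.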
3.2 Proposed scheme
Our proposed scheme starts with a system initialization phase. In this phase, the TA
generates each participant’s private/public keys for authentication and encryption,
and the private key of the central data mining server for decryption. Before data is
collected from the n EHR systems, the time period t is first confirmed with all EHR
systems. All EHR systems convert their transactions into binary format, encrypt one
transaction in every time period and sign the encrypted transaction. Each EHR system
then sends it to the central data mining server. With horizontally partitioned data,
EHR systems that have fewer transactions send a pre-defined value alongside the other
EHR systems, which helps decryption at the central data mining server. The central
data mining server aggregates all EHR systems’ signatures and authenticates them. If
all signatures are valid, the original transactions can be recovered, but they cannot
be linked to the identity of any EHR system. The recovered transactions are formatted
according to the horizontal partitioning, as shown in Fig. 4.
Fig. 4 Final transaction format at central data mining server
Our scheme works in three phases. The Initialization phase distributes keys to all
EHR systems and the central data mining server. The Encryption and Sign phase
encrypts the EHR systems’ transactions and computes the signature of each encrypted
transaction. The Authenticate and Decryption phase authenticates the data received at
the central data mining server, decrypts it to recover the original transactions, and
creates the final transactions according to the horizontally partitioned
distribution, as shown in Fig. 4.
The basic encryption and decryption operations are discussed here with the following equation:
$$s_1 \oplus s_2 \oplus s_3 \oplus \cdots \oplus s_n = s_1 \oplus s_2 \oplus s_3 \oplus \cdots \oplus s_n \qquad (1)$$
Next, $s_k$ ($k \in [1, n]$) is used as the key of a pseudo-random hash function h:
$$h_{s_1}(t) \oplus h_{s_2}(t) \oplus h_{s_3}(t) \oplus \cdots \oplus h_{s_n}(t) = h_{s_1}(t) \oplus h_{s_2}(t) \oplus h_{s_3}(t) \oplus \cdots \oplus h_{s_n}(t) \qquad (2)$$
The left side of Eq. 1 is assigned to the EHR systems and the right side is assigned to the central data mining server. All of them use the same t in the hash function h and compute $h_{s_k}(t)$ as the pad for encryption and decryption of transactions; thus the central data mining server can eliminate all pads with its keys.
However, while the EHR systems collectively calculate n hash functions, the central data mining server has to calculate n hash functions by itself. Therefore, we reduce the computation at the central data mining server by moving some elements from the right side to the left side of Eq. 1:
$$s_1 \oplus s_2 \oplus s_3 \oplus \cdots \oplus s_n \oplus s_1 \oplus s_2 \oplus s_3 \oplus \cdots \oplus s_{n-q} = s_{n-q+1} \oplus \cdots \oplus s_n \qquad (3)$$
Table 3 Notations
Notation  Description
$n$  Number of EHR systems
$\dot{S}_i$  $EHR_i$ additive private key set
$\hat{S}_i$  $EHR_i$ subtractive private key set
$\dot{S}$  Universal additive private key set
$\hat{S}$  Universal subtractive private key set
$S_i$  $EHR_i$ encryption key set, $S_i = \hat{S}_i \cup \dot{S}_i$
$S_a$  Central data mining server's decryption key set
$c$  Size of $\dot{S}_i$
$q$  Size of $S_a$
$d^i$  $EHR_i$ transaction data
$l$  Length of a transaction at an EHR system, in bits
$H$  Hash function $H : \{0,1\}^* \to G_1$
$h_s(x)$  Hash function indexed by $s$ in the pseudo-random function family $H_{l,\,m+\log_2 n,\,l} = \{h_s : \{0,1\}^{m+\log_2 n} \to \{0,1\}^l\}_{s \in \{0,1\}^l}$
$pk_i$  Public signing key of $EHR_i$
$sk_i$  Private signing key of $EHR_i$
As a result, the central data mining server holds fewer private keys and calculates fewer hash functions, while each EHR system has to calculate somewhat more hash functions. The notations used in the scheme are listed in Table 3.
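The pad cancellation behind Eqs. 1 to 3 can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes HMAC-SHA512 truncated to l = 80 bits as the pseudo-random function $h_s(t)$, and the names `h`, `xor_pads`, `ehr_side` and `server_side` are hypothetical. The point it shows is that a key held twice on the EHR side cancels itself, so only the pads of the q keys kept by the server remain.

```python
import hashlib
import hmac
import secrets
from functools import reduce

L_BYTES = 10  # l = 80 bits; illustrative choice


def h(s: bytes, t: int) -> bytes:
    """h_s(t): HMAC-SHA512 keyed by s, truncated to l bits (truncation is an assumption)."""
    return hmac.new(s, t.to_bytes(8, "big"), hashlib.sha512).digest()[:L_BYTES]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def xor_pads(keys, t: int) -> bytes:
    """XOR of h_s(t) over every key in `keys`; a key listed twice cancels itself."""
    return reduce(xor, (h(s, t) for s in keys), bytes(L_BYTES))


n, q, t = 6, 2, 1
keys = [secrets.token_bytes(L_BYTES) for _ in range(n)]  # s_1 .. s_n

# Eq. 2: the EHR side and the server side both hold s_1 .. s_n.
residue_eq2 = xor(xor_pads(keys, t), xor_pads(keys, t))

# Eq. 3: s_1 .. s_{n-q} move to the EHR side; the server keeps only the last q keys.
ehr_side = keys + keys[: n - q]
server_side = keys[n - q:]
residue_eq3 = xor(xor_pads(ehr_side, t), xor_pads(server_side, t))

server_hash_calls = len(server_side)  # q evaluations instead of n
```

In both cases the residue is the all-zero string, which is what allows the server to strip the pads while evaluating only q hash functions.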
Initialization phase: The TA computes the bilinear parameters $(q, G_1, G_2, G_T, \hat{e}, g_1, g_2)$ and selects two hash functions: $H : \{0,1\}^* \to G_1$ and $h_s(x)$, which is a hash function indexed by $s$ in the pseudo-random function family $H_{l,\,m+\log_2 n,\,l} = \{h_s : \{0,1\}^{m+\log_2 n} \to \{0,1\}^l\}_{s \in \{0,1\}^l}$.
Next, the TA randomly generates $S = \{s_1, s_2, s_3, \ldots, s_{nc}\}$, with each $s_k \in \{0,1\}^l$, as the private keys for the n EHR systems. These keys are distributed as follows:
– All private keys are randomly distributed among the n participants as n disjoint subsets $\dot{S}_1, \dot{S}_2, \dot{S}_3, \ldots, \dot{S}_n$. Here, $\dot{S}_i$ indicates the private additive key set of EHR i, where $|\dot{S}_i| = c$, and $\dot{S} = \bigcup_{i=1}^{n} \dot{S}_i$ is the universal additive key set.
– The TA randomly selects q private keys from S to form a subset $S_a$. The remaining $nc - q$ private keys are divided into n random disjoint sets $\hat{S}_i$, $i = 1, 2, 3, \ldots, n$, which are the subtractive private key sets. The universal subtractive key set is $\hat{S} = \bigcup_{i=1}^{n} \hat{S}_i$, where $\dot{S} = \hat{S} \cup S_a$.
– $S_i = \hat{S}_i \cup \dot{S}_i$, for $i = 1, 2, 3, \ldots, n$, is sent to $EHR_i$ as its encryption key set, and $S_a$ is sent to the central data mining server as its decryption key set.
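The key distribution above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TA's actual procedure: `distribute_keys` is a hypothetical helper, keys are drawn with Python's `secrets` module, and the subtractive keys are dealt round-robin so the n subsets differ in size by at most one when $nc - q$ is not a multiple of n.

```python
import random
import secrets


def distribute_keys(n: int, c: int, q: int, key_bytes: int = 10):
    """Return (additive sets, subtractive sets, server set S_a) for n EHR systems."""
    rng = random.SystemRandom()
    universe = set()
    while len(universe) < n * c:            # n*c distinct l-bit keys
        universe.add(secrets.token_bytes(key_bytes))
    keys = list(universe)

    # Additive sets: n disjoint subsets of size c that cover every key.
    rng.shuffle(keys)
    additive = [set(keys[i * c:(i + 1) * c]) for i in range(n)]

    # Server set S_a: q keys drawn from the whole universe.
    server = set(rng.sample(keys, q))

    # Subtractive sets: the remaining n*c - q keys, dealt into n disjoint subsets.
    rest = [s for s in keys if s not in server]
    rng.shuffle(rest)
    subtractive = [set(rest[i::n]) for i in range(n)]
    return additive, subtractive, server


additive, subtractive, server = distribute_keys(n=5, c=3, q=4)
```

One design point worth noting: because the scheme combines pads with exclusive-or, a key that lands in both the additive and the subtractive subset of the same EHR system has to contribute twice (and so cancel). An implementation would therefore keep each $S_i$ as a list or multiset rather than a deduplicated set.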
This phase also generates signing keys for the participants. For each participant $i$ ($1 \le i \le n$), the TA generates the participant ID $Uid_i \in \{0,1\}^{\log_2 n}$, the private signing key $sk_i = x_i \xleftarrow{R} \mathbb{Z}$ and the public key $pk_i = g_2^{x_i}$.
Encryption and Sign phase: For each time period t, a sequence number $sq(i) \in [1, n]$ is generated for participant i, where $1 \le i \le n$. $\{sq(i)\}_{i=1,2,3,\ldots,n}$ is a permutation of $\{1, 2, 3, \ldots, n\}$, which specifies the order of the participants' data. Each participant i encrypts its transaction according to $sq(i)$. The central data mining server cannot recover the owner of the i-th transaction because $\{sq(i)\}_{i=1,2,3,\ldots,n}$ is not known to it. For strong security, the sequence number is changed randomly. In our approach, the sequence number is generated by the TA. In any single iteration, EHR system i has l bits of transaction data to share, and it sends $n \cdot l$ cipher bits to the central data mining server to hide its own l bits. Hence, each EHR system has to generate $n \cdot l - l$ extra bits.
Algorithm 1 Encryption and Sign of transaction at EHR system
INPUT:
For each EHR system i (i = 1, 2, 3, ..., n), its pseudo-ID $Uid_i$, private encryption key set $S_i$ and private signing key $sk_i = x_i$.
Each EHR system has transaction data $d^i \in \{0,1\}^l$ in the time period t.
The symbol $a|b$ indicates the concatenation of a and b, and $\oplus_{s \in S_i} h_s(x)$ indicates the exclusive-or of all results of function h for $s \in S_i$.
OUTPUT:
EHR system i outputs $e^i$ and $\sigma^i$
Begin
1. Generate n random l-bit strings $k_j^i$ ($j = 1, 2, 3, \ldots, n$) using
$k_j^i = \oplus_{s \in S_i} h_s(t|j)$
2. Transaction data $d^i$ is encrypted as:
$e^i = (\{0\}^l \oplus k_1^i) \,|\, (\{0\}^l \oplus k_2^i) \,|\, \cdots \,|\, (d^i \oplus k_{sq(i)}^i) \,|\, \cdots \,|\, (\{0\}^l \oplus k_n^i)$
3. The signature is computed as:
$\sigma^i = H(Uid_i \,|\, e^i)^{x_i} \in G_1$
4. Send the final $e^i$ and $\sigma^i$ to the central data mining server
End
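To give a sense of the per-period overhead, the expression $n \cdot l - l$ can be evaluated for the parameter values used later in the collusion analysis ($n = 100$ EHR systems, $l = 80$ bits). This is only a worked instance of the formula, not a measured figure.

```latex
\begin{align*}
n \cdot l &= 100 \cdot 80 = 8000 \text{ bits sent per EHR system per time period},\\
n \cdot l - l &= 8000 - 80 = 7920 \text{ extra bits} \;(= 990 \text{ bytes}).
\end{align*}
```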
As shown in Algorithm 1, each EHR system uses its encryption key set as the private keys of $h_s(x)$ to generate $n \cdot l$ bits as $k_1^i, k_2^i, k_3^i, \ldots, k_n^i$. Here, all $k_j^i$ are different because they are computed using $(t|j)$. The original data of participant i is $d^i$. We encrypt $d^i$ using $k_{sq(i)}^i$ instead of $k_i^i$. The remaining zero strings are encrypted using $k_j^i$ ($j \ne sq(i)$). In total, n encrypted l-bit strings are computed as $\{0\}^l \oplus k_1^i, \{0\}^l \oplus k_2^i, \ldots, d^i \oplus k_{sq(i)}^i, \ldots, \{0\}^l \oplus k_n^i$. Then $e^i$ is computed by concatenating these l-bit strings into a single string of $n \cdot l$ bits. An EHR system that does not have data to share chooses $d^i$ as all 0.
Finally, the signature $\sigma^i$ is computed. Each EHR system i executes Algorithm 1 and sends $e^i$ and $\sigma^i$ to the central data mining server with its $Uid_i$.
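Steps 1 and 2 of Algorithm 1 can be sketched in Python as below. The sketch makes several assumptions that are not fixed by the text: HMAC-SHA512 truncated to l bits stands in for $h_s$, the input $t|j$ is encoded as an 8-byte t followed by a 4-byte j, and the helper names `pad` and `encrypt_transaction` as well as the toy keys are hypothetical. The pairing-based signature of step 3 is omitted because it needs a pairing library.

```python
import hashlib
import hmac


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pad(keys, t: int, j: int, l_bytes: int) -> bytes:
    """k_j^i: XOR of h_s(t | j) over all keys s held by the EHR system."""
    msg = t.to_bytes(8, "big") + j.to_bytes(4, "big")  # encoding of t | j is an assumption
    out = bytes(l_bytes)
    for s in keys:
        out = xor(out, hmac.new(s, msg, hashlib.sha512).digest()[:l_bytes])
    return out


def encrypt_transaction(keys, t: int, n: int, slot: int, data: bytes) -> bytes:
    """Place `data` in position `slot` (= sq(i)) and zero strings elsewhere, then pad."""
    l_bytes = len(data)  # must not exceed 64 bytes, the HMAC-SHA512 output size
    zero = bytes(l_bytes)
    parts = []
    for j in range(1, n + 1):
        plain = data if j == slot else zero
        parts.append(xor(plain, pad(keys, t, j, l_bytes)))
    return b"".join(parts)


demo_keys = [b"key-one-00", b"key-two-00", b"key-three0"]  # toy keys, illustration only
demo_data = bytes.fromhex("0123456789abcdef0123")           # one 80-bit transaction
cipher = encrypt_transaction(demo_keys, t=1, n=4, slot=3, data=demo_data)
```

The resulting `cipher` is $n \cdot l$ bits long; without the pads, an observer cannot tell which of the n positions carries the real transaction.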
Authenticate and Decryption phase: The central data mining server executes Algorithm 2 to authenticate the data received from the n EHR systems and decrypt it. The central data mining server uses the public keys of the EHR systems to authenticate the received data. For any tampered signature, Algorithm 2 returns -1. In this case, the central data mining server discards the invalid data and sends a retransmission request. Otherwise, all EHR systems' transactions can be recovered with the central data mining server's private key.
Algorithm 2 Authenticate and Decryption at Central Data Mining Server
INPUT:
For each EHR system i (i = 1, 2, 3, ..., n), its pseudo-ID $Uid_i$, public signing key $pk_i = g_2^{x_i}$ and the computed values $e^i$ and $\sigma^i$.
The central data mining server's decryption keys and the time period t are required.
OUTPUT:
Aggregate transactional data at the central data mining server, built from the horizontally or vertically partitioned transactional data of all EHR systems.
Begin
1. The central data mining server aggregates all received signatures and authenticates them as:
$\hat{e}\left(\prod_{i=1}^{n} \sigma^i,\, g_2\right) \stackrel{?}{=} \prod_{i=1}^{n} \hat{e}\left(H(Uid_i \,|\, e^i),\, pk_i\right)$
2. The algorithm returns -1 if the above check is not satisfied; otherwise it computes:
$c_j = \oplus_{s \in S_a} h_s(t|j)$ for $j = 1, 2, 3, \ldots, n$
$C = c_1 \,|\, c_2 \,|\, c_3 \,|\, \cdots \,|\, c_n$
3. The central data mining server recovers the original transactions of all EHR systems as:
$T = e^1 \oplus e^2 \oplus e^3 \oplus \cdots \oplus e^n \oplus C$
4. For horizontally partitioned data, the central data mining server divides T into n parts of l bits each, and each part becomes a single transaction. The central data mining server arranges these n transactions sequentially into the final aggregate transactional data, as shown in Fig. 4.
5. After collecting all transactions, the central data mining server executes the specified data mining technique (association rule mining (ARM), classification, clustering) with the required input attributes (e.g. minimum support and confidence in ARM).
6. The result of the data mining technique on the aggregate healthcare transactions is shared with all EHR systems to improve the healthcare system and patient care.
End
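The central data mining server performs exclusive-OR on all cipher texts to decrypt the EHR systems' data. Let $e^a = e^1 \oplus e^2 \oplus e^3 \oplus \cdots \oplus e^n$. $e^a$ consists of n l-bit strings as $e^a = e_1^a \,|\, e_2^a \,|\, e_3^a \,|\, \cdots \,|\, e_n^a$. The cipher text of $d^i$ is $e_{sq(i)}^a$:
$$\begin{aligned}
e_{sq(i)}^a &= k_{sq(i)}^1 \oplus k_{sq(i)}^2 \oplus \cdots \oplus k_{sq(i)}^n \oplus d^i \\
&= \left(\oplus_{s \in \hat{S}}\, h_s(t|sq(i))\right) \oplus \left(\oplus_{s \in \dot{S}}\, h_s(t|sq(i))\right) \oplus d^i \\
&= \left(\oplus_{s \in S_a}\, h_s(t|sq(i))\right) \oplus d^i = c_{sq(i)} \oplus d^i
\end{aligned}$$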
Hence, the central data mining server first computes C, as shown in step 2 of Algorithm 2, to decrypt the cipher text. Step 3 recovers the original transactions T, which can be represented as $T = \{m[1, l], m[l+1, 2l], \ldots, m[(n-1)l+1, nl]\}$, where each $m[x, y]$ is an EHR system's original transaction. For horizontally partitioned data, the central data mining server divides T into n transactions of l bits each, as $T_1 = m[1, l], T_2 = m[l+1, 2l], \ldots, T_n = m[(n-1)l+1, nl]$, and stores each as a separate transaction. However, the central data mining server cannot link a transaction with its EHR system, because $sq(i)$ is not known to it.
For any abnormal transaction value at the central data mining server, for example $m[x, y]$, the server can ask the TA to identify the malicious EHR system. The central data mining server sends x, y, $e^i$ and $\sigma^i$ to the TA. The TA knows $sq(i)$, which helps it to identify the malicious user and report it.
4 Theoretical and practical analysis of proposed scheme
4.1 Our scheme is source anonymous
k-source anonymity is discussed in Section 3.1.1. Here, we want to prove that if we interchange any two EHR systems' transactions in the same interval, neither an adversary nor the central data mining server can recognize the change. Consider n EHR systems $\{EHR_1, EHR_2, EHR_3, \ldots, EHR_n\}$ and their corresponding transaction data $D = (d^1, d^2, d^3, \ldots, d^n) \in M^n$, where $M = \{0,1\}^l$. Each EHR system $EHR_i$ executes Algorithm 1 to encrypt the transaction data $d^i$ and sends the encrypted data $e^i$ to the central data mining server. The length of $e^i$ is $n \cdot l$ bits, as shown in step 2 of Algorithm 1. We can represent $e^i$ as a sequence of n strings of l bits as follows:
$$e^i = e_1^i \,|\, e_2^i \,|\, e_3^i \,|\, \cdots \,|\, e_n^i \qquad (4)$$
where $e_j^i = \{0\}^l \oplus k_j^i$ for $j = 1, 2, 3, \ldots, n$, $j \ne sq(i)$, and $e_j^i = d^i \oplus k_j^i$ for $j = sq(i)$. For a particular period of time, after collecting all encrypted transactions, the central data mining server aggregates the encrypted transactions as follows:
$$A = \left(e_1^1 \,|\, e_2^1 \,|\, \cdots \,|\, e_n^1,\;\; e_1^2 \,|\, e_2^2 \,|\, \cdots \,|\, e_n^2,\;\; \ldots,\;\; e_1^n \,|\, e_2^n \,|\, \cdots \,|\, e_n^n\right) \qquad (5)$$
Let two EHR systems, $EHR_i$ and $EHR_j$ with $1 \le i \le j \le n$, switch their data. Hence the original transaction data of all EHR systems becomes
$$D' = \left(d^1, \ldots, d^{i-1}, d^j, d^{i+1}, \ldots, d^{j-1}, d^i, d^{j+1}, \ldots, d^n\right).$$
Therefore, the aggregated encrypted data at the central data mining server changes as follows:
$$\begin{aligned}
A' = \big(\, & e_1^1 \,|\, \cdots \,|\, e_{i-1}^1 \,|\, e_j^1 \,|\, e_{i+1}^1 \,|\, \cdots \,|\, e_{j-1}^1 \,|\, e_i^1 \,|\, e_{j+1}^1 \,|\, \cdots \,|\, e_n^1, \\
& e_1^2 \,|\, \cdots \,|\, e_{i-1}^2 \,|\, e_j^2 \,|\, e_{i+1}^2 \,|\, \cdots \,|\, e_{j-1}^2 \,|\, e_i^2 \,|\, e_{j+1}^2 \,|\, \cdots \,|\, e_n^2, \;\ldots, \\
& e_1^n \,|\, \cdots \,|\, e_{i-1}^n \,|\, e_j^n \,|\, e_{i+1}^n \,|\, \cdots \,|\, e_{j-1}^n \,|\, e_i^n \,|\, e_{j+1}^n \,|\, \cdots \,|\, e_n^n \,\big)
\end{aligned} \qquad (6)$$
To prove the source anonymous property, we have to prove
$$A \stackrel{c}{\equiv} A' \qquad (7)$$
for any value of i and j, where $1 \le i \le j \le n$. To prove the above relation, we use a pseudo-random permutation function $g_p : [1, n] \to [1, n]$ to compute the permutation of $[1, n]$ as $[g_p(1), g_p(2), g_p(3), \ldots, g_p(n)]$. The permutation of the original transactional data D can be represented as
$$D'' = \{d^{g_p(1)}, d^{g_p(2)}, d^{g_p(3)}, \ldots, d^{g_p(n)}\} \qquad (8)$$
Executing Algorithm 1 on the permuted transactional data $D''$ generates the encrypted data as follows:
$$A'' = \left(e_1^{g_p(1)} \,|\, e_2^{g_p(1)} \,|\, \cdots \,|\, e_n^{g_p(1)},\;\; e_1^{g_p(2)} \,|\, e_2^{g_p(2)} \,|\, \cdots \,|\, e_n^{g_p(2)},\;\; \ldots,\;\; e_1^{g_p(n)} \,|\, e_2^{g_p(n)} \,|\, \cdots \,|\, e_n^{g_p(n)}\right) \qquad (9)$$
Since D and its pseudo-random permutation $D''$ cannot be distinguished in polynomial time, $A$ and $A''$ are computationally indistinguishable. Hence, the following equation holds:
$$A \stackrel{c}{\equiv} A'' \qquad (10)$$
In the same way, we can derive the relationship between $A'$ and $A''$ as
$$A' \stackrel{c}{\equiv} A'' \qquad (11)$$
From Eq. 10 and Eq. 11, we can prove the relation between $A$ and $A'$ as
$$A \stackrel{c}{\equiv} A' \qquad (12)$$
4.1.1 Collusion resilient scheme
The security against collusion attacks depends on the size of the central data mining server's decryption key set and of the EHR systems' encryption key sets. The security level can be increased by increasing the size of $S_a$, i.e. q, and the size of $\dot{S}_i$, i.e. c. The level of security also increases with the number of EHR systems. In our approach, malicious participants may collude to recover the central data mining server's decryption keys, or collude with the central data mining server to reveal a target EHR system's encryption keys. Let $P_e$ denote the probability that the targeted EHR system's encryption keys are computed successfully in a single attempt, $P_c$ the probability that the central data mining server's decryption keys are computed in a single attempt, and $\gamma$ the proportion of malicious EHR systems that collude with the central data mining server. As discussed in [52], we can write
$$P_e \le \frac{1}{\binom{(1-\gamma)nc}{c} \cdot \binom{(1-\gamma)\,n\,(nc-q)/n}{(nc-q)/n}} \qquad (13)$$
The value of $P_e$ decreases with increasing n and c, which indicates better security of the EHR systems, but it increases with increasing $\gamma$.
$$P_c \le \frac{1}{\binom{(1-\gamma)nc}{q}} \qquad (14)$$
The value of $P_c$ decreases with increasing q, which indicates better security of the central data mining server. Suppose 10% of the EHR systems in the collaboration are malicious and the required security level is $P_e \le 2^{-l}$ and $P_c \le 2^{-l}$. When there are 100 EHR systems with transaction length l = 80 bits, q and c can be set to 13 and 7, which means that the central data mining server has 13 decryption keys and every EHR system has 14 encryption keys. Here, the security conditions $P_e \le 2^{-l}$ and $P_c \le 2^{-l}$ are satisfied. If the number of EHR systems changes, q and c can be adjusted to prevent collusion attacks.
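The parameter choice in this example can be checked numerically. The sketch below evaluates the bounds of Eqs. 13 and 14 in bits (the negative base-2 logarithm of each bound) using Python's exact integer binomials. It assumes that $(nc - q)/n$ and the other non-integer quantities are rounded to the nearest integer, which the text does not specify; `security_bits` is a hypothetical helper name.

```python
from math import comb, log2


def security_bits(n: int, c: int, q: int, gamma: float):
    """Return (-log2 of the Eq. 13 bound, -log2 of the Eq. 14 bound)."""
    honest_additive = round((1 - gamma) * n * c)            # (1 - gamma) n c
    sub_per_ehr = round((n * c - q) / n)                    # (nc - q)/n, rounded (assumption)
    honest_subtractive = round((1 - gamma) * (n * c - q))   # (1 - gamma) n (nc - q)/n
    pe_bits = log2(comb(honest_additive, c)) + log2(comb(honest_subtractive, sub_per_ehr))
    pc_bits = log2(comb(honest_additive, q))
    return pe_bits, pc_bits


pe_bits, pc_bits = security_bits(n=100, c=7, q=13, gamma=0.1)
print(f"P_e <= 2^-{pe_bits:.1f}, P_c <= 2^-{pc_bits:.1f}")
```

With n = 100, c = 7, q = 13 and 10% colluding systems, both values come out above 80 bits, consistent with the conditions $P_e \le 2^{-80}$ and $P_c \le 2^{-80}$ stated above. Re-running the function with a different n gives a quick way to re-tune q and c.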
4.1.2 Authenticity and integrity
Our scheme uses the aggregate signature to achieve authentication and integrity of the EHR systems' data. An adversary can break authentication and integrity if and only if it can tamper with the aggregate signature. The aggregate signature discussed in Section 3.1.2 is proved secure in the chosen-key security model [51]. An adversary has only a negligible probability of generating a valid signature. Therefore, tampering with the data communicated between an EHR system and the central data mining server can be detected at the central data mining server using the authentication step shown in Algorithm 2. Hence, data integrity and authentication are achieved.
4.2 Experimental analysis
We implemented our scheme in NetBeans on a system with an Intel Core i3 2.1 GHz CPU and 4 GB RAM. In our scheme, the major cost at each EHR system is encrypting and signing the transactional data, while the central data mining server bears the major cost of authentication and decryption. We used HMAC-SHA512 as the pseudo-random function, as discussed in [53]. We used the heart disease dataset publicly available in the UCI repository [54] for the experimental analysis. We analyzed the computation cost of the proposed approach and of the existing approach of Y. Jin et al. [39] with 4, 6, 8 and 10 collaborating EHR systems.
In the proposed approach, each EHR system executes Algorithm 1; hence the computation cost of an EHR system depends on the cost of Algorithm 1. The central data mining server authenticates the data received from all EHR systems and performs the decryption to recover the original transactional data of all EHR systems using Algorithm 2. Figure 5 shows the computation and communication cost of the proposed and existing approaches with 4, 6, 8 and 10 EHR systems. It shows that the proposed approach is efficient compared to the existing approach in terms of computation cost. The computation cost of the approach of Y. Jin et al. [39] is higher due to the complex operation of merging the FP-trees from all EHR systems.
Fig. 5 Analysis of computation cost of proposed approach and existing approach [39]
Table 4 Prediction confidence/accuracy (%) at each EHR system and central data mining server
Association rule  Confidence (%): EHR1  EHR2  EHR3  EHR4  Central data mining server
{Slope = flat ∩ Type of chest pain = asymptomatic ∩ Exercise induced angina = yes ∩ Restecg = normal ∩ Sex = female} → {Class = sick}  90%  85%  86%  92%  99%
{Exercise induced angina = no ∩ Number of vessels colored = 0 ∩ Sex = female} → {Class = healthy}  92%  88%  85%  83%  98%
{Thal = reversable defect ∩ Slope = flat ∩ Type of chest pain = asymptomatic} → {Class = sick}  89%  88%  89%  93%  96%
Exercise induced angina = yes ∩ Type of chest pain = asymptomatic} → {Class = sick}
{Thal = normal ∩ Number of vessels colored = 0 ∩ Slope = up ∩ Sex = male} → {Class = healthy}  85%  81%  75%  90%  93%
{Exercise induced angina = yes ∩ Restecg = hypertrophy ∩ Type of chest pain = asymptomatic ∩ Sex = male} → {Class = sick}  79%  80%  89%  87%  93%
Fasting blood sugar = no ∩ Type of chest pain = asymptomatic ∩ Exercise induced angina = yes ∩ Sex = male} → {Class = sick}
5 Advantage to EHR systems using proposed scheme
EHR systems can take advantage of the proposed scheme to improve patient care and healthcare using the global data mining results from the central data mining server. Next, we discuss one example in which four EHR systems predict heart disease from patients' attributes using association rule mining. The heart disease dataset contains a total of 76 different attributes with 303 patient records. A total of 14 attributes associated with heart disease are used in this experiment. Details of each attribute are available at [55]. Association rules with Class = Healthy or Sick on the right-hand side are selected, and the confidence values of these rules at each EHR system and at the central data mining server are shown in Table 4.
The confidence of an association rule indicates the prediction accuracy for heart disease. As shown in Table 4, heart disease prediction at each individual EHR system has lower confidence than the central data mining server results. Hence, the proposed scheme benefits every participating EHR system, as well as physicians and medical researchers, by giving access to global results that help them improve healthcare services. It also allows physicians to apply any data mining technique with selected parameters (for example, the confidence value in association rule mining) to discover patterns relating diseases to patients' attributes.
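For readers less familiar with the measure, the confidence of a rule X → Y is the fraction of records containing X that also contain Y. The sketch below computes it per site and on pooled records. The records are invented toy data, not taken from the UCI heart disease dataset, and the function and variable names are hypothetical; the numbers say nothing about the results in Table 4.

```python
def confidence(records, antecedent: dict, consequent: dict):
    """Fraction of records matching `antecedent` that also match `consequent`."""
    matching = [r for r in records if antecedent.items() <= r.items()]
    if not matching:
        return None  # rule never fires on these records
    hits = [r for r in matching if consequent.items() <= r.items()]
    return len(hits) / len(matching)


# Invented toy records (NOT from the UCI dataset), two hypothetical EHR sites.
site_a = [
    {"thal": "normal", "slope": "up", "class": "healthy"},
    {"thal": "normal", "slope": "up", "class": "sick"},
    {"thal": "reversable defect", "slope": "flat", "class": "sick"},
]
site_b = [
    {"thal": "normal", "slope": "up", "class": "healthy"},
    {"thal": "normal", "slope": "up", "class": "healthy"},
    {"thal": "normal", "slope": "flat", "class": "sick"},
]
rule_if = {"thal": "normal", "slope": "up"}
rule_then = {"class": "healthy"}

local_a = confidence(site_a, rule_if, rule_then)
local_b = confidence(site_b, rule_if, rule_then)
pooled = confidence(site_a + site_b, rule_if, rule_then)
```

The pooled value is computed over all matching records from both sites, which is the quantity the central data mining server can evaluate after collection and a single site cannot.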
6 Conclusion
In this paper, we proposed a novel scheme for improving healthcare services by aggregating all EHR systems' data at one central data mining server while preserving privacy through k-source anonymity. The central data mining server can execute any data mining technique selected by medical researchers or physicians on the aggregated healthcare data to discover accurate healthcare patterns. Our scheme is also collusion resilient: it preserves privacy in case of collusion among malicious EHR systems and the central data mining server. EHR systems without data mining power can join the central data mining server and take advantage of the global healthcare data mining results by sharing their own data. Performance analysis shows that our scheme is efficient at the EHR systems and at the central data mining server. Dynamic joining and leaving of EHR systems have not been discussed in this paper. In future work, the proposed approach can be extended to support dynamic joining and leaving of EHR systems.
References
1. Tang PC, McDonald CJ (2006) Electronic health record systems. Biomed Inform 10(4):447
2. Shin AM, Lee IH, Lee GH, Park HJ, Park HS, Yoon KI, Lee JJ, Kim YN (2010) Diagnostic analysis
of patients with essential hypertension using association rule mining. Healthc Inform Res 16(2):77
3. Palaniappan S, Awang R (2013) Intelligent heart disease prediction system using data mining tech-
niques. Int J Healthc Biomed Res 1:94
4. Ordonez C (2006) Association rule discovery with the train and test approach for heart disease predic-
tion. IEEE Trans Inf Technol Biomed 10(2):334
5. Sung SF, Lee PJ, Hsieh CY, Zheng WL (2020) Medication use and the risk of newly diagnosed diabetes
in patients with epilepsy: a data mining application on a healthcare database. J Organ End User Comput
(JOEUC) 32(2):93
6. Pramanik PKD, Pareek G, Nayyar A (2019) Security and Privacy in Remote Healthcare: Issues, Solutions, and Standards. In: Telemedicine Technologies. Academic Press, pp 201–225. https://doi.org/10.1016/B978-0-12-816948-3.00014-3
7. Luukka P, Lampinen J (2010) A classification method based on principal component analysis and
differential evolution algorithm applied for prediction diagnosis from clinical emr heart data sets. In:
Computational intelligence in optimization, Springer, pp 263–283
8. Hossain ME, Khan A, Moni MA, Uddin S (2019) Use of electronic health data for disease prediction:
A comprehensive literature review. IEEE/ACM transactions on computational biology and bioinfor-
matics, pp 1–20. https://doi.org/10.1109/TCBB.2019.2937862
9. Karabatak M, Ince MC (2009) An expert system for detection of breast cancer based on association
rules and neural network. Expert syst Appl 36(2):3465
10. Pramanik PKD, Nayyar A, Pareek G (2019) WBAN: Driving e-healthcare Beyond Telemedicine to
Remote Health Monitoring: Architecture and Protocols. In: Telemedicine Technologies. Academic
Press, pp 89–119. https://doi.org/10.1016/B978-0-12-816948-3.00007-6
11. Verma A, Taneja A, Arora A (2017) Fraud detection and frequent pattern matching in insurance claims using data mining techniques. In: 2017 Tenth international conference on contemporary computing (IC3), IEEE, pp 1–7
12. Waldman D (2017) The Saga of 1115–A Waiver Can Fix Texas Medicaid, But Only Temporarily. Texas Public Policy Foundation. [Online] Available: https://files.texaspolicy.com/uploads/2018/08/16103444/2017-02-RR03-Sagaof1115Medicaid-CHCP-DeaneWaldman.pdf. Accessed 25 Jan 2020
13. Domadiya N, Rao UP (2018) Privacy-preserving association rule mining for horizontally partitioned
healthcare data: a case study on the heart diseases. Sādhanā 43(8):127
14. Payne TH, Lovis C, Gutteridge C, Pagliari C, Natarajan S, Yong C, Zhao LP (2019) Status of health
information exchange: a comparison of six countries. J Global Health 9(2):0204279
15. Rolnick J (2013) Aggregate health data in the United States: Steps toward a public good. Health Inf J
19(2):137
16. Clifton C, Kantarcioglu M, Vaidya J (2002) Defining privacy for data mining. In: National Science
Foundation Workshop on Next Generation Data Mining, vol. 1, pp 199–204
17. Sweeney L (2013) Matching Known Patients to Health Records in Washington State Data (June 5,
2013) pp 1–13. Available at SSRN: https://ssrn.com/abstract=2289850
18. Troncoso C, Payer M, Hubaux JP, Salathe M, Larus J, Bugnion E, Lueks W, Stadler T, Pyrgelis
A, Antonioli D, et al. (2020) Decentralized privacy-preserving proximity tracing. arXiv preprint
arXiv:2005.12273
19. Vora J, Nayyar A, Tanwar S, Tyagi S, Kumar N, Obaidat MS, Rodrigues JJ (2018) BHEEM: A
blockchain-based framework for securing electronic health records. In: 2018 IEEE Globecom Work-
shops (GC Wkshps), IEEE, pp 1–6
20. Health law by countries. [Online] Available: https://www.who.int/health-laws/countries/en/,
[Accessed: 19-April-2019]
21. Xu J, Wei L, Wu W, Wang A, Zhang Y, Zhou F (2020) Privacy-preserving data integrity verification by
using lightweight streaming authenticated data structures for healthcare cyber-physical system. Future
Gener Comput Syst 108:1287
22. Jayabalan M, Rana ME (2018) Anonymizing healthcare records: a study of privacy preserving data
publishing techniques. Adv Sci Lett 24(3):1694
23. Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: a survey of recent
developments. ACM Comput Surv 42(4):1
24. Liu J (2012) Privacy preserving data publishing: current status and new directions. Inf Technol J 11(1):1
25. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) l-diversity: Privacy beyond
k-anonymity. ACM Trans Knowl Discov Data (TKDD) 1(1):3
26. Khokhar RH, Chen R, Fung BC, Lui SM (2014) Quantifying the costs and benefits of privacy-preserving
health data publishing. J Biomed Inform 50:107
27. Gkoulalas-Divanis A, Loukides G, Sun J (2014) Publishing data from electronic health records while
preserving privacy: a survey of algorithms. J Biomed Inform 50:4
28. Yang JJ, Li JQ, Niu Y (2015) A hybrid solution for privacy preserving medical data sharing in the
cloud environment. Future Gener Comput Syst 43:74
29. El Emam K, Rodgers S, Malin B (2015) Anonymising and sharing individual patient data. Br Med J
350:11
30. Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: Proceedings of the 2000 ACM
SIGMOD international conference on management of data, pp 439–450
31. Scardapane S, Altilio R, Ciccarelli V, Uncini A, Panella M (2018) Privacy-preserving data mining for
distributed medical scenarios. In: Multidisciplinary approaches to neural computing, Springer, NY, pp
119–128
32. Wang F, Li XL, Wang JT, Ng SK (2017) Guest editorial: special section on biological data mining and
its applications in healthcare. IEEE/ACM Trans Comput Biol Bioinform 14(3):501
33. Kantarcioglu M, Clifton C (2004) Privacy preserving distributed mining of association rules on hori-
zontally partitioned data. IEEE Trans Knowl Data Eng 16(9):1026
34. Hussein M, El-Sisi A, Ismail N (2008) Fast Cryptographic Privacy Preserving Association Rules
Mining on Distributed Homogeneous Data Base. Knowledge-Based Intelligent Information and Engi-
neering Systems vol 5178, pp 607–616. https://doi.org/10.1007/978-3-540-85565-1_75
35. Nanavati NR, Lalwani P, Jinwala DC (2014) Analysis and evaluation of schemes for secure sum in
collaborative frequent itemset mining across horizontally partitioned data. J Eng 2014:110
36. Shamir A (1979) How to share a secret. Commun ACM 22(11):612
37. Nguyen XC, Le HB, Cao TA (2014) An enhanced scheme for privacy-preserving association rules
mining on horizontally distributed databases. In: Proceedings of IEEE international conference on
computing and communication technologies, research, innovation, and vision for the future (RIVF),
IEEE, pp 1–4
38. Chahar H, Keshavamurthy BN, Modi C (2017) Privacy-preserving distributed mining of association
rules using Elliptic-curve cryptosystem and Shamir’s secret sharing scheme. Sādhanā 42(12):1997
39. Jin Y, Su C, Ruan N, Jia W (2016) Privacy-preserving mining of association rules for horizontally
distributed databases based on FP-tree. In: Proceedings of international conference on information
security practice and experience, Springer, pp 300–314
40. Vaidya JS (2004) Privacy preserving data mining over vertically partitioned data. Ph.D. thesis, West
Lafayette, IN, USA
41. Jung T, Li XY, Wan M (2015) Collusion-tolerable privacy-preserving sum and product calculation
without secure channel. IEEE Trans Dependable Secure Comput 12(1):45
42. Datta A, Joye M (2016) Cryptanalysis of a privacy-preserving aggregation protocol. IEEE Trans
Dependable Secure Comput 82:23
43. Kargupta H, Das K, Liu K (2007) Multi-party, privacy-preserving distributed data mining using a game
theoretic framework. In: Knowledge discovery in databases: PKDD 2007, Springer, NY, pp 523–531
44. Nanavati NR, Jinwala DC (2015) A novel privacy-preserving scheme for collaborative frequent itemset mining across vertically partitioned data. Secur Commun Netw 8(18):4407. https://doi.org/10.1002/sec.1377
45. Xu Z, Yi X (2011) Classification of privacy-preserving distributed data mining protocols. In: Proceed-
ings of sixth international conference on digital information management, IEEE, pp 337–342
46. Alwatban IS, Emam AZ (2014) Comprehensive Survey on Privacy Preserving Association Rule Min-
ing: Models, Approaches, Techniques and Algorithms. Int J Artif Intell Tools 23(05):145
47. Zhu XM (2013) Research on privacy preserving data mining association rules protocol. In: Advanced Materials Research, Vol 756. Trans Tech Publications Ltd, pp 1661–1664. https://doi.org/10.4028/www.scientific.net/AMR.756-759.1661
48. Domadiya N, Rao UP (2019) Privacy preserving distributed association rule mining approach on
vertically partitioned healthcare data. Procedia Comput Sci 148:303
49. Huang YC (2013) Mining association rules between abnormal health examination results and outpatient
medical records. Health Inf Manag J 42(2):23
50. Ni J, Zhang K, Lin X, Shen XS (2017) Securing fog computing for internet of things applications:
Challenges and solutions. IEEE Commun Surv Tutor 20(1):601
51. Boneh D, Mironov I, Shoup V (2003) A secure signature scheme from bilinear maps. In: Cryptographers
Track at the RSA Conference, Springer, pp 98–110
52. Li Q, Cao G, La Porta TF (2014) Efficient and privacy-aware data aggregation in mobile sensing. IEEE
Trans Dependable Secure Comput 11(2):115
53. Castelluccia C, Chan AC, Mykletun E, Tsudik G (2009) Efficient and provably secure aggregation of
encrypted data in wireless sensor networks. ACM Trans Sensor Netw (TOSN) 5(3):20
54. Heart disease dataset. [Online] Available: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleve.mod, [Accessed: 28-May-2019]
55. Cleveland heart disease data details. [Online] Available: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names, [Accessed: 28-May-2019]
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.