
Privacy-Preserving Record Linkage Using Bloom Filter

Shahidul Islam Khan, Associate Professor, Department of Computer Science & Engineering, International Islamic University, Chittagong. Email: nayeemkh@gmail.com
Maikel Sutradhar, Department of Computer Science & Engineering, International Islamic University, Chittagong. Email: michaelsutradhar@gmail.com
Emon Chowdhury, Department of Computer Science & Engineering, International Islamic University, Chittagong. Email: emonchy35@gmail.com

Abstract — In this modern era, preserving the privacy of user information is becoming a crucial issue. When two databases are linked, it is very important to keep individual user information secure, because the records contain sensitive information about users. In the current situation in Bangladesh, it is especially important that individual data remain protected when different organizations link their databases. We used a technique for privacy-preserving record linkage called the Bloom Filter, a space-saving probabilistic data structure that helps to identify an individual user uniquely while keeping the user's information secure. Record linkage has various challenges, including scalability to large databases, accurate matching, classification, privacy and confidentiality. By using this privacy technique we can secure each user's records and still identify them from those records, so that an attacker cannot access the information when the databases are linked for various purposes. The objective of this paper is to address the security of user information in an efficient way when that information is linked.

Keywords— record linkage, bloom filter, privacy preserving, confidentiality, classification, scalability.

1. INTRODUCTION

In today's world, privacy is a great concern for individual users, who are connected to different technology; the most concerning issue is how to protect the security of a user when different organizations or individuals collect data from different data sources. Many of these records are about people or their related information. Examples include financial data such as shopping transactions, telecommunication records and electronic health records. In all of these fields the data are sensitive and equally important. When we link different data sources, we need to check the quality of the data and at the same time integrate the data in a secure way, so that users' data remain safe from attackers and internal interference. It is a crucial issue to identify a user uniquely across dissimilar data sources. Record linkage is one of the substantial areas of data integration: it is the task of finding records that refer to the same entity across diverse data sources (for example, data files, books, websites and databases). When we carry out the record linkage task, highly sensitive user information may leak, which is very harmful for the organization as well as for the users. For this reason, various processes and protocols have been developed to protect individual privacy and to maintain the security of the data.

Our work is to secure the record linkage task while keeping the user information safe. For this purpose we used a privacy-preserving technique called the bloom filter. Using this technique we are able to ensure the security of user information and to check for the same entity across multiple datasets.



2. BACKGROUND AND RELATED WORK

Privacy in record linkage was first explored in the mid-1990s, and since then privacy-preserving record linkage (PPRL) has been a vital research area. A diversity of procedures have been created to allow linkage using encoded or encrypted values of quasi-identifying attributes across two or more databases. Record linkage is a broadly used data pre-processing and data cleaning task whose aim is to link and integrate records that refer to the same entity from two or more different databases. The record pairs (when linking two databases) or record sets (when linking more than two databases) are compared and classified as 'matches' by a linkage model if they are expected to refer to the same entity, or as 'non-matches' if they are assumed to refer to different entities. The frequent absence of unique entity identifiers across the databases to be linked makes it impossible to use a straightforward SQL join, and therefore linkage requires advanced comparisons between a set of quasi-identifiers (QIDs, such as names and addresses) that are commonly available within the records to be linked. However, these QIDs frequently contain personal data, so revealing or exchanging them for linkage is not acceptable due to security and confidentiality concerns.

Sean M. Randall, Anna M. Ferrante, James H. Boyd, Jacqueline K. Bauer and James B. Semmens adopted the bloom filter method for privacy-preserving record linkage [1] developed by Schnell et al. [2]. The method appears robust and well developed, with a number of papers investigating its security [3] and proposing additions to its method [4]. The use of bloom filters was evaluated to determine its suitability for conducting large-scale privacy-preserving record linkage. Two datasets, comprising in total over 26 million records, were linked using this method, with results compared to the linkage of unencrypted data. A probabilistic linkage framework was adopted to allow large-scale linkage to occur. They adopted the SHA-1 hash function used by Schnell et al. [2] and calculated approximately 700,000 hashes per second on their hardware, which equates to a run time of over 50 days to complete a linkage of their smaller file. Of the two datasets, the first, named WA, contains 6,772,949 records, and the NSW dataset contains 19,874,083 records. For WA, unencrypted linkage gives precision 0.999, recall 0.981 and f-measure 0.990, while encrypted linkage using bigrams gives precision 0.998, recall 0.981 and f-measure 0.990. For NSW, unencrypted linkage gives precision 0.986, recall 0.972 and f-measure 0.979, while encrypted linkage using bigrams gives precision 0.985, recall 0.970 and f-measure 0.978.

Dinusha Vatsalan and Peter Christen discussed efficient Bloom filter-based masking approaches for numerical data [5] and developed a solution for the privacy-preserving SPM (PP-SPM) problem. Their main contributions are: (1) building on Bloom filter masking techniques that have been shown to be successful for approximate matching of string data [3, 6], they propose novel Bloom filter masking approaches for numerical data with characteristics similar to those for string data (i.e., they are efficient, effective and secure); (2) they develop a comprehensive framework for PP-SPM using Bloom filter masking-based similarity calculations for matching different types of data; (3) they report on an extensive empirical study of their framework on three publicly available real-world datasets. They evaluated the accuracy and privacy of their proposed Bloom filter-based numerical data matching approach for PP-SPM. They used 10,000 generated pairs of numerical values with uniform random distribution for three attributes: age (integer in the range [1, 101]), salary (integer in the range [0, 100,000]) and body weight (floating point in the range [0.000, 120.000]). They set the length of the Bloom filters to l = 1000 and the number of hash functions to k = 10.



Peter Christen, Thilina Ranbaduge, Dinusha Vatsalan and Rainer Schnell discussed precise and fast cryptanalysis for bloom filter-based privacy-preserving record linkage [7]. In their paper they extend their recently proposed attack method [8], which exploits how q-grams are hashed into Bloom Filters. For each bit position, this method identifies a set of possible and a set of not-possible q-grams and then re-identifies attribute values using only the sets of possible q-grams. However, as their experiments show, these sets have low precision, which leads to low re-identification accuracy. In contrast, their novel attack method uses the sets of not-possible q-grams, which have higher precision. They conducted experiments using three data sets from two countries. The first is a pair of 'NCVR' datasets (the database to be attacked and the plaintext database), from which they extracted two subsets containing 224,073 (2014) and 224,061 (2016) records, respectively. The second pair of data sets, 'UKCD', are census records collected from the years 1851 (the sensitive database) and 1861 (the plain-text database); these data sets contain 17,034 (1851) and 22,430 (1861) records. For both data sets they use the First name, Surname, and City (NCVR) or Address (UKCD) attributes, respectively, as well as the concatenation of pairs of these attributes. A third data set was a voter database from Michigan containing 7,408,330 records, which they used as the plain-text database available to an attacker in a cross-state attack experiment.

Lailong Luo, Deke Guo, Richard T.B. Ma, Ori Rottenstreich and Xueshan Luo published a paper named Optimizing Bloom Filter: Challenges, Solutions, and Comparisons [9]. They survey the existing variants from two dimensions, i.e., performance and generalization. To improve performance, dozens of variants devote themselves to reducing the false positives and easing the implementation. Besides, tens of variants generalize the BF framework to more scenarios by diversifying the input sets and enriching the output functionalities. In their survey, more than 60 up-to-date designs are reviewed and qualitatively yet systematically analyzed. Despite its space-efficiency, the BF still faces some challenges related to false positives, implementation, elasticity and functionality. To ease these challenges and further improve the performance of the BF, they consider improvements from four angles, i.e., techniques to reduce the false positives, optimizations in a real implementation, dedicated designs for diverse datasets, and proposals to enable more functionalities.

Ibrahim Lazrig, Toan C. Ong, Indrajit Ray, Indrakshi Ray, Xiaoqian Jiang and Jaideep Vaidya published a paper named Privacy Preserving Probabilistic Record Linkage Without Trusted Third Party [10]. In their work, they focused on PPRL across two datasets that contain records belonging to the same entity that need to be linked, but where a deterministic match is infeasible because of discrepancies in the values of the attributes being matched. Their work is based on the use of Bloom filters to perform probabilistic PPRL, and on the concept of garbled circuits proposed by Yao, which allows two parties to securely compute any function f(x, y) without sharing the x and y values. The total time taken to link the 10k*10k datasets was 2,199 seconds (about 36 minutes), while the non-distributed garbled circuit took 30,283 seconds (about 8.5 hours). The data fields included in both datasets are First name, Last name, Social Security Number (SSN), Date of birth and Identification (ID). In order to validate the results, they use an identification number (ID) for each record such that all records that belong to the same individual have the same identification number. Each dataset contains 10K records, and both datasets have 6K records in common. A record which is in both datasets has the same value in the ID field; therefore, the values of the ID field are used as the ground truth to verify linkage performance.

Several approaches have thus been developed for privacy-preserving record linkage, but they have some drawbacks. Firstly, most of them use algorithms against which cryptanalysis attacks can succeed, due to the use of less secure hash functions. Secondly, increasing the number of hashes reduces the performance of the Bloom Filter. Thirdly, in most cases they use the Bloom Filter for string attributes or for numerical attributes separately.



3. METHODS
We used a privacy technique called the Bloom Filter, and we also used a two-party protocol in our work, because two-party protocols can be considered more secure: they do not rely on the existence of a third party, though they require a greater amount of records [11]. Fig 3.1 shows the workflow of our methodology.

Figure 3.1 Workflow of our Methodology



3.1 Datasets
For our work, we used two individual datasets, named Dataset A and Dataset B, collected from the popular machine learning website Kaggle. Dataset A contains 1000 unique patient records, from which we used three attributes (patient_name, patient_gender, patient_dob). Dataset B contains 500 records, of which 400 are duplicates of records in Dataset A; both datasets are complete.

3.2 Generalizing the attribute (DOB)
The attribute patient_dob is in (dd/mm/yyyy) format, so we generalized it using these steps (see the sketch after Section 3.3):
• Convert patient_dob (dd/mm/yyyy) into an age.
• Place the ages into particular ranges.
• For every particular range, use a special symbol.

3.3 Concatenating attributes
In this step we concatenate our attributes patient_name, patient_gender and patient_dob using special symbols. For instance:
patient_name + '$' + patient_gender + '%' + patient_dob
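As an illustration, the preprocessing of Sections 3.2 and 3.3 can be sketched in Python as follows; the age bands, band symbols, reference date and helper names here are our own assumptions for the sketch, since the paper does not fix them:

from datetime import date

# Illustrative age bands and their special symbols (assumed, not from the paper).
AGE_BANDS = [(0, 18, '@'), (19, 40, '#'), (41, 65, '&'), (66, 200, '*')]

def generalize_dob(dob_str, today=date(2022, 1, 1)):
    """Convert a dd/mm/yyyy date of birth into the symbol of its age range."""
    day, month, year = (int(part) for part in dob_str.split('/'))
    age = today.year - year - ((today.month, today.day) < (month, day))
    for low, high, symbol in AGE_BANDS:
        if low <= age <= high:
            return symbol
    raise ValueError('age outside the defined ranges')

def concatenate_record(name, gender, dob_str):
    """Join the three attributes with special symbols, as in Section 3.3."""
    return name + '$' + gender + '%' + generalize_dob(dob_str)

print(concatenate_record('John Smith', 'M', '14/07/1985'))  # John Smith$M%#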
3.4 Element insertion in Bloom Filter
All the elements are added one by one to Bloom Filter 1, and the same is done for Bloom Filter 2. Before the elements are inserted, every bit of each bloom filter is initially set to 0.

3.5 Apply double hashing technique
In this section we discuss the double hashing method and how our algorithm works. To reduce collisions we apply the double hashing technique to our record linkage task, and for security we encrypt all elements inserted into the bloom filter. We used two secure hash algorithms, SHA224 and SHA256.

Algorithm 3.5.1 (Secure Hash Algorithm 224):
a) Variable = string which will be added to the BF.
b) Encode the variable using variable.encode(), which converts the string into bytes.
c) Apply the hexdigest() function to the encoded data, which produces a 56-digit value representing the data in hexadecimal format.
d) Use the int() function to convert the hexadecimal value into decimal form.

Algorithm 3.5.2 (Secure Hash Algorithm 256):
a) Variable = string which will be added to the BF.
b) Encode the variable using variable.encode(), which converts the string into bytes.
c) Apply the hexdigest() function to the encoded data, which produces a 64-digit value representing the data in hexadecimal format.
d) Use the int() function to convert the hexadecimal value into decimal form.

Now, the formula of the double hashing technique [2] is:
Double hashing = h1(x) + i*h2(x)
where h1 = first hash function (SHA256), h2 = second hash function (SHA224), x = value/record, i ranges from 0 to k-1, and k is the number of hash functions.

3.6 Compute the modulus of the double hash
To compute a bit position in the bloom filter, we take the result of the double hash modulo the size of the bloom filter:
Bit position = (Double hashing) % Bloom_Filter_Size

3.7 Set bit positions
Now we set the bit positions for every element in the bloom filter: the computed positions are changed to 1. When all required elements have been added in this way, the bloom filter is complete and ready for comparison. Fig 3.2 shows how the bloom filter sets bit positions for an element.
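A minimal Python sketch of Sections 3.4 to 3.7 is given below; the class and method names are our own, but the double hashing scheme (SHA256 as h1, SHA224 as h2) and the modulus step follow the description above:

import hashlib

class BloomFilter:
    """Bit-array Bloom filter using double hashing with SHA256 and SHA224."""

    def __init__(self, size, num_hashes):
        self.size = size              # number of bit positions in the filter
        self.num_hashes = num_hashes  # k, the number of hash functions
        self.bits = [0] * size        # every position initially 0 (Section 3.4)

    def _positions(self, element):
        # Algorithms 3.5.1 and 3.5.2: encode the string, hash it, and read
        # the hexadecimal digest as a decimal integer.
        h1 = int(hashlib.sha256(element.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha224(element.encode()).hexdigest(), 16)
        # Double hashing h1(x) + i*h2(x), reduced modulo the filter size.
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos] = 1        # Section 3.7: set the bit positions to 1

    def contains(self, element):
        return all(self.bits[pos] == 1 for pos in self._positions(element))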

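With this sketch, the linkage itself amounts to inserting every concatenated record of Dataset A into one filter and probing it with the records of Dataset B; the records and parameters below are purely illustrative, not the authors' exact code:

# Test Result 1 configuration (Section 4.3): 1024 bits, 4 hash functions.
bf_a = BloomFilter(size=1024, num_hashes=4)

dataset_a = ['John Smith$M%#', 'Mary Jones$F%&']  # concatenated records
dataset_b = ['John Smith$M%#', 'Alan Kay$M%*']

for record in dataset_a:
    bf_a.add(record)

# A record of B is reported as a match if all of its bit positions are set.
matches = sum(1 for record in dataset_b if bf_a.contains(record))
print(matches)  # 1 true match; false positives are possible, misses are not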


Figure 3.2 Bit Position of Bloom Filter

4. Result Analysis and Discussion

4.1 Environment Setup

Implementation requirements: The implementation tools used in this research are listed below:
• Personal computer
• 64-bit Windows 7 operating system
• Python 3.7.0
• Anaconda 4.4.0
• Spyder

Working environment: The machine used for the experiment was a personal computer running a 64-bit Windows 7 operating system on a Core i5 2.2 GHz processor, with 4.00 GB of RAM and 750 GB of hard-disk space.

4.2 Experimental Setup

Dataset: We experimented on a patient admission dataset downloaded from Kaggle [12]. Dataset 1 contains 1000 text records and Dataset 2 contains 500 text records; both are complete.

4.3 Result Analysis

# Test Result 1:
Size of the bloom filter: 1024 bits
Dataset A: 100 records
Dataset B: 100 records (16 duplicate records)

No. of hash functions    Matches    Non-matches    Average time (s)
1                        28         72             0.030
2                        21         79             0.035
3                        20         80             0.038
4                        16         84             0.040

Table 4.1: Average time for different numbers of hash functions.

No. of hash functions    Precision    Recall    Accuracy
1                        0.57         1         0.88
2                        0.76         1         0.95
3                        0.8          1         0.96
4                        1.0          1         1.0

Table 4.2: Precision, recall and accuracy for different numbers of hash functions.

In test result 1, we measure the performance of our bloom filter for different numbers of hash functions and see how quickly it responds. We also measure the precision, recall and accuracy for each number of hash functions. Using more hash functions increases the accuracy and decreases the false positives.

# Test Result 2:
Size of the bloom filter: 2200 bits
Dataset A: 1000 records
Dataset B: 500 records (400 duplicate records)

No. of hash functions    Matches    Non-matches    Average time (s)
1                        525        475            0.270
2                        478        521            0.300
3                        458        542            0.350
4                        455        545            0.400

Table 4.3: Average time for different numbers of hash functions.
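For clarity, the entries of Tables 4.1 and 4.2 follow the usual definitions of these metrics, reading the 16 known duplicates as positives that are always found (recall 1) and any extra matches as Bloom filter false positives; the helper below is our own sketch under that reading:

def linkage_metrics(matches, true_duplicates, total):
    """Precision, recall and accuracy when all true duplicates are matched
    (no false negatives) and surplus matches are false positives."""
    tp = true_duplicates
    fp = matches - true_duplicates
    tn = total - matches
    fn = 0                          # every true duplicate is found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total
    return precision, recall, accuracy

# One hash function in Test Result 1: 28 matches against 16 true duplicates.
print(linkage_metrics(28, 16, 100))  # (0.571..., 1.0, 0.88), as in Table 4.2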



No. of hash functions    Precision    Recall    Accuracy
1                        0.761        1         0.875
2                        0.835        1         0.921
3                        0.873        1         0.942
4                        0.879        1         0.945

Table 4.4: Precision, recall and accuracy for different numbers of hash functions.

In test result 2, for 1000 records, we measure the performance of our bloom filter for the different numbers of hash functions and see how quickly it works for larger data. We also measure the precision, recall and accuracy in terms of records. As in test result 1, using more hash functions increases the accuracy and decreases the false positives.

Test result 1 (100):

No. of hash functions    Precision    Recall    Accuracy
4                        1            1         1

Table 4.5: Precision, recall and accuracy for 4 hash functions.

In test result 1, we obtained precision 1, recall 1 and accuracy 1 using 4 hash functions.

Test result 2 (1000):

No. of hash functions    Precision    Recall    Accuracy
4                        0.879        1         0.945

Table 4.6: Precision, recall and accuracy for 4 hash functions.

In test result 2, we obtained precision 0.879, recall 1 and accuracy 0.945 using 4 hash functions. In both results, 4 is the optimal number of hash functions; by using it we can increase the performance of the bloom filter as well as control the false positives.

Result Comparison

Test result 1 (100):

No. of hash functions    Precision    Recall    Accuracy
4                        0.88         1         0.98

Table 4.7: Precision, recall and accuracy for 4 hash functions.

In test result 1, we obtained precision 0.88, recall 1 and accuracy 0.98 using 4 hash functions.

Test result 2 (1000):

No. of hash functions    Precision    Recall    Accuracy
4                        0.83         1         0.92

Table 4.8: Precision, recall and accuracy for 4 hash functions.

In test result 2, we obtained precision 0.83, recall 1 and accuracy 0.92 using 4 hash functions.

These comparison figures come from running our record linkage task on the same datasets with the hash algorithms used in earlier work, SHA1 [1, 2] and MD5 [2], which gave different results. With the algorithms we applied, we obtained better results and better performance.

4.4 Discussion



The results of our empirical study show the effectiveness and efficiency of our privacy-preserving record linkage approach using the Bloom Filter. We compare the results obtained with different hash functions and show how this choice is reflected in the results, namely in time, precision, recall and accuracy. We also describe the false positive probability and show how it changes as the number of hash functions increases. Using more hashes decreases the false positives and increases the performance and accuracy, which are closely tied to the linkage quality and the security of the system.
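The false positive behaviour referred to here follows the standard Bloom filter approximation p = (1 - e^(-kn/m))^k; the sketch below, with m and n taken from Test Result 1 under our assumption that all 100 records of Dataset A sit in one filter, shows how p falls as the number of hash functions k grows toward the optimum k = (m/n) ln 2:

from math import exp, log

def false_positive_rate(m, n, k):
    """Standard approximation of the Bloom filter false positive probability
    for m bits, n inserted elements and k hash functions."""
    return (1 - exp(-k * n / m)) ** k

m, n = 1024, 100          # Test Result 1 configuration, as an example
for k in range(1, 5):
    # p falls from about 0.093 (k=1) to about 0.011 (k=4) here.
    print(k, round(false_positive_rate(m, n, k), 4))
print('optimal k ~', round((m / n) * log(2), 1))  # about 7.1 for this m and n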
5. Conclusions

By using the Bloom Filter for privacy-preserving record linkage, we can easily encrypt and compare individual records between two or more datasets and uniquely identify a user's records. To keep user information safe from external interference or attackers, we secure the system through several processes, so that organizations and individuals can feel comfortable sharing their personal information. We hope that our work will keep user information more secure; we also checked the linkage quality using different parameters. In the future, we wish to develop the privacy-preserving record linkage task further by applying other privacy-based techniques for better performance. We would also like to check the performance and accuracy of these techniques when records are linked during the record linkage process, and to test how the records can be secured in various formats.

6. References

[1] Randall, S. M., Ferrante, A. M., Boyd, J. H., Bauer, J. K., & Semmens, J. B. (2014). Privacy-preserving record linkage on large real world datasets. Journal of Biomedical Informatics, 50, 205-212.
[2] Randall, S. M., Ferrante, A. M., Boyd, J. H., Bauer, J. K., & Semmens, J. B. (2014). Privacy-preserving record linkage on large real world datasets. Journal of Biomedical Informatics, 50, 205-212.
[3] Schnell, R., Bachteler, T., & Reiher, J. (2009). Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making, 9(1), 41.
[4] Kuzu, M., Kantarcioglu, M., Durham, E., & Malin, B. (2011, July). A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In International Symposium on Privacy Enhancing Technologies Symposium (pp. 226-245). Springer, Berlin, Heidelberg.
[5] Schnell, R., Bachteler, T., & Reiher, J. (2011). A novel error-tolerant anonymous linking code. German Record Linkage Center, Working Paper Series No. WP-GRLC-2011-02.
[6] Vatsalan, D., & Christen, P. (2016). Privacy-preserving matching of similar patients. Journal of Biomedical Informatics, 59, 285-298.
[7] Durham, E. A. (2012). A framework for accurate, efficient private record linkage (Doctoral dissertation, Vanderbilt University).
[8] Christen, P., Ranbaduge, T., Vatsalan, D., & Schnell, R. (2018). Precise and fast cryptanalysis for Bloom filter based privacy-preserving record linkage. IEEE Transactions on Knowledge and Data Engineering.
[9] Christen, P., Schnell, R., Vatsalan, D., & Ranbaduge, T. (2017, May). Efficient cryptanalysis of Bloom filters for privacy-preserving record linkage. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 628-640). Springer, Cham.
[10] Luo, L., Guo, D., Ma, R. T., Rottenstreich, O., & Luo, X. (2018). Optimizing Bloom filter: Challenges, solutions, and comparisons. IEEE Communications Surveys & Tutorials, 21(2), 1912-1949.
[11] Lazrig, I., Ong, T. C., Ray, I., Ray, I., Jiang, X., & Vaidya, J. (2018, August). Privacy preserving probabilistic record linkage without trusted third party. In 2018 16th Annual Conference on Privacy, Security and Trust (PST) (pp. 1-10). IEEE.
[12] Vatsalan, D., Christen, P., & Verykios, V. S. (2013). A taxonomy of privacy preserving record linkage techniques. Information Systems, 38(6), 946-969.
[13] Randall, S. M., Ferrante, A. M., Boyd, J. H., Brown, A. P., & Semmens, J. B. (2016). Limited privacy protection and poor sensitivity: Is it time to move on from the statistical linkage key-581? Health Information Management Journal, 45(2), 71-79.

