Bloom Filter
Shahidul Islam Khan
Associate Professor
Department of Computer Science & Engineering
International Islamic University, Chittagong
Email: nayeemkh@gmail.com

Maikel Sutradhar
Department of Computer Science & Engineering
International Islamic University, Chittagong
Email: michaelsutradhar@gmail.com

Emon Chowdhury
Department of Computer Science & Engineering
International Islamic University, Chittagong
Email: emonchy35@gmail.com
Abstract — In this modern era, preserving the privacy of user information is becoming a crucial issue. When two databases are linked, it is very important to keep individual user information secure, because the records contain sensitive information about users. In the current situation in Bangladesh, it is really important that individual data is preserved when different organizations link their databases. We used a technique for privacy-preserving record linkage called the Bloom Filter: a space-saving probabilistic data structure that helps identify an individual user uniquely while at the same time keeping the user's information secure. Record linkage has various challenges, including scalability to large databases, accurate matching, classification, privacy and confidentiality. Using this privacy technique we can secure each and every user's records and then identify them from those records, so that an attacker cannot access this information when the databases are linked for various purposes. The objective of this paper is to address the security of user information in an efficient way with this privacy technique when user information is linked.

Keywords — record linkage, bloom filter, privacy preserving, confidentiality, classification, scalability.

1. INTRODUCTION

In today's world, privacy is a great concern for individual users, who are connected through different technologies; the most concerning issue is how to protect the security of a user when different organizations or individuals collect data from different data sources. Many of these records are about people or their related information. Example fields include financial data such as shopping transactions, telecommunication records and electronic health records. In all of these fields the data are sensitive and equally important. When we link up different data sources, we need to check the quality of the data and at the same time integrate the data in a secure way, so that users' data remain safe from attackers or internal interference. It is a crucial issue to identify a user uniquely across dissimilar data sources. Record linkage is one of the substantial areas of data integration. It is the task of finding records in a data set that refer to the same entity across diverse data sources (examples are data files, books, websites and databases). When we perform the record linkage task, there is a possibility of leaking highly sensitive user information, which is very harmful for the organization as well as for the users. For that reason, various processes and protocols have been developed to protect individual privacy and to maintain the security of the data. Our work is to secure the record linkage task and at the same time keep the user information safe. For this reason we used a privacy-preserving technique called the bloom filter. Using this technique we are able to ensure the security of user information and check for the same entity across multiple datasets.
3.2 Generalizing the attribute (DOB)

The attribute patient_dob is in (dd/mm/yyyy) format, so we generalized it using these steps:
a) Convert patient_dob (dd/mm/yyyy) into an age.
b) Group the ages into particular ranges.
c) Use a special symbol for each particular range.

3.3 Concatenating attributes

In this section, we concatenate our attributes, such as patient_name, patient_gender and patient_dob, using special symbols. For instance:
patient_name + '$' + patient_gender + '%' + patient_dob

3.4 Element insertion into the Bloom Filter

All the elements are added one by one to Bloom Filter 1, and the same is done for Bloom Filter 2. Before inserting any elements, every bit of each bloom filter is initially set to 0.

3.5 Applying the double hashing technique

In this section, we discuss the double hashing method and how our algorithm works. To reduce collisions we apply the double hashing technique to our record linkage task, and for security purposes we hash all the elements inserted into the bloom filter. We used two secure hash algorithms, SHA-224 and SHA-256.

Algorithm 3.5.1 (Secure Hash Algorithm 224):
a) Variable = the string which will be added to the BF.
b) Encode the variable using variable.encode(), which converts the string into bytes.
c) Apply the hexdigest() function to the encoded data, which produces the digest as a hexadecimal string (56 digits for SHA-224, 64 for SHA-256).
d) Use the int() function to convert the hexadecimal value into decimal form.

Now, the formula of the double hashing technique [2] is:

Double hashing = h1(x) + i * h2(x)

where h1 = first hash function (SHA-256),
h2 = second hash function (SHA-224),
x = value/record,
and i ranges from 0 to k-1, where k is the number of hash functions.

3.6 Computing the modulus of the double hash

To compute a bit position in the bloom filter, we take the double hash result modulo the size of the bloom filter:

Bit position = (Double hashing) % Bloom_Filter_Size

3.7 Setting bit positions

Now we set the bit positions for every element in the bloom filter. These positions in the bloom filter are changed to 1. When all required elements have been added in this way, the bloom filter is complete and ready for comparison. Fig 3.2 shows how the bloom filter sets the bit positions for an element.
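The pre-processing steps of Sections 3.2 and 3.3 can be sketched as follows. The specific age ranges and the symbols assigned to them are illustrative assumptions for this sketch, as is the fixed reference date; the paper does not list the exact ranges:

```python
from datetime import date

# Illustrative age ranges and the special symbols assigned to them
# (assumed for this sketch; the exact ranges are a design choice).
AGE_RANGES = [(0, 18, "@"), (19, 40, "#"), (41, 65, "&"), (66, 200, "*")]

def generalize_dob(dob_str):
    """Convert patient_dob (dd/mm/yyyy) into an age-range symbol."""
    day, month, year = (int(p) for p in dob_str.split("/"))
    today = date(2020, 1, 1)  # fixed reference date so the example is reproducible
    age = today.year - year - ((today.month, today.day) < (month, day))
    for low, high, symbol in AGE_RANGES:
        if low <= age <= high:
            return symbol
    raise ValueError("age out of range")

def concatenate_record(name, gender, dob_str):
    """Join the attributes with the special separators from Section 3.3."""
    return name + "$" + gender + "%" + generalize_dob(dob_str)

print(concatenate_record("alice", "f", "15/06/1990"))  # alice$f%#
```

Replacing the exact date of birth with a range symbol is what makes the attribute "generalized": two patients born in different years of the same range produce identical concatenated strings.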
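Sections 3.5 through 3.7 can be combined into a single insertion routine: each record is hashed with SHA-256 and SHA-224, the two digests are combined with the double hashing formula h1(x) + i*h2(x) for i = 0 to k-1, and each result is taken modulo the filter size to obtain the bit positions to set. A minimal sketch, where the filter size is taken from Test Result 1 and k = 4 is an example value:

```python
import hashlib

BLOOM_FILTER_SIZE = 1024  # bit size used in Test Result 1
K = 4                     # number of hash functions (example value)

def bit_positions(record, size=BLOOM_FILTER_SIZE, k=K):
    """Compute the k bit positions for a record via double hashing."""
    encoded = record.encode()                          # string -> bytes
    h1 = int(hashlib.sha256(encoded).hexdigest(), 16)  # first hash (SHA-256)
    h2 = int(hashlib.sha224(encoded).hexdigest(), 16)  # second hash (SHA-224)
    return [(h1 + i * h2) % size for i in range(k)]

def build_bloom_filter(records, size=BLOOM_FILTER_SIZE, k=K):
    """All bits start at 0; each record's k positions are changed to 1."""
    bits = [0] * size
    for record in records:
        for pos in bit_positions(record, size, k):
            bits[pos] = 1
    return bits

bf = build_bloom_filter(["alice$f%#", "bob$m%&"])
print(sum(bf))  # number of bits set (at most K * number of records)
```

Because the same record always maps to the same k positions, two bloom filters built this way can be compared bit-by-bit without ever exchanging the underlying plaintext records.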
4.2 Experimental Setup

Dataset: We experimented on a patient admission dataset which we downloaded from Kaggle [12]. Dataset 1 contains 1000 text records and dataset 2 contains 500 text records, all of which are complete.

4.3 Result Analysis

# Test Result 1:
The size of the bloom filter: 1024 bits
Dataset A: 100 records
Dataset B: 100 records (16 duplicate records)

# Test Result 2:
The size of the bloom filter: 2200 bits
Dataset A: 1000 records
Dataset B: 500 records (400 duplicate records)

Number of hash functions | Matches | Non-Matches | Average Time (seconds)
1 | 525 | 475 | 0.270
2 | 478 | 521 | 0.300
3 | 458 | 542 | 0.350
4 | 455 | 545 | 0.400

Table 4.3: Average time for the hash functions.
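The precision, recall and accuracy figures reported in this section are the standard classification metrics over true/false matches. A minimal sketch of how they are computed from linkage counts; the counts used here are illustrative, not taken from our experiments:

```python
def linkage_metrics(tp, fp, tn, fn):
    """Standard quality metrics for record linkage classification."""
    precision = tp / (tp + fp)              # declared matches that are true matches
    recall = tp / (tp + fn)                 # true matches that were found
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy

# Illustrative counts: 88 true matches, 12 false matches, 400 true non-matches
p, r, a = linkage_metrics(tp=88, fp=12, tn=400, fn=0)
print(round(p, 2), r, round(a, 2))  # 0.88 1.0 0.98
```

Note that with a bloom filter false negatives never occur (fn = 0), which is why recall is 1 in every test below: a bloom filter can report a false positive but never misses a true match.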
Table 4.4: Precision, recall and accuracy for the hash functions.

In test result 2, for 1000 records, we measure the performance of our bloom filter for different numbers of hash functions and see how quickly it works for large data. We also measure the precision, recall and accuracy of the hash functions in terms of records. As in test result 1, using more hash functions increases the accuracy and decreases the false positives.

Test result 1 (100):

No of hash functions | Precision | Recall | Accuracy
4 | 1 | 1 | 1

Table 4.5: Precision, recall and accuracy for 4 hash functions.

In test result 1, we got precision 1, recall 1 and accuracy 1 using 4 hash functions.

Table 4.6: Precision, recall and accuracy for 4 hash functions.

Test result 1 (100):

No of hash functions | Precision | Recall | Accuracy
4 | 0.88 | 1 | 0.98

Table 4.7: Precision, recall and accuracy for 4 hash functions.

In test result 1, we got precision 0.88, recall 1 and accuracy 0.98 using 4 hash functions.

Test result 2 (1000):

No of hash functions | Precision | Recall | Accuracy
4 | 0.83 | 1 | 0.92

Table 4.8: Precision, recall and accuracy for 4 hash functions.

In test result 2, we got precision 0.83, recall 1 and accuracy 0.92 using 4 hash functions.

4.4 Discussion

The results of our empirical study show the effectiveness and efficiency of our privacy