Research Proposal of DBMSNO
2 authors:
Enis Karaarslan
Feriştah Dalkılıç
Mugla Üniversitesi
Dokuz Eylul University
All content following this page was uploaded by Enis Karaarslan on 22 June 2015.
Abstract
Research datasets are collected during research projects all around the world, and their value persists after the
projects end. It is important to keep them available for further research. However, there are practical, legal and
ethical issues in archiving and sharing research data. Mechanisms are needed by which researchers can share
scientific data while preserving data privacy and confidentiality. In this paper, a web-based research data
management system is proposed in which researchers can upload and share their research data. A web interface
with user authentication has been developed as a prototype. The user will be able to upload research data and
specify which fields are to be anonymized. CryptDB is used to keep the specified database fields encrypted to
provide privacy and confidentiality. The user's login password is used to create the encryption/decryption key.
The platform will also provide services through which researchers can share their research data while preserving
its privacy. The modules will include privacy tests such as k-anonymity. Research data can be shared fully or
partially between researchers. The system also aims to provide data summarization and statistical methods as
services to users. These services can also be used to create subsets of the research data. Surveillance,
Epidemiology, and End Results Program (SEER) research data is used as the test data source in this work.
Implementation details, performance issues and recommendations are given in the paper.
Keywords: Confidentiality, information privacy, encryption, privacy test, research data management
1. INTRODUCTION
Research datasets are collected during research projects all around the world, and their value persists after the
projects end. It is important to keep them available for further research. Research data sharing is discussed in
[1-3]. As stated in [3], ways are needed to make research data discoverable, available and reusable by others.
There are practical, legal and ethical issues in archiving and sharing research data. Mechanisms are needed by which
researchers can share scientific data while preserving data privacy and confidentiality. Confidentiality covers two
related concepts: data confidentiality and privacy. Data confidentiality ensures that private or confidential data
can be accessed only by authorized parties. Privacy ensures that individuals control what data is collected about
them and who stores and accesses that data [4].
This paper proposes a solution that is a step toward solving the confidentiality and privacy issues of the research
data management process. First, the proposed model, RDMS-CP, is explained. Then the implementation and methods
are given. Finally, data sharing using session keys and privacy-preserving data sharing methods are explained, and
possible future work is outlined.
We propose the RDMS-CP (Research Data Management System with Confidentiality and Privacy) model, which is
based on CryptDB [8] and privacy algorithms. We restrict the model to small datasets. The system will provide
services for data summarization and for applying statistical methods. Researchers will be able to share research data
fully or partially for a given time period.
Data in its original form typically contains sensitive information about individuals, and publishing such data will
violate individual privacy. Privacy-preserving data publishing (PPDP) provides methods and tools for publishing
useful information while preserving data privacy. Many approaches have been proposed for different data publishing
scenarios. Different approaches to PPDP have been summarized and evaluated systematically as a survey in [9].
Online data journals are becoming widely used to share data between researchers. A recent work [3] analyzes more
than 100 existing data journals. There are also environments such as D4Science.org, which provide social networking
features and means to share data [10]. To our knowledge, no existing system also provides confidentiality and
privacy.
Admin Access: The database administrator (DBA) may try to learn private data. The DBA has root access
on the server and may mount attacks such as snooping on the DBMS server.
Attacker Access: The attacker can compromise and take control of both the application and the DBMS servers.
In this case, the system can still ensure the confidentiality of logged-out users' data.
CryptDB allows the DBMS server to execute SQL queries on encrypted data: operations such as selections,
projections, joins, aggregates and orderings are performed on ciphertexts. CryptDB receives queries from the
applications over a secure channel and forwards them to the database server. The database server returns encrypted
results, which CryptDB decrypts and sends back to the applications. As encryption types, CryptDB uses Random
(RND), Deterministic (DET), Order-preserving encryption (OPE), Join (JOIN and OPE-JOIN), and Word search
(SEARCH). Symmetric encryption is used, and the keys are created from the user's password. CryptDB uses
user-defined functions (UDFs) to perform cryptographic operations in the DBMS [8]. The system architecture is
given in Figure 1.
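The idea of creating a symmetric key from the login password, and of a deterministic scheme that still allows equality checks on protected values, can be sketched as follows. This is only an illustration under stated assumptions: PBKDF2 is our choice of key derivation function, the names derive_key and equality_token are hypothetical, and the HMAC-based token is a stand-in that cannot be decrypted. It is not CryptDB's actual key chaining or its DET encryption.

```python
import hashlib
import hmac
import os


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte symmetric key from a password (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


def equality_token(key: bytes, value: str) -> str:
    """Deterministic keyed token: equal values give equal tokens under one key.

    Stand-in only (assumption): it shows why a deterministic scheme lets the
    server match equal values without seeing them. It is not reversible and
    is not CryptDB's DET layer.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical usage: the salt would be stored next to the user record.
salt = os.urandom(16)
key = derive_key("example-password", salt)

token_a = equality_token(key, "Izmir")
token_b = equality_token(key, "Izmir")
token_c = equality_token(key, "Mugla")
```

Because the key depends on the password, the same password and salt always reproduce the same key, while a different password yields an unrelated one.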
Research Data Management System Proposal Having Confidentiality and Privacy
Feriştah Dalkılıç, Enis Karaarslan
Within the scope of this study, a database schema has been designed to support such a system. A naming convention
was followed when creating the data tables. Each table name begins with a project prefix (for example, RDMS_ for
Research Data Management System), followed by a three-character code indicating the content type of the table. The
ER diagram containing the administrative tables and the user tables is given in Figure 2.
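A small helper makes the naming convention concrete. It is a sketch only: the RDMS prefix comes from the example above, but the three-character codes ADM and USR, the separator after the code and the function name table_name are assumptions introduced for illustration.

```python
import re

PROJECT_PREFIX = "RDMS"          # project prefix, as in the paper's example
CONTENT_TYPES = {"ADM", "USR"}   # hypothetical three-character content codes


def table_name(content_type: str, name: str) -> str:
    """Build a table name of the form <PREFIX>_<CODE>_<NAME>."""
    code = content_type.upper()
    if code not in CONTENT_TYPES:
        raise ValueError(f"unknown content type: {content_type!r}")
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid table name part: {name!r}")
    return f"{PROJECT_PREFIX}_{code}_{name.upper()}"


users_table = table_name("adm", "users")
```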
The system can be used by both registered and unregistered users. However, unregistered users have limited rights:
an unregistered user can only search and access datasets that are shared publicly. The use-case diagram of the
RDMS is shown in Figure 3.
After logging in to the system, a user can view, download and delete his/her own datasets or upload a new dataset.
To share a dataset with other users, the user defines a sharing profile. Some tasks have to be completed before the
profile definition is finished. The user specifies the sharing type of each data table and data column as private,
anonymized or public. After that, the user can share the dataset with other users for a specified time period by
entering their user names or e-mail addresses. When the owner decides to share a database publicly, anyone using
the system can search and access this database. A sample user interface for sharing datasets is given in Figure 4.
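The sharing profile described above can be modelled as a small data structure. The sketch below is an assumption about how such a profile might look, not the prototype's actual schema: the class and field names are hypothetical, and we assume that public sharing has no time limit.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, Optional, Set


class SharingType(Enum):
    PRIVATE = "private"
    ANONYMIZED = "anonymized"
    PUBLIC = "public"


@dataclass
class SharingProfile:
    owner: str
    columns: Dict[str, SharingType]                     # column -> sharing type
    recipients: Set[str] = field(default_factory=set)   # user names or e-mails
    valid_from: Optional[datetime] = None
    valid_until: Optional[datetime] = None
    shared_publicly: bool = False

    def _active(self, now: datetime) -> bool:
        """True if 'now' falls inside the sharing period."""
        if self.valid_from is not None and now < self.valid_from:
            return False
        if self.valid_until is not None and now > self.valid_until:
            return False
        return True

    def visible_columns(
        self, user: Optional[str], now: datetime
    ) -> Dict[str, SharingType]:
        """Columns the given user may see; private columns stay with the owner."""
        if user is not None and user == self.owner:
            return dict(self.columns)
        invited = user in self.recipients and self._active(now)
        if not (self.shared_publicly or invited):
            return {}
        return {
            c: t for c, t in self.columns.items() if t is not SharingType.PRIVATE
        }


profile = SharingProfile(
    owner="alice",
    columns={
        "patient_name": SharingType.PRIVATE,
        "birth_date": SharingType.ANONYMIZED,
        "diagnosis": SharingType.PUBLIC,
    },
    recipients={"bob@example.org"},
    valid_from=datetime(2015, 5, 1),
    valid_until=datetime(2015, 5, 31),
)
```

Here the recipient sees only the anonymized and public columns, and only while the sharing period is active.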
ICENS International Conference on Engineering and Natural Science, 15-19 May 2015, Skopje, Macedonia
A registered user can view and download the datasets that other users have shared with him/her through the
“Shared with me” option located in the left frame of the page. The user cannot access private tables and columns,
and the tables that are shared with him/her are anonymized.
The page design is simple, and common UI elements are used to make users feel more comfortable. The usage flow
is similar to that of document sharing systems, which gives the user a familiar environment and helps him/her get
things done more quickly. Pre-selected fields reduce the burden on the user. Unnecessary elements are avoided, and
plain language is used in labels and messages.
3. EXPERIMENTAL
An ordinary PC with 8 GB of RAM and a 3.20 GHz i5 processor is used as the server. CryptDB is installed on a Linux
platform. Ubuntu 12.04 was chosen because the setup process is clearly defined and tested for this version. MySQL
5.5.14 is used as the database and Apache 2.2.22 as the web server. Setup procedures and installation files for
CryptDB can be found on the official CryptDB website [11]. The web interface is coded in Java, and the Eclipse
platform was used to develop it.
SEER Research Data [12] is used as a test dataset. This dataset is already anonymized, so no privacy methods are
applied to it. We used it to test uploading data to the database and running privacy tests. We also generated a
hospital (patient/doctor) dataset with dummy values. Anonymization and privacy tests are carried out on this
dataset.
The prototype implements the following operations: uploading datasets to the database in encrypted form, executing
queries on these datasets, reading data from the encrypted database, and sharing a dataset with another researcher
on the system.
In this study, two functions are discussed: “Data sharing using session keys” and “Privacy preserving data
sharing”. In the data sharing phase, user A wants to share a dataset with user B. Session keys are used, and the
process is shown in Figure 5. The steps are as follows:
1. User A reads the dataset from the database, decrypting it with his/her own key (Key A).
2. User A shares the dataset with User B by using the session key.
3. User B writes the dataset to the database, encrypting it with his/her own key (Key B).
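The three steps above can be sketched end to end. Everything in this block is an assumption made for illustration: the cipher is a toy keystream built from HMAC-SHA256 using only the standard library (it has no integrity protection and is not what CryptDB uses), the function names are hypothetical, and how the session key reaches User B is left out because the text does not specify it.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Expand key and nonce into 'length' pseudo-random bytes (counter mode)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy symmetric encryption: a random nonce followed by plaintext XOR keystream."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Inverse of encrypt(): split off the nonce and XOR with the same keystream."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def share_dataset(stored_for_a: bytes, key_a: bytes, key_b: bytes) -> bytes:
    """Re-encrypt A's dataset for B, passing through a one-time session key."""
    session_key = os.urandom(32)                    # fresh key for this transfer
    plaintext = decrypt(key_a, stored_for_a)        # step 1: A reads with Key A
    in_transit = encrypt(session_key, plaintext)    # step 2: shared under session key
    received = decrypt(session_key, in_transit)     # B opens the transfer
    return encrypt(key_b, received)                 # step 3: B writes with Key B


key_a, key_b = os.urandom(32), os.urandom(32)
dataset = b"patient_id,diagnosis\n1,flu\n2,asthma\n"
stored_for_a = encrypt(key_a, dataset)
stored_for_b = share_dataset(stored_for_a, key_a, key_b)
```

Neither user ever needs the other's long-term key; only the short-lived session key is common to both sides.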
Before describing privacy-preserving data sharing, it is useful to explain the privacy threats. The privacy attack
models can be listed as Record Linkage, Attribute Linkage, Table Linkage and Probabilistic Attack [9]. In all types
of linkage attacks, we assume that the attacker knows the attributes that could potentially identify the record
owner. In record and attribute linkage, we also assume that the attacker knows that the victim's record is in the
released table and tries to identify the victim's record and/or sensitive information from the table. In table
linkage, the attacker seeks to determine the presence or absence of the victim's record in the released table. A
data table is considered to be privacy-preserving if it can effectively prevent the attacker from successfully
performing these linkages.
To prevent record linkage, the k-anonymity privacy model has been proposed [7, 13]. If one record in the table has
some value on the attributes that could potentially identify the record owner, at least k−1 other records must also
have the same value. In other words, the minimum group size on these attributes is at least k. A table satisfying
this requirement is called k-anonymous.
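The definition translates directly into a test: group the records by their potentially identifying attributes and take the smallest group size. The following is a minimal sketch; the function name, the attribute names and the dummy records are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def k_anonymity(records: List[Dict[str, str]], quasi_identifiers: Sequence[str]) -> int:
    """Return the k for which the table is k-anonymous on the given attributes.

    k is the size of the smallest group of records that share the same
    values on all quasi-identifier attributes (0 for an empty table).
    """
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Dummy hospital-style records (illustrative values only).
records = [
    {"age": "30-39", "zip": "350**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "350**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "351**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "351**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "351**", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age", "zip"])
```

In this example the smallest group on (age, zip) has two records, so the table is 2-anonymous on those attributes.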
The proposed model will provide tools for running privacy tests such as k-anonymity to assess the privacy level of
the resulting dataset. Larger values of k mean a higher privacy level [14]. The data owner (researcher) will be able
to apply the following methods to increase privacy before sharing the data:
Data summarization
Statistical methods
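As one illustration of how summarizing values can raise k before sharing, the sketch below generalizes exact ages into ten-year ranges. The text does not specify which summarization method the system uses, so generalization into ranges is our assumption here, and the helper names and sample ages are hypothetical.

```python
from collections import Counter
from typing import Iterable


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group(values: Iterable[str]) -> int:
    """Size of the smallest group of identical values (the k of this column)."""
    return min(Counter(values).values())


ages = [31, 34, 38, 42, 45, 47]

k_before = smallest_group(str(a) for a in ages)            # every age is unique
k_after = smallest_group(generalize_age(a) for a in ages)  # two ranges of three
```

The generalized column is less precise, which is the usual trade-off: a higher k is bought with some loss of detail in the shared data.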
The proposed system is not optimized for big data, and the current model is sufficient only for sharing small
datasets. However, we believe that this is sufficient for most surveys, and we will work on enhancing the model in
future studies.
We also encountered some challenges when implementing data sharing between users with session keys. We have
identified some possible solutions to these issues and are working on them.
5. CONCLUSIONS
Privacy- and confidentiality-preserving data sharing is a promising approach to information sharing as data
sharing increases among researchers, companies and organizations. In this study, we proposed a model in which
researchers can share their data with confidentiality and privacy. The proposed solution writes the datasets to the
database in encrypted form, and the performance impact is between 14% and 25% depending on the system. The web
interface is easy to use, and built-in functions will ensure the privacy of the shared data. We think that providing
such services will increase data sharing among researchers and that many academic studies will benefit from it.
Possible future work includes adding social networking services, artificial intelligence and big dataset handling.
We are also working on applying this model to different application areas.
ACKNOWLEDGEMENTS
This research is a joint work between MSKU NetSecLab (http://netseclab.mu.edu.tr/) and DEU SRG | Security
Research Group (http://srg.cs.deu.edu.tr/wp/).
We'd like to thank Mehmet Beşir Eren for his involvement (installing and testing the test environment) in this project.
REFERENCES
[1]. C. L. Borgman, “Research Data: Who will share what, with whom, when, and why?,” China-North America Library
Conference, Beijing, 2010
[2]. J. Furner, “The Conundrum of Sharing Research Data,” Journal of the American Society for Information Science and
Technology, vol. 62, 6, 2011.
[3]. L. Candela, D. Castelli, P. Manghi, A. Tani, “Data journals: A survey,” Journal of the Association for Information Science and
Technology, 2015.
[4]. W. Stallings, Cryptography and Network Security: Principles and Practice, 5th ed., Pearson/Prentice Hall, 2011.
[5]. (2013) OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data. [Online]. Available:
http://www.oecd.org/sti/ieconomy/oecdguidelinesontheprotectionofprivacyandtransborderflowsofpersonaldata.htm
[6]. T. Dalenius, “Towards a methodology for statistical disclosure control,” Statistik Tidskrift, vol. 15, pp. 429–444, 1977
[7]. P. Samarati, and L. Sweeney, “Generalizing data to provide anonymity when disclosing information,” in Proc. of the 17th
ACM SIGACT-SIGMOD-SIGART (PODS), New York, 188, 1998.
[8]. R. A. Popa, C. M. S. Redfield, N. Zeldovich, H. Balakrishnan, “CryptDB: protecting confidentiality with encrypted query
processing,” in Proc. of the 23rd ACM Symposium on Operating Systems Principles. ACM, 2011.
[9]. B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, “Privacy-preserving data publishing: A survey of recent developments,” ACM
Computing Surveys (CSUR), vol. 42, 4(14), 2010.
[10]. M. Assante, L. Candela, D. Castelli, F. Mangiacrapa, P. Pagano, I. Italian, “A Social Networking Research Environment for
Scientific Data Sharing: The D4Science Offering,” An Int. J. Grey Lit. 10, pp. 151–158, 2014.
[11]. CryptDB. [Online]. Available: http://css.csail.mit.edu/cryptdb
[12]. Surveillance, Epidemiology, and End Results (SEER) Program Research Data, National Cancer Institute, DCCPS,
Surveillance Research Program, Surveillance Systems Branch, released April 2014, based on the November 2013 submission.
[Online]. Available: www.seer.cancer.gov
[13]. P. Samarati, and L. Sweeney, “Protecting privacy when disclosing information: its enforcement through generalization and
suppression,” SRI International, Tech. Rep., 1998.
[14]. J. Sedayao, R. Bhardwaj, N. Gorade, “Making Big Data, Privacy, and Anonymization work together in the Enterprise:
Experiences and Issues,” in Proc. of the 3rd International Congress on Big Data, Anchorage, Alaska, pp. 601 – 607, 2014.