Conference Paper · May 2015

Research Data Management System
Proposal Having Confidentiality and Privacy
Feriştah Dalkılıç1, Enis Karaarslan2

Abstract
Research datasets are collected during research projects all around the world, and their value persists after the
projects end. It is important to keep them available for further research. However, there are practical, legal and
ethical issues in archiving and sharing research data. Mechanisms are needed by which researchers can share
scientific data while preserving data privacy and confidentiality. In this paper, a web-based research data
management system is proposed in which researchers can upload and share their research data. A web interface
with user authentication has been developed as a prototype. The user will be able to upload research data and
specify which fields are to be anonymized. CryptDB is used to keep the specified database fields encrypted in
order to provide privacy and confidentiality. The user's login password is used to create the
encryption/decryption key. The platform will also provide services through which researchers can share their
research data while preserving its privacy. The modules will include privacy tests such as k-anonymity. Research
data can be shared fully or partially between researchers. The system aims to offer data summarization and
statistical methods as services to users; these services can also be used to create subsets of the research data.
Surveillance, Epidemiology, and End Results Program (SEER) research data is used as the test data source in this
work. Implementation details, performance issues and recommendations are given in the paper.

Keywords: Confidentiality, information privacy, encryption, privacy test, research data management

1. INTRODUCTION

Research datasets are collected during research projects all around the world, and their value persists after the
projects end. It is important to keep them available for further research. Research data sharing is discussed in
[1-3]. As stated in [3], we need ways to make research data discoverable, available and reusable by others.

There are practical, legal and ethical issues in archiving and sharing research data. Mechanisms are needed by which
researchers can share scientific data while preserving data privacy and confidentiality. Confidentiality covers two
related concepts: data confidentiality and privacy. Data confidentiality assures that private or confidential data is
accessible only to authorized parties. Privacy assures that individuals control what data is collected about them
and who stores and accesses that data [4].

1 Corresponding author: Dokuz Eylül University, Department of Computer Engineering, İzmir, feristah@cs.deu.edu.tr
2 Muğla Sıtkı Koçman University, Department of Computer Engineering, Muğla, enis.karaarslan@mu.edu.tr

Some researchers are unwilling to store their data on servers that they do not administer. They want to know who has
accessed the data and whether their work or data is cited in the resulting studies. Cryptographic methods are needed
to achieve confidentiality, so that only authorized access is possible. Research data may also contain personal data,
which requires special handling [5]. In order to satisfy privacy, some fields can be anonymized. Anonymization means
hiding the identity and/or the sensitive data of record owners, assuming that sensitive data must be retained for data
analysis [6, 7]. Anonymization may be implemented with methods such as data summarization and/or statistical
methods. Privacy tests such as k-anonymity should be used to assess the privacy level of the data. Providing such
services should increase the number of shared datasets.

In this paper, we propose a solution that is a step towards solving the confidentiality and privacy issues of the
research data management process. In Section 2, the proposed model, RDMS-CP, is explained. Section 3 gives the
implementation and methods, including data sharing using session keys and privacy-preserving data sharing. Results
are discussed in Section 4, and conclusions and possible future work are given in Section 5.

2. DEVELOPING MODELS FOR RESEARCH DATA MANAGEMENT

We propose the RDMS-CP (Research Data Management System with Confidentiality and Privacy) model, which is based
on CryptDB [8] and privacy algorithms. We constrain our model to small datasets. The system will provide services for
data summarization and for applying statistical methods. Researchers will be able to share research data fully or
partially for a given time period.

2.1. Related Work

Data in its original form typically contains sensitive information about individuals, and publishing such data will
violate individual privacy. Privacy-preserving data publishing (PPDP) provides methods and tools for publishing
useful information while preserving data privacy. Many approaches have been proposed for different data publishing
scenarios. Different approaches to PPDP have been summarized and evaluated systematically as a survey in [9].

Online data journals are becoming widely used to share data between researchers. A recent work [3] analyzes more
than 100 existing data journals. There are also environments such as D4Science.org, which provide social networking
facilities and means to share data [10]. To our knowledge, no existing system also provides confidentiality and
privacy.

2.2. RDMS-CP Model System Architecture


CryptDB is a system developed at MIT that provides practical and provable confidentiality for database systems
[8, 11]. CryptDB acts as a proxy between the application and the database [8], so that data is written to the database
in encrypted form. In the proposed model, CryptDB is used to secure the path between the applications and the
database server. The system addresses the following two threats [8]:

• Admin Access: A curious database administrator (DBA) may try to learn private data. The DBA has root access
to the device and may mount attacks such as snooping on the DBMS server.

• Attacker Access: An attacker may take control of both the application and the DBMS servers. In this case,
the system can still ensure the confidentiality of logged-out users' data.

CryptDB allows the DBMS server to execute SQL queries on encrypted data: operations such as selections,
projections, joins, aggregates and orderings are performed on ciphertexts. CryptDB securely receives queries from the
applications and forwards them to the database server. The database server returns encrypted results, which CryptDB
decrypts and sends back to the applications. The encryption types used by CryptDB include Random (RND),
Deterministic (DET), Order-preserving encryption (OPE), Join (JOIN and OPE-JOIN) and Word search (SEARCH).
Symmetric encryption is used, and the keys are derived from the users' passwords. CryptDB uses user-defined
functions (UDFs) to perform cryptographic operations in the DBMS [8]. The system architecture is given in Figure 1.
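
As a minimal sketch of how a symmetric key could be derived from a login password, the following Python fragment uses PBKDF2 from the standard library. The choice of PBKDF2-HMAC-SHA256, the salt size, the iteration count and the function names are assumptions made for illustration; they are not taken from the prototype, and CryptDB's own key handling is not reproduced here.

```python
import hashlib
import secrets

# Assumed parameters (not from the paper): PBKDF2-HMAC-SHA256,
# 16-byte random salt, 100,000 iterations, 32-byte output key.
ITERATIONS = 100_000
KEY_LENGTH = 32


def new_salt() -> bytes:
    """Return a random per-user salt, to be stored next to the user record."""
    return secrets.token_bytes(16)


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a symmetric key from the login password and the stored salt."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=KEY_LENGTH
    )


salt = new_salt()
key = derive_key("correct horse battery staple", salt)
print(len(key))  # 32
```

Because the key depends only on the password and the salt, it can be recomputed at every login and never needs to be stored on the server.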

Figure 1. System architecture

Within the scope of this study, a database schema has been designed for the system. A naming convention is followed
for the data tables: each table name begins with a project prefix (for example, RDMS_ for Research Data Management
System), followed by a 3-character code indicating the content type of the table. The ER diagram containing the
administrative tables and the user tables is given in Figure 2.

Figure 2. ER-diagram – administrative tables and user tables
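
The table naming convention can be illustrated with a small helper. This is only a sketch: the content-type codes ADM and USR and the trailing descriptive name are hypothetical, since the paper states only the RDMS_ prefix and the 3-character content-type code.

```python
def table_name(content_type: str, name: str, project: str = "RDMS") -> str:
    """Build a table name as <project>_<3-letter content type>_<name>.

    The codes and the trailing name part are hypothetical examples.
    """
    if len(content_type) != 3 or not content_type.isalpha():
        raise ValueError("content type must be exactly 3 letters")
    return f"{project}_{content_type.upper()}_{name.lower()}"


print(table_name("ADM", "users"))    # RDMS_ADM_users
print(table_name("usr", "Dataset"))  # RDMS_USR_dataset
```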

The system can be used by both registered and unregistered users; however, unregistered users have limited rights.
An unregistered user can only search for and access datasets that are shared publicly. The use-case diagram of the
RDMS is shown in Figure 3.

When a user logs in to the system, he/she can view, download and delete his/her existing datasets or upload a new
dataset. To share a dataset with other users, the user defines a sharing profile. Before the profile definition is
complete, the user must specify the sharing type of each data table and data column as private, anonymized or public.
The user can then share the dataset with other users for a specified time period by entering their user names or
e-mail addresses. When the owner decides to share a dataset publicly, anyone using the system can search for and
access it. A sample user interface for sharing datasets is given in Figure 4.
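
A sharing profile of the kind described above could be modeled as follows. This is a sketch under assumptions: the class, field and column names are hypothetical and do not come from the prototype's schema, and columns that are not listed in the profile are treated as private.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Visibility(Enum):
    PRIVATE = "private"
    ANONYMIZED = "anonymized"
    PUBLIC = "public"


@dataclass
class SharingProfile:
    dataset: str
    recipients: set[str]          # user names or e-mail addresses
    valid_from: datetime
    valid_until: datetime
    columns: dict[str, Visibility] = field(default_factory=dict)

    def is_active(self, user: str, now: datetime) -> bool:
        """True if the user is a recipient and the sharing period has not expired."""
        return user in self.recipients and self.valid_from <= now <= self.valid_until

    def visible_columns(self, user: str, now: datetime) -> list[str]:
        """Columns the user may see; private columns are never returned."""
        if not self.is_active(user, now):
            return []
        return [c for c, v in self.columns.items() if v is not Visibility.PRIVATE]


profile = SharingProfile(
    dataset="hospital",
    recipients={"user_b@example.org"},
    valid_from=datetime(2015, 5, 1),
    valid_until=datetime(2015, 5, 31),
    columns={
        "patient_name": Visibility.PRIVATE,
        "age": Visibility.ANONYMIZED,
        "diagnosis": Visibility.PUBLIC,
    },
)
print(profile.visible_columns("user_b@example.org", datetime(2015, 5, 15)))
# ['age', 'diagnosis']
```

Columns marked as anonymized would still have to pass through an anonymization step before being returned; the sketch only decides which columns are eligible to be shared.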

ICENS International Conference on Engineering and Natural Science, 15-19 May 2015, Skopje, Macedonia

Figure 3. Use-Case Diagram

A registered user can display and download the datasets that other users have shared with him/her through the
“Shared with me” option in the left frame of the page. The user cannot access private tables and columns, and the
tables shared with him/her are anonymized.

The page design is simple, and common UI elements are used to make users feel comfortable. The usage flow is similar
to that of document sharing systems, which gives the user a familiar environment and helps him/her get things done
more quickly. Pre-selected fields reduce the burden on the user. Unnecessary elements are avoided, and plain language
is used in labels and messages.

Figure 4. Sample user interface for sharing datasets

3. EXPERIMENTAL

An ordinary PC with 8 GB of RAM and an i5 3.20 GHz processor is used as the server. CryptDB is installed on a Linux
platform; Ubuntu 12.04 was chosen because the setup process is clearly defined and tested for this version. MySQL
5.5.14 is used as the database and Apache 2.2.22 as the web server. Setup procedures and installation files for
CryptDB can be found on the official CryptDB website [11]. The web interface is coded in Java, using the Eclipse
platform for development.

SEER Research Data [12] is used as a test dataset. This dataset is already anonymized, so no privacy methods are
applied to it. We used it to test uploading data to the database and running privacy tests. We also generated a
hospital (patient/doctor) dataset from dummy values; anonymization and privacy tests are performed on this dataset.

The prototype implements uploading datasets to the database in encrypted form, executing queries on these datasets,
reading data from the encrypted database, and sharing a dataset with another researcher on the system.

In this study, two functions are discussed: “Data sharing using session keys” and “Privacy preserving data
sharing”. In the data sharing phase, user A wants to share a dataset with user B. Session keys are used, and the
process is shown in Figure 5. The process is as follows:

• User A reads the dataset from the database, decrypting it with his/her own key (Key A).

• User A shares this dataset with User B by using the session key.

• User B writes the dataset to the database, encrypting it with his/her own key (Key B).

Figure 5. Dataset Sharing between two users
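
The three sharing steps can be sketched in Python as below. This is an illustration under strong assumptions: the cipher is a toy XOR keystream built from SHA-256 that stands in for a real symmetric cipher and must not be used for real data, the function names are hypothetical, and the way the session key itself reaches User B is not specified in the paper and is not modeled here.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256(key || nonce || counter). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Prefix a fresh random nonce and XOR the plaintext with the keystream."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Split off the nonce and XOR the remainder with the same keystream."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def share_dataset(stored_a: bytes, key_a: bytes, key_b: bytes) -> bytes:
    # Step 1: User A reads the dataset, decrypting it with Key A.
    dataset = decrypt(key_a, stored_a)
    # Step 2: the dataset is passed to User B under a fresh session key.
    session_key = secrets.token_bytes(32)
    in_transit = encrypt(session_key, dataset)
    # Step 3: User B recovers the dataset and stores it encrypted with Key B.
    received = decrypt(session_key, in_transit)
    return encrypt(key_b, received)


key_a = secrets.token_bytes(32)
key_b = secrets.token_bytes(32)
stored_a = encrypt(key_a, b"id,age\n1,34\n")
stored_b = share_dataset(stored_a, key_a, key_b)
print(decrypt(key_b, stored_b) == b"id,age\n1,34\n")  # True
```

In this flow neither user ever learns the other's long-term key; only the short-lived session key is shared between them.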

Before discussing privacy-preserving data sharing, it is useful to explain the privacy threats. The privacy attack
models can be listed as Record Linkage, Attribute Linkage, Table Linkage and Probabilistic Attack [9]. In all types of
linkage attacks, we assume that the attacker knows the attributes that could potentially identify the record owner.
In record and attribute linkage, we also assume that the attacker knows that the victim's record is in the released
table and tries to identify the victim's record and/or sensitive information from the table. In table linkage, the
attacker seeks to determine the presence or absence of the victim's record in the released table. A data table is
considered to be privacy-preserving if it can effectively prevent the attacker from successfully performing these
linkages.
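
A record linkage attempt can be sketched as a simple match on the attributes the attacker already knows. The table contents, attribute names and function name below are invented for illustration only.

```python
def linkage_candidates(released, known):
    """Return released records that agree with everything the attacker knows."""
    return [r for r in released if all(r.get(k) == v for k, v in known.items())]


# Hypothetical released table: no names, but identifying attributes remain.
released = [
    {"zip": "35100", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "35100", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "48000", "birth_year": 1975, "sex": "F", "diagnosis": "flu"},
]

# What the attacker is assumed to know about the victim.
known = {"zip": "48000", "birth_year": 1975, "sex": "F"}

matches = linkage_candidates(released, known)
print(len(matches))  # 1
```

When exactly one record matches, the victim's record and its sensitive value are exposed even though the table contains no names.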

To prevent record linkage, the k-anonymity privacy model has been proposed [7, 13]. If a record in the table has some
value on the attributes that could potentially identify the record owner, then at least k−1 other records must have
the same value. In other words, the minimum group size on these attributes is at least k. A table satisfying this
requirement is called k-anonymous.
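
The definition translates directly into a test: group the records by the potentially identifying attributes and take the size of the smallest group. The following sketch assumes records are dictionaries; the attribute names and sample values are illustrative.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which the table is k-anonymous.

    This is the size of the smallest group of records that share the same
    values on the given attributes (0 for an empty table).
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


records = [
    {"age_band": "30-39", "zip": "351**", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip": "351**", "diagnosis": "flu"},
    {"age_band": "40-49", "zip": "480**", "diagnosis": "diabetes"},
    {"age_band": "40-49", "zip": "480**", "diagnosis": "asthma"},
]

print(k_anonymity(records, ["age_band", "zip"]))  # 2
```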

The proposed model will provide tools for running privacy tests such as k-anonymity to assess the privacy level of
the resulting dataset. Larger k values mean a better privacy level [14]. The data owner (researcher) will be able to
apply the following methods to increase privacy before sharing the data:

• Not sharing a column that contains sensitive information about individuals

• Data summarization

• Statistical methods
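
The effect of the first two methods above can be sketched on a small dummy table. The column names, the 10-year age bands and the helper functions are assumptions for illustration, not the services implemented in the prototype.

```python
from collections import Counter


def drop_columns(records, columns):
    """Withhold the given columns from every record."""
    return [{k: v for k, v in r.items() if k not in columns} for r in records]


def summarize_age(records, width=10):
    """Replace exact ages with bands such as '30-39'."""
    out = []
    for r in records:
        low = (r["age"] // width) * width
        out.append({**r, "age": f"{low}-{low + width - 1}"})
    return out


def smallest_group(records, attributes):
    """Size of the smallest group sharing the same values (the k of k-anonymity)."""
    groups = Counter(tuple(r[a] for a in attributes) for r in records)
    return min(groups.values()) if groups else 0


patients = [
    {"name": "P1", "age": 31, "city": "Izmir", "diagnosis": "asthma"},
    {"name": "P2", "age": 38, "city": "Izmir", "diagnosis": "flu"},
    {"name": "P3", "age": 42, "city": "Mugla", "diagnosis": "diabetes"},
    {"name": "P4", "age": 47, "city": "Mugla", "diagnosis": "asthma"},
]

before = smallest_group(patients, ["age", "city"])
shared = summarize_age(drop_columns(patients, {"name"}))
after = smallest_group(shared, ["age", "city"])
print(before, after)  # 1 2
```

On this dummy data, withholding the name column and summarizing the ages raises k from 1 to 2, which is the kind of feedback a privacy test can give the data owner before sharing.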

4. RESULTS AND DISCUSSION


The CryptDB environment was installed and tested without any problems. According to [8], CryptDB can support
operations over encrypted data for 99.5% of the columns seen in the analyzed query trace, and it has low overhead,
reducing throughput by 14.5% for phpBB and by 26% for queries from TPC-C compared to unmodified MySQL.

The proposed system is not optimized for big data, and the current model is only suitable for sharing small datasets.
However, we believe that this will be sufficient for most surveys, and we will work on enhancing the model in future
studies.

We encountered some challenges when implementing data sharing between users by using session keys. We have
identified some possible solutions and are currently working on them.

5. CONCLUSIONS

Data sharing that preserves privacy and confidentiality is a promising approach as information sharing increases
between researchers, companies and organizations. In this study, we proposed a model in which researchers can share
their data with confidentiality and privacy. The proposed solution writes the datasets to the database in encrypted
form; the throughput reduction reported for CryptDB is between 14.5% and 26%, depending on the workload [8]. The web
interface is easy to use, and the built-in functions help to ensure the privacy of the shared data. We think that
providing such services will increase data sharing among researchers and that many academic studies will benefit from
it. Possible future work includes adding social networking services, artificial intelligence and big dataset
handling. We are also working on applying this model in different implementation areas.

ACKNOWLEDGEMENTS
This research is a joint work between MSKU NetSecLab (http://netseclab.mu.edu.tr/) and DEU SRG | Security
Research Group (http://srg.cs.deu.edu.tr/wp/).

We would like to thank Mehmet Beşir Eren for his contribution to this project (installing and testing the test
environment).

REFERENCES
[1]. C. L. Borgman, “Research Data: Who will share what, with whom, when, and why?,” China-North America Library
Conference, Beijing, 2010.
[2]. J. Furner, “The Conundrum of Sharing Research Data,” Journal of the American Society for Information Science and
Technology, vol. 62, 6, 2011.
[3]. L. Candela, D. Castelli, P. Manghi, A. Tani, “Data journals: A survey,” Journal of the Association for Information Science and
Technology, 2015.
[4]. W. Stallings, Cryptography and Network Security: Principles and Practice, 5th ed., Pearson/Prentice Hall, 2011.
[5]. (2013) OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data. [Online]. Available:
http://www.oecd.org/sti/ieconomy/oecdguidelinesontheprotectionofprivacyandtransborderflowsofpersonaldata.htm
[6]. T. Dalenius, “Towards a methodology for statistical disclosure control,” Statistik Tidskrift, vol. 15, pp. 429–444, 1977
[7]. P. Samarati, and L. Sweeney, “Generalizing data to provide anonymity when disclosing information,” in Proc. of the 17th
ACM SIGACT-SIGMOD-SIGART (PODS), New York, 188, 1998.
[8]. R. A. Popa, C. M. S. Redfield, N. Zeldovich, H. Balakrishnan, “CryptDB: protecting confidentiality with encrypted query
processing,” in Proc. of the 23rd ACM Symposium on Operating Systems Principles. ACM, 2011.
[9]. B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, “Privacy-preserving data publishing: A survey of recent developments,” ACM
Computing Surveys (CSUR), vol. 42, 4(14), 2010.
[10]. M. Assante, L. Candela, D. Castelli, F. Mangiacrapa, P. Pagano, I. Italian, “A Social Networking Research Environment for
Scientific Data Sharing: The D4Science Offering,” An Int. J. Grey Lit. 10, pp. 151–158, 2014.
[11]. CryptDB. [Online]. Available: http://css.csail.mit.edu/cryptdb
[12]. Surveillance, Epidemiology, and End Results (SEER) Program Research Data, National Cancer Institute, DCCPS,
Surveillance Research Program, Surveillance Systems Branch, released April 2014, based on the November 2013 submission.
[Online]. Available: www.seer.cancer.gov
[13]. P. Samarati, and L. Sweeney, “Protecting privacy when disclosing information: its enforcement through generalization and
suppression,” SRI International, Tech. Rep., 1998.
[14]. J. Sedayao, R. Bhardwaj, N. Gorade, “Making Big Data, Privacy, and Anonymization work together in the Enterprise:
Experiences and Issues,” in Proc. of the 3rd International Congress on Big Data, Anchorage, Alaska, pp. 601 – 607, 2014.
