
Cloud-based Synchronization of Distributed File System Hierarchies


Sandesh Uppoor∗, Michail D. Flouris∗, and Angelos Bilas∗†

Foundation for Research and Technology - Hellas (FORTH)
Institute of Computer Science (ICS)
100 N. Plastira Av., Vassilika Vouton, Heraklion, GR-70013, Greece
† Department of Computer Science, University of Crete

P.O. Box 2208, Heraklion, GR 71409, Greece


{suppoor, flouris, bilas}@ics.forth.gr

978-1-4244-8396-9/10/$26.00 © 2010 IEEE

Abstract—As the number of user-managed devices continues to increase, the need to synchronize multiple file hierarchies distributed over devices with ad hoc connectivity is becoming a significant problem. In this paper, we propose a new approach for efficient cloud-based synchronization of an arbitrary number of distributed file system hierarchies. Our approach combines the advantages of peer-to-peer synchronization with those of the cloud-based approach that stores a master replica online. In contrast to the latter, we do not assume storage of any user data in the cloud, so we address the related capacity, cost, security, and privacy limitations. Finally, the proposed system performs data synchronization in a peer-to-peer manner, eliminating the cost and bandwidth concerns that arise in the “cloud master-replica” approach.

Index Terms—File synchronization, data management, distributed storage, cloud service.

I. INTRODUCTION

Users increasingly depend on electronic information that is processed and stored in a multitude of computers and storage devices, connected by both wired and wireless networks. The number and variety of devices have increased recently due to the significant growth of affordable smart devices, as well as improvements in the quality and cost of the networking infrastructure. The increased number of devices and locations, and the fact that many of these devices have limited storage capacity and exhibit ad hoc network connectivity, such as frequent off-line operation and/or movement through different networks, introduce many challenges in managing user data distributed over multiple devices and many locations.

One of the challenges in maintaining distributed user data is synchronizing replicated data sets between an arbitrary number of nodes. This is useful not only for backup purposes, but also for off-line operation, where a user with a device of limited capacity needs to selectively replicate part of a data set, modify it off-line, and then re-synchronize with the main data set when networking conditions allow it.

Currently there are two main approaches to the distributed synchronization problem: user-controlled peer-to-peer synchronization and cloud master-replica synchronization. In the first approach, users install synchronization software, such as rsync [1], Unison [2] or Sync Butler [3], on all computers containing the data files and explicitly initiate a peer-to-peer synchronization whenever they need to synchronize their files. This is a “manual” process that requires the user to provide the network addresses of all devices, as well as the synchronization parameters, such as the direction of the synchronization, which files to keep or overwrite, etc. Furthermore, most of these synchronization applications operate pair-wise, that is, they synchronize only a pair of nodes each time. Managing synchronization for three or more file system hierarchies can be a cumbersome and error-prone process, especially for the average user without technical skills [2]. Finally, the synchronization software is usually not portable across operating systems, not automated (it requires user intervention), and not suitable for small mobile devices with ad hoc connectivity.

The second approach, cloud master-replica synchronization, solves some of these issues by employing cloud services to automate synchronization and to deal with small mobile devices with ad hoc connectivity. In the approach followed by many cloud-based synchronization and backup services (e.g. [4], [5], [6], [7]), a master replica with all data to be synchronized is maintained as a central copy in the cloud, while all the user's file systems synchronize and transfer updates to this central copy. This approach, with one-way synchronization to the master replica in the cloud, offers increased availability and reliability in case of device failures, automates synchronization management, and does not require technical skills. On the other hand, it assumes that all of the user's data will be stored online in the cloud. This is not only expensive, because cloud storage capacity comes at a cost, but also introduces important security and privacy concerns. Furthermore, it requires significant (upload) bandwidth between all replicas and the central copy in the cloud, which may be limited and/or expensive, while local, peer-to-peer connectivity between some of the replicas is cheap and high-speed (e.g. a laptop and a mobile phone synchronizing over a home LAN with limited upload bandwidth to the Internet).

Note that the approaches mentioned above operate at the user level, and not in the operating system, as file systems do.
Unfortunately, file systems that support replication and offline operation mandate the use of the specific file system, which is often not an option for users.

In this paper, we propose a new approach for efficient cloud-based synchronization of distributed file system hierarchies, dealing with an arbitrary number of synchronization nodes. Our approach retains the advantages of both peer-to-peer synchronization and the cloud-based master-replica approach. It does not assume storage of any file data in the cloud, so it addresses the capacity, cost, security, and privacy issues. Moreover, the proposed solution performs data synchronization in a peer-to-peer manner, eliminating the cost and bandwidth concerns that arise in the “cloud master-replica” approach. Our main contribution is the design of new algorithms and protocols that are more complex than the aforementioned approaches, but provide more benefits and fewer drawbacks.

This paper presents the design of our system, Synchronizer. Since a system prototype is currently under development, we do not present implementation details or evaluation results.

The rest of this paper is organized as follows. Section II discusses the design of Synchronizer and how it addresses the associated challenges. Finally, we draw our conclusions and discuss future work in Section III.

II. SYSTEM DESIGN

In our approach we assume a scenario where a user needs to synchronize a number of data files stored in the file systems of multiple devices (or nodes), each located in a different location with ad hoc connectivity. In the rest of this section, we consider the terms “device” and “replica” as synonymous. We also denote the hierarchy of directories, with its data files and sub-directories, in a single file system as a filetree. For simplicity we will assume that a device (or replica) contains a single filetree to be synchronized.

1) Assumptions: Our synchronization scheme requires the use of metadata for every file in all devices. Metadata consists of file and directory information, i.e. file system metadata, as well as a cryptographic hash (e.g. SHA256) of a file's data content. The hash is used to compare data across files, and to detect data modifications of a particular file in a filetree. Given a file f, its metadata is defined as

    M(f) = {path (directory), name, size, creation time, last modification time, data hash}

Besides the devices storing the filetrees, we assume a cloud service, i.e. a node in the Internet cloud that provides the synchronization services, as shown in Figure 1. This is necessary in order to automate synchronization, scale up to any number of devices, and address the issues of devices with ad hoc connectivity. We also assume that the cloud service is able to connect and communicate with the devices over the Internet, when they are able to connect, and collect metadata of the filetrees from each device. Each device may periodically send updates for its filetree(s) to the cloud service; however, we do not make an assumption on the update model (push or pull), as long as the filetree metadata on the cloud service are updated frequently.

Fig. 1. Cloud service and user devices.

2) Initial synchronization: Before the initial step of our system, we assume that the user has configured the cloud service with information on how to connect to the devices and which filetrees need to be synchronized. This can be done through a user-friendly interface, such as a web interface.

In order to retrieve the current filetree state, the cloud service connects to the devices and gets the updated metadata. Then, it calculates the file operations that need to be performed across replicas. The first time a synchronization process is performed between filetrees of different nodes, there is no previous state stored in the cloud service, in contrast to the subsequent synchronization operations. In this initial step, we create the synchronization state by performing the basic synchronization algorithm between two filetrees, as detailed in Algorithm 1 (First Sync). File operations in the algorithm, e.g. delete, rename, copy, are considered atomic. The copy operation used in Algorithm 1 refers to the transfer of a file from one replica to another. It can be either a full file transfer, when the file is encountered as new, or a transfer of just the modification, when it is encountered as modified. Rename refers to changing the name of an existing file, or moving it across directories within the same device.

After the cloud service computes the operations for all files on all filetrees, it sends them to the replicas to be performed through the coordinated peer-to-peer networking protocol discussed in II-5.

3) Stateful synchronization: Once the initial operations sent by the cloud service have been executed on the replicas successfully, the replicas are known to be in a synchronized state (or simply “sync state”), where all replicas have the same copy of the data or filetree. This “sync state” of each replica, computed by the initial synchronization, is stored in the cloud service. Note that this state consists only of filetree metadata, not data. Whenever a subsequent synchronization needs to be performed, the “diff sync state”, i.e. the changes to the filetree since the previous sync, can be obtained as the difference between the stored previous sync state and the current filetree state read from the replica.

For a replica A,

    P_A = previous sync state of A
    C_A = current state of A
    D_A = diff state of A
    D_A = C_A - P_A

The diff states of replicas are carefully analyzed to detect individual changes to files. A “file change” refers to either a file content modification, a file rename, a file creation or
a file deletion, with respect to its previous sync state. In the diff state of a replica, each file is in one of five states: modified, renamed, deleted, new or unmodified. The detection of file changes in a replica and the computation of the diff state are described in Algorithm 2 (File classification).

Algorithm 1 First Sync for two filetrees.
Let FT_A and FT_B be two filetrees, on nodes A and B respectively. d refers to a directory name (absolute path from the root of FT), f to a file name (within a directory), and hash(f) to the hash of a file's data.

    for d in FT_A do
        if d' exists in FT_B, such that d' == d then
            for f in d do
                if f' does not exist in d', such that f' == f then
                    f is a new file to be copied to d'
                else if f' exists in d', such that f' == f then
                    if hash(f) != hash(f') then
                        file f exists but is modified
                        if f_Timestamp > f'_Timestamp then
                            copy modification to f'
                        else
                            copy modification to f
                        end if
                    else if hash(f) == hash(f') then
                        same file f exists unmodified
                    end if
                else if f' exists in d', such that hash(f') == hash(f) then
                    file f exists in FT_B with same content but different name
                    if f_Timestamp > f'_Timestamp then
                        rename f' to name(f)
                    else
                        rename f to name(f')
                    end if
                end if
            end for
        else
            create d in FT_B and copy all files in d to FT_B
        end if
    end for

Algorithm 2 File classification.
Let P_A, P_B, P_C be the previous sync states, and C_A, C_B, C_C the current states of replicas A, B and C respectively.
Let D_A, D_B, D_C be the diff states between the previous sync state and current state of each replica.
Let D_A^Modified be the set of Modified files, D_A^Renamed the set of Renamed files, D_A^New the set of New files, and D_A^Delete the set of Deleted files in replica A after a previous sync.

    for node X in A, B, C do
        for file f in C_X and f' in P_X, such that f' == f do
            if hash(f_CX) != hash(f_PX) and name(f_CX) == name(f_PX) then
                f is classified as Modified → D_X^Modified
            else if name(f_CX) != name(f_PX) and hash(f_CX) == hash(f_PX) then
                f is classified as Renamed → D_X^Renamed
            else if f is not in P_X then
                f is classified as New → D_X^New
            else if f' is not in C_X then
                f is classified as Deleted → D_X^Delete
            else
                f is unmodified
            end if
        end for
    end for

4) Conflict detection and handling: When a file is modified in one or more replicas, we need to find the latest version of the file and propagate it to the other replicas. The case of a file being independently modified in more than one replica is considered a conflict between file versions. Detecting the correct version of the file in the case of conflicts is difficult and dangerous. An incorrect decision in the case of a conflict will result in data loss, since the correct file version is overwritten with another. In many cases, resolving the ambiguity of a conflict requires user intervention and merging of files at the application level (e.g. a document independently modified in two replicas may contain changes in different parts of it that need to be merged using a word processing application). Most existing synchronization applications or cloud services require user intervention for resolving conflicts, or just keep all file versions with different file names [2], [4].

Conflicts can be categorized in four types, as shown in Table I. The diff states of any two replicas are shown along the two dimensions of Table I, while the conflict category depends on the independent changes made to a file in each replica. Our algorithm is able to detect all conflicts; however, due to lack of space we present only the part of it, Algorithm 3, that detects a Mod-Mod conflict. A conflict of type Mod-Mod refers to modifications made to a file f independently in two or more replicas.

TABLE I
CONFLICT TYPES.

    State of FT_X / FT_Y    Mod        Ren        Del
    Modified                Mod-Mod    Mod-Ren    Mod-Del
    Renamed                 Ren-Mod    Ren-Ren    Ren-Del
    Deleted                 Del-Mod    Del-Ren    —

Conflict resolution is an important component in every synchronization system. Our approach offers two alternatives, “Manual” or “Auto” mode. Manual mode requires user input in selecting which file(s) to keep, whereas Auto mode provides the option of setting up some conflict resolution policies. The current options for policies are: (i) Keep the most recent
version depending on timestamps, (ii) Keep all conflicted file versions with different names. Once all conflicts are resolved, either via manual or auto mode, the cloud service starts computing the operations to be performed by the replicas.

Algorithm 3 Mod-Mod conflict detection.

    for f in D_A^Modified do
        if f in D_B^Modified or f in D_C^Modified then
            f is in Mod-Mod Conflict
        end if
    end for

5) Update propagation: Before discussing the processing of a conflict-free sync state, it is important to discuss the planned propagation of file operations to the replicas. The user is likely to have updated files in many replicas, so the source of file updates is usually not a single replica, but several replicas. In this case, communication may become complex, because it involves transferring files between all replicas. An example with 4 replicas is shown in Figure 2, where updates from each replica should be propagated to all other replicas. This requires all-to-all communication, which is impractical with firewalls between replicas, and increases the protocol complexity needed to handle ad hoc communication between replicas (i.e. failures).

Fig. 2. Peer-to-peer replica synchronization. (Replicas A, B, C and D.)

Instead of all-to-all communication, we consider an approach where one replica acts as the master or coordinator. Thus, when calculating the update operations between all replicas, we consider one replica as the master and batch all update operations into phases, where each phase batches the operations between the master and one other replica. Due to lack of space we omit details of processing the diff state of filetrees and of performing optimizations to save network bandwidth, for example in the case of renamed files or directories. We only describe a summary of the process using phases between the coordinator and the rest of the replicas.

All operation phases are computed in the cloud service and then sent to the master replica for performing peer-to-peer synchronization between all replicas. To propagate all the updates and to attain a complete sync state with each replica, we need two rounds of communication with the master replica. An example of the first round is shown in Figure 3. As shown, when phase 1 is executed between the master replica A and replica B, all updates from B are propagated to A; hence, at the end, replica A contains the updates of A+B. Similarly, when phase 2 is executed with the master replica, all A+B updates from the master are propagated to C and the updates from C are propagated to the master, which at the end will contain the updates of A+B+C. After the master replica executes the first round of phases with all replicas, i.e. phases 1, 2 and 3, during which all replicas propagate the latest version of their updates to the master replica, the master will contain the latest versions of all files and directories in all replicas. During the second round of phases, i.e. phases 4, 5 and 6, the latest versions of files and directories from the master replica are propagated to the rest. So, for example, when the master replica containing A+B+C+D executes phase 4, all updates from C and D are propagated to B, but there are no updates from B to the master replica, since they have already been propagated in the first round, in phase 1. In case any replica is unreachable or fails during the propagation of updates, the process is aborted and the operations are recalculated in the cloud service without the faulty replica. This is necessary to avoid overwriting a newer file version with an older one, and is orthogonal to the communication scheme used (coordinated or all-to-all file update protocol).

Fig. 3. Synchronization with a master replica. (Cloud service; Replica A as master; phases 1 and 4 with Replica B, phases 2 and 5 with Replica C, phases 3 and 6 with Replica D.)

III. CONCLUSIONS

In this paper, we propose a new approach for efficient cloud-based synchronization of an arbitrary number of distributed file system hierarchies. Our approach combines the advantages of peer-to-peer synchronization with those of the cloud-based approach that stores a master replica online. In contrast to the latter, we do not assume storage of any user data in the cloud, so we address the related capacity, cost, security and privacy limitations. Finally, the proposed system performs data synchronization in a peer-to-peer manner, eliminating the cost and bandwidth concerns that arise in the “cloud master-replica” approach. Since a system prototype is currently under development, we do not present implementation details or evaluation results.

ACKNOWLEDGMENTS

We thankfully acknowledge the support of the European Commission under the 7th Framework Programme through the FP7 ITN project SCALUS (contract no. 238808).

REFERENCES

[1] A. Tridgell, “Efficient algorithms for sorting and synchronization,” 2000.
[2] B. C. Pierce and J. Vouillon, “What's in Unison? A formal specification and reference implementation of a file synchronizer,” tech. rep., 2004.
[3] “Sync Butler.” http://code.google.com/p/syncbutler.
[4] D. Houston and A. Ferdowsi, “Dropbox.” www.dropbox.com.
[5] CodeDroids, “OneSync.” http://onesync.googlecode.com.
[6] Siber Systems Inc., “GoodSync.” http://www.goodsync.com/support/manual.
[7] L. Yecies and D. Mihovilovic, “SugarSync.” https://www.sugarsync.com/.
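
As an illustration only, the A-to-B pass of Algorithm 1 (First Sync, Section II-2) could be sketched in Python as follows. The in-memory filetree layout, the operation names and the function first_sync_ops are assumptions made for this sketch and are not part of Synchronizer. The sketch also checks for a file with the same content under another name before treating a file as new, which is an assumption about the intended order of the tests in Algorithm 1.

```python
def first_sync_ops(ft_a, ft_b):
    """Plan operations for one A-to-B pass when no sync state is stored yet.

    Each filetree maps directory -> {file name -> (data_hash, mtime)}.
    Returns a list of (operation, target replica, detail) tuples.
    All names here are hypothetical; they only illustrate Algorithm 1.
    """
    ops = []
    for d, files_a in ft_a.items():
        if d not in ft_b:
            # Directory missing on B: create it and copy its files.
            ops.append(("create_dir_and_copy_all", "B", d))
            continue
        files_b = ft_b[d]
        # Rename candidates: files on B whose name does not exist on A.
        by_hash_b = {h: n for n, (h, _) in files_b.items() if n not in files_a}
        for name, (h, mtime) in files_a.items():
            if name in files_b:
                h_b, mtime_b = files_b[name]
                if h != h_b:
                    # Same name, different content: the newer side is the source.
                    target = "B" if mtime > mtime_b else "A"
                    ops.append(("copy_modification", target, f"{d}/{name}"))
            elif h in by_hash_b:
                # Same content under another name: adopt the newer name.
                other = by_hash_b[h]
                if mtime > files_b[other][1]:
                    ops.append(("rename", "B", f"{d}/{other} -> {name}"))
                else:
                    ops.append(("rename", "A", f"{d}/{name} -> {other}"))
            else:
                ops.append(("copy_new", "B", f"{d}/{name}"))
    return ops


ft_a = {
    "docs": {"a.txt": ("h1", 10), "b.txt": ("h2", 20), "new.txt": ("h5", 5)},
    "pics": {"p.png": ("h9", 1)},
}
ft_b = {"docs": {"a.txt": ("h1x", 15), "c.txt": ("h2", 12)}}
ops = first_sync_ops(ft_a, ft_b)
```

Like Algorithm 1, this pass only iterates over the directories of A; files that exist only on B would need a second pass with the roles of the two filetrees swapped.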

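
The file classification of Algorithm 2 and the Mod-Mod detection of Algorithm 3 (Sections II-3 and II-4) could be sketched along the following lines. This is a minimal sketch under stated assumptions: a sync state is reduced to a mapping from file path to data hash, a rename is inferred when a vanished path and a new path share the same hash, and the function names classify and mod_mod_conflicts are hypothetical.

```python
def classify(previous, current):
    """Compute a diff state D = C - P for one replica.

    `previous` and `current` map a file path to its data hash (an assumed,
    simplified form of the metadata M(f)).
    """
    diff = {"Modified": set(), "Renamed": set(), "New": set(), "Deleted": set()}
    # Paths that vanished since the previous sync, grouped by content hash.
    vanished = {}
    for path, data_hash in previous.items():
        if path not in current:
            vanished.setdefault(data_hash, []).append(path)
    for path, data_hash in current.items():
        if path in previous:
            if data_hash != previous[path]:
                diff["Modified"].add(path)
        elif vanished.get(data_hash):
            # Same content reappears under a new path: treat as a rename.
            diff["Renamed"].add((vanished[data_hash].pop(), path))
        else:
            diff["New"].add(path)
    # Whatever vanished without a matching new path was deleted.
    for paths in vanished.values():
        diff["Deleted"].update(paths)
    return diff


def mod_mod_conflicts(diffs):
    """Return the files that appear as Modified in two or more replicas."""
    seen, conflicts = set(), set()
    for diff in diffs.values():
        conflicts |= seen & diff["Modified"]
        seen |= diff["Modified"]
    return conflicts


previous = {"docs/a.txt": "h1", "docs/b.txt": "h2", "docs/c.txt": "h3"}
current_a = {"docs/a.txt": "h1x", "docs/b2.txt": "h2", "docs/d.txt": "h4"}
current_b = {"docs/a.txt": "h1y", "docs/b.txt": "h2", "docs/c.txt": "h3"}
diff_a = classify(previous, current_a)
diff_b = classify(previous, current_b)
conflicts = mod_mod_conflicts({"A": diff_a, "B": diff_b})
```

In this example docs/a.txt is modified independently on both replicas, so it is the only Mod-Mod conflict; the other conflict types of Table I are not covered by the sketch.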
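
The two-round, master-coordinated propagation of Section II-5 can be illustrated with a small simulation. It only tracks which replicas' updates each replica holds after every phase; the functions plan_phases and simulate, and the assumption that every phase exchanges updates in both directions, are introduced here for illustration and do not describe the actual protocol implementation.

```python
def plan_phases(master, others):
    """Return (phase number, master, peer) tuples for two rounds over all peers."""
    return [(i + 1, master, peer) for i, peer in enumerate(others * 2)]


def simulate(master, others, stop_after=None):
    """Track whose updates each replica holds after the executed phases.

    Assumes each phase synchronizes the master and one peer in both directions.
    """
    held = {r: {r} for r in [master, *others]}
    for phase, m, peer in plan_phases(master, others):
        if stop_after is not None and phase > stop_after:
            break
        # After a phase, master and peer both hold the union of their updates.
        held[m] = held[peer] = held[m] | held[peer]
    return held


peers = ["B", "C", "D"]
after_round_1 = simulate("A", peers, stop_after=3)
after_round_2 = simulate("A", peers)
```

With master A and peers B, C and D, the first round leaves only A and D with all updates (B still holds just A+B), and the second round brings B and C up to date, which matches the phase description in Section II-5.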