2014 Sixth International Conference on Advanced Computing (ICoAC)

A Study on Data Deduplication Techniques for Optimized Storage

E. Manogar, S. Abirami

Abstract—In recent years, the explosion of data such as text, images, audio, video, data centers and backup data has led to many problems in both the storage and the retrieval process. Enterprises invest a great deal of money in storing this data; hence, an efficient technique is needed for handling such enormous volumes. There are two existing techniques for eliminating redundant data in a storage system: data deduplication and data reduction. Data deduplication is one of the best techniques, since it eliminates redundant data, reduces bandwidth, and minimizes disk usage and cost. Various research papers have been studied from the literature; as a result, this paper attempts to summarize various storage optimization techniques, concepts and categories using data deduplication. In addition, chunk based data deduplication techniques are surveyed in detail.

Index Terms—chunk based deduplication, data deduplication, data reduction, redundant data.

E. Manogar, Department of Information Science and Technology, College of Engineering, Guindy, Chennai, Tamil Nadu, India, e-mail: manogar.info@gmail.com.
S. Abirami, Department of Information Science and Technology, College of Engineering, Guindy, Chennai, Tamil Nadu, India, e-mail: abirami@annauniv.edu.

I. INTRODUCTION

International Data Corporation (IDC) recently estimated that approximately 1.8 zettabytes of data were created and replicated all over the world, and that this figure will grow to 7.9 zettabytes by the end of 2015 [1]. The number of servers managing the world's data will increase more than tenfold over the next decade. IDC also predicted that by the year 2020 the overall data from social networks, satellite information, sensor devices, clothing and structures such as buildings and bridges will have grown by 50 times. Many enterprises are struggling with this growth of information and with the process of protecting it, so enterprises need solutions to manage this explosion of information. Data warehouses take up huge amounts of storage, holding terabytes or petabytes of data. Most data stores contain data derived from other data, so there is a good chance that a lot of duplicate data is stored inside the data warehouse. Storing a large amount of duplicate data can affect performance, bandwidth, storage consistency, and so on. For example, in an email storage system, many copies of the same messages and file attachments are shared by many people, causing data duplication.

To address this, techniques such as data compression and data deduplication can be used to improve storage capacity by reducing the replication of data. Data management industries have shown that efficient management of data is possible by implementing data mining and redundant data detection algorithms, which in turn can effectively reduce the storage requirements of an enterprise. Today Dropbox, an online storage provider, offers free online storage of private data up to 2 GB. Google, a search engine enterprise, has recently added a new product called Google Drive, which also provides free online storage of up to 5 GB. These enterprises are able to achieve this by implementing an efficient data deduplication file system and by managing redundant data efficiently. Most storage solution providers, such as NetApp and EMC, employ data deduplication techniques in their backup solution software. The technique is used to reduce the amount of daily backup data sent from the clients to the backup servers over the Internet, thereby reducing bandwidth and cost.

The rest of the paper is organized as follows: Section 2 summarizes various storage optimization techniques, Section 3 introduces deduplication, the need for deduplication and the various techniques involved in implementing it, and Section 4 concludes the paper.

II. STORAGE OPTIMIZATION TECHNIQUES

There are several kinds of storage optimization methods which can be used to store data: thin provisioning, snapshots, clones, deduplication and compression. The techniques most commonly used in data storage optimization are data deduplication and data compression, which are explained in detail in Sections 3 and 4.

A. Thin Provisioning

Thin provisioning is an emerging storage technology used in storage optimization. It is a shared-storage environment which relies on on-demand allocation of blocks of data when applications run out of storage space, thereby reducing the risk of application failures. This methodology eliminates almost all whitespace, which helps to avoid poor utilization rates and to achieve higher storage utilization [3]. Thin provisioning is one of the best techniques for eliminating the unused capacity of physical disks and thereby achieving higher storage utilization.

Fig. 1 shows traditional storage versus thin-provisioned storage and how storage administrators typically allocate more storage than is required for applications [4]. Volume A contains only 200 GB of physical data out of 500 GB, but the remaining unused 300 GB is allocated for future use.

Fig. 1. Traditional allocation and Thin provisioning

This unused allocated storage cannot be used by any other application. Moreover, the full volume of storage is never used; it is essentially wasted. This is sometimes called stranded storage. Thin provisioning is mainly designed to store the exact amount of data when it is needed, and it removes paid-for but wasted storage capacity [5]. In addition, when more storage is needed, additional volumes can be added to the existing combined storage system. Thin provisioning is thus an on-demand storage system that removes allocated but unused capacity.
A good example is Gmail, where every Gmail account is allocated a large amount of capacity but most Gmail users consume only a small fraction of the allocated storage space.

B. Snapshot Technology

Snapshot technology stores only the changes between datasets that are accessed multiple times for various reasons. Some storage vendors use snapshot technology at the operating-system level to enable access to data at the application level. At present the terms "clones" and "snapshots" are used in confusing ways, so care should be taken when evaluating vendor claims. In particular, a few vendors use full point-in-time copies as "snapshots" or "clones" [5], while others use the same terms to refer to shared-block "delta" snapshots or clones. Some vendors use this technology for read-only snapshots, while others provide writable ones, formally known as "delta snapshot" technology.

C. Clones

Clones are an advanced form of writable snapshots. They are essentially a snapshot volume presented as a 'real' volume that can be modified or changed. Initially clones had limited value and were used primarily for test and development applications. With the rise of virtualization, especially desktop virtualization, clones have become immensely valuable for reducing the storage footprint [5] that these environments require. They can also help improve performance, since hundreds of virtual-machine storage images can now be loaded into a cache.

D. Data Deduplication

Data deduplication is a technique which is used to track and eliminate duplicate chunks (pieces of data) in a storage unit. Many vendors use this technology to implement efficient data storage, although each implementation has its own merits and demerits. Deduplication is most important at the shared storage level [8]; however, it can also be implemented in software and in the database. The most suitable candidates for deduplication are platform virtualization and backup servers, because both applications use and produce a lot of identical or duplicate copies. A few vendors also offer in-place deduplication, which deduplicates primary storage.
Deduplication takes place at the file level and at the block level. File-level deduplication eliminates duplicate or redundant copies of the same file; this type of deduplication is called single instance storage (SIS). Block-level deduplication [9] eliminates redundant or duplicated blocks of data present within otherwise unique files. Block-level deduplication saves more space than SIS [10]; this type of deduplication is also known as variable-block or variable-length deduplication. In the rest of this paper the term data deduplication is used as a synonym for block-level or variable-length deduplication, which is explained in detail in Section 4.
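To make the file-level versus block-level distinction concrete, the sketch below (our illustration, not taken from the paper; the 4 KB block size and the helper names are assumptions) hashes a file once as a whole and then block by block, showing why a one-byte edit defeats whole-file matching while block-level matching still reuses almost every block.

```python
import hashlib

BLOCK_SIZE = 4 * 1024   # assumed 4 KB blocks, for illustration only

def file_level_fingerprint(data: bytes) -> str:
    # File-level (SIS) deduplication keys on one hash of the whole file.
    return hashlib.sha1(data).hexdigest()

def block_level_fingerprints(data: bytes):
    # Block-level deduplication keys on a hash per fixed-size block.
    return [hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

original = b"A" * (40 * 1024)          # a 40 KB file
modified = original[:-1] + b"B"        # the same file with one byte changed

# The whole-file hashes differ, so SIS must store the modified file in full.
print(file_level_fingerprint(original) == file_level_fingerprint(modified))   # False

# Only the last 4 KB block differs, so block-level dedup stores one new block.
changed = sum(a != b for a, b in zip(block_level_fingerprints(original),
                                     block_level_fingerprints(modified)))
print(changed, "block(s) changed out of", len(block_level_fingerprints(original)))
```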
E. Compression

Data compression is a mechanism which saves storage space by removing binary-level redundancy within a data block [11]. Unlike deduplication, compression simply stores each block in its most efficient form; it is not concerned with whether a second copy of the same block exists elsewhere. Since compression works within a data block and looks at only one file at a time, its memory requirements are relatively small [5]. JPG images and compressed audio and video files are everyday examples of file-level compression.
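As a small illustration of why compression alone does not remove cross-block redundancy (our example, not from the paper; zlib is used only because it is a familiar compressor), the snippet below compresses two identical blocks independently: each block shrinks, but the duplicate still costs space twice.

```python
import zlib

block = b"the same highly redundant block of data " * 100   # one data block
copy = bytes(block)                                          # an identical second block

one = zlib.compress(block)
two = zlib.compress(block) + zlib.compress(copy)

# Compression shrinks each block, but two identical blocks still occupy
# space twice; only deduplication would detect and drop the second copy.
print(len(block), len(one), len(two))
```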


III. DATA DEDUPLICATION IN STORAGE

Data deduplication is a fast-growing technology for optimizing storage, and it saves companies a great deal of money by reducing storage and bandwidth costs. It is also helpful for cloud providers, because the technique needs less hardware to store the data.
The advantages of data deduplication are [12]:
1) Hardware costs are reduced.
2) Backup costs are lower.
3) Storage efficiency is increased.
4) Network efficiency is improved and bandwidth consumption is reduced.

Data deduplication is a process used to eliminate redundant data [24]. In this process, duplicated data are deleted and only the unique chunks (sequences of bytes) or patterns identified during analysis are stored, thereby reducing disk space and also the bandwidth used in the network. During the analysis, each block is compared with the blocks already stored on disk; when a duplicate block is found, it is replaced with a reference pointer to the copy already stored on disk. The same block of data may occur hundreds or even thousands of times, and the match frequency depends on the chunk size; by exploiting these matches the data can be transferred and stored in far less space.

A. Deduplication Working Strategy

In simplified terms, data deduplication is a process which compares incoming files or blocks with those that already exist in the storage [13] and removes the blocks that are not unique. The process consists of the following steps:
1) Divide the input data into blocks or "chunks."
2) Calculate a hash value for every block of data.
3) Use the hash value to check whether the same block of data is already present among the stored blocks.
4) If a duplicate block is found, create a reference to it in the database.
5) Based on the result, the duplicate data is eliminated and only the unique chunk is stored.
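A minimal sketch of these steps is given below (our illustration, assuming fixed 8 KB chunks, SHA-1 hashes and a Python dictionary standing in for the reference database; it is not the implementation of any particular product).

```python
import hashlib

CHUNK_SIZE = 8 * 1024        # assumed chunk size
chunk_store = {}             # hash -> unique chunk data (the "database")

def deduplicate(data: bytes) -> list:
    """Return the list of chunk references (hashes) that describe `data`."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]               # 1) divide into chunks
        digest = hashlib.sha1(chunk).hexdigest()     # 2) hash every chunk
        if digest not in chunk_store:                # 3) look up the hash index
            chunk_store[digest] = chunk              # 5) store only unique chunks
        refs.append(digest)                          # 4) record a reference
    return refs

refs = deduplicate(b"hello world" * 10000)
print(len(refs), "references,", len(chunk_store), "unique chunks stored")
```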
IV. DEDUPLICATION IMPLEMENTATION METHODS

There are many techniques for eliminating the redundancy of data stored in a data warehouse. However, the majority of organizations use data deduplication techniques to store the data and to minimize the redundancy problem [8].
Data deduplication is carried out using the following methods, as shown in Fig. 2:
1) Location based deduplication
2) Time based deduplication
3) Chunk based deduplication

A. Location Based Deduplication

The deduplication process can be performed at different locations. Based on where the deduplication takes place, the entire process is carried out either at the source side (client) or at the target side (server) [12].

Fig. 2. Types of Deduplication Technique (location based: source/client or target/server; time based: in-line or post-process; chunk based: single instance storage, fixed size chunking or variable size chunking).


1) Source (Client) side Deduplication: As the name implies, source-based deduplication happens at the client side. Fig. 3 shows that the source-based deduplication process is carried out by placing a dedupe agent at the physical or virtual server. Here, the dedupe agent checks for duplicates against the backup server, and only unique data blocks are transmitted to the disk. This is done before the data goes over the network. The advantage of source-side deduplication is that it has lower bandwidth requirements, since only the changed data gets backed up.

Fig. 3. Source (Client) side Deduplication
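The bandwidth saving can be sketched as a two-step exchange (a simplified illustration of the idea, not vendor code; the function and variable names are assumptions): the client first sends only chunk hashes, and the server asks for just the chunks it does not already hold.

```python
import hashlib

server_index = set()          # hashes of chunks already held by the backup server

def server_missing(hashes):
    # The server replies with the subset of hashes it has not seen yet.
    return set(h for h in hashes if h not in server_index)

def client_backup(chunks):
    hashes = [hashlib.sha1(c).hexdigest() for c in chunks]
    needed = server_missing(hashes)                  # small hash-only exchange
    sent = 0
    for chunk, digest in zip(chunks, hashes):
        if digest in needed and digest not in server_index:
            server_index.add(digest)                 # transmit each new chunk once
            sent += len(chunk)
    return sent

chunks = [b"block-%d" % (i % 3) for i in range(9)]   # nine chunks, only three unique
print("bytes sent:", client_backup(chunks))          # far less than the raw backup size
```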

2) Target side Deduplication: In target-based deduplication [12], the deduplication process is done on the target server. Here the process is carried out after the backup data has been received from the client, as shown in Fig. 4. A deduplication appliance in the backup server handles all of the deduplication. The main advantage of this method is that clients are relieved of the overhead of the deduplication process.

Fig. 4. Target side Deduplication

B. Time Based Deduplication

One of the important criteria to be considered while designing the deduplication process is when to deduplicate the data. Data can be processed at three points: before being written to disk (inline), after being written to disk (post-process), or both before and after being written to disk (hybrid) [8].

1) Inline deduplication: Inline deduplication can be done at the client side or while the data is being transferred from the client/source to the server. It is a process in which the data is deduplicated before it is written to disk. When a block of data arrives at the process/appliance, the appliance analyses whether that data block has already been processed. If the block has been processed before, the redundant block is discarded and a reference to the stored block is written instead. If the block is identified as unique, the process/appliance writes it to storage. Since the analysis of a data block is carried out before it is written, this method of deduplication works in RAM, minimizes I/O overhead and thereby saves disk space. However, it requires substantial resources and can become a network bottleneck [13]. The advantage of inline deduplication is that it does not require extra disk space. Data Domain, Hewlett Packard and Diligent Technologies are a few of the companies offering inline and chunk-based deduplication products [15], [16], [17]. The in-line deduplication process is depicted in Fig. 5.

Fig. 5. In-Line Deduplication

2) Post-Process deduplication: Post-processing operations are performed on the server side. The source data is written to the backup server storage and the duplicates are cleared later. Post-process deduplication starts once the data has been written to disk; the deduplication process then reclaims the allotted data space [14]. The advantage of post-process deduplication is that its write performance is higher than that of in-line deduplication. Another benefit of this method is the ability to share the index and metadata, so clustering for high availability (HA) is easier and data replication can be much more efficient. The disadvantage is the need for a fast disk cache, which typically makes the initial purchase price higher than that of inline-based solutions. Nevertheless, because non-deduplicable data does not need to be stored, post-process deduplication may be more cost-efficient in the long run. ExaGrid EX, FalconStor FDS and Sepaton DeltaStor are a few of the post-process deduplication products on the market [15], [16], [17]. The post-process deduplication process is shown in Fig. 6.
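The difference between the two timings can be sketched as follows (an illustrative toy model under assumed names, not a product implementation): inline deduplication consults the index before every write, whereas post-process deduplication writes everything first and reclaims duplicate space in a later pass.

```python
import hashlib

def digest(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

def inline_write(blocks):
    # Inline: each block is looked up before it ever reaches the disk.
    index, disk = set(), []
    for block in blocks:
        d = digest(block)
        if d not in index:           # only unique blocks are written
            index.add(d)
            disk.append(block)
    return disk                      # duplicates never consume disk space

def post_process(blocks):
    # Post-process: everything lands on disk first (fast ingest), then a
    # background pass removes the duplicates and reclaims the space.
    disk = list(blocks)
    seen, deduped = set(), []
    for block in disk:
        d = digest(block)
        if d not in seen:
            seen.add(d)
            deduped.append(block)
    return deduped

data = [b"a" * 4096, b"b" * 4096, b"a" * 4096]
print(len(inline_write(data)), len(post_process(data)))   # both end with 2 blocks
```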
C. Chunk based deduplication

In this method, data is divided into sequences of bytes called chunks, and the chunks are then examined for redundancy. Chunking is a process which breaks the data into a number of small pieces, called chunks or blocks, and stores only the unique chunks.


Fig. 6. Post-Process Deduplication

Different chunk-based deduplication strategies are available in the literature to remove the redundant data present in the disk or backup system, as discussed below:
1) Single Instance Storage or Whole file chunking
2) Fixed Size Chunking
3) Variable Size Chunking

1) Single Instance Storage (SIS) or Whole file chunking: Single instance storage, or whole-file chunking, does not break files into smaller chunks; rather, it treats the whole file as one chunk. It computes the hash value of the entire file, which serves as the file index. If a new incoming file matches an existing file index, it is considered a duplicate and only a pointer to the existing file index is stored.

2) Fixed Size Chunking: This data deduplication algorithm breaks files into equal-sized chunks whose boundaries are fixed, such as 4 KB or 8 KB. For example, if the chunk size is defined as 4 KB, a file is chunked at 4 KB, 8 KB, 12 KB, 16 KB, 20 KB and so on. A content-based checksum identifies each chunk, and only the chunks whose index does not already exist are stored [22].
This method overcomes a weakness of the SIS approach. Consider a large file in which only a few bytes have changed: SIS has to re-index the file and store it again in full in the backup location. For example, if a document of 5 GB is changed by the user in only 100 KB, the old and new files have different checksums, so SIS stores the full version of both files, resulting in a total of 10 GB. On the other hand, with 8 KB fixed-size chunking only the modified chunks, about ceil(100 KB / 8 KB) x 8 KB = 104 KB of additional data, need to be stored.
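A fixed-size chunker fits in a few lines. The sketch below (our illustration, assuming an 8 KB chunk size and a 1 MB file standing in for the 5 GB example) re-chunks a modified file and counts how many chunks actually have to be stored again.

```python
import hashlib, os

CHUNK = 8 * 1024   # assumed 8 KB fixed chunk size

def fixed_chunks(data: bytes):
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

old = os.urandom(1024 * 1024)                     # a 1 MB file (stand-in for the 5 GB example)
change = os.urandom(100 * 1024)                   # 100 KB modified in place
new = old[:500_000] + change + old[500_000 + len(change):]

old_index = {hashlib.sha1(c).hexdigest() for c in fixed_chunks(old)}
extra = [c for c in fixed_chunks(new)
         if hashlib.sha1(c).hexdigest() not in old_index]

# Only the chunks overlapping the modified 100 KB are new (13 of the 128
# chunks here, roughly 104-112 KB); SIS would have re-stored the whole file.
print(len(extra), "new chunks,", sum(len(c) for c in extra), "bytes to store")
```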


3) Variable Size Chunking: Variable-block chunking is different from fixed-block chunking. Here, the chunk boundaries are determined based on the contents of the file, so it is more resistant to insertions and deletions, and it is now believed to be the best algorithm for backup systems [23]. Similar to the fixed-size chunking method, variable-size chunking follows three important steps:
1) Dividing the file into variable-sized blocks based on content-defined chunk boundaries.
2) Generating the hash values for each block.
3) Identifying the redundant data from the hash values.
The algorithm commonly used for finding the chunk boundaries is the Rabin fingerprinting algorithm, and each chunk can be converted into a hash value using a common hashing technique such as MD5 or SHA-1.
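The sketch below illustrates content-defined chunking. It is an illustration only: instead of a true Rabin fingerprint it uses a simple rolling byte sum over a sliding window (an assumption made to keep the example short), but the principle is the same: a boundary is declared whenever the rolling value matches a chosen bit pattern, so boundaries move with the content rather than with fixed offsets.

```python
import hashlib, os

WINDOW = 48                  # bytes in the rolling window (assumed)
MASK = 0x3FF                 # boundary when the low 10 bits are zero (~1 KB average chunk)
MIN_SIZE, MAX_SIZE = 256, 8 * 1024

def variable_chunks(data: bytes):
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - WINDOW >= start:
            rolling -= data[i - WINDOW]              # slide the window forward
        length = i - start + 1
        if (length >= MIN_SIZE and (rolling & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])         # content-defined boundary
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data = os.urandom(20 * 1024) * 50                    # the same 20 KB repeated 50 times
chunks = variable_chunks(data)
unique = {hashlib.sha1(c).hexdigest() for c in chunks}
# Boundaries realign with the repeated content, so far fewer unique chunks
# remain than the total number of chunks produced.
print(len(chunks), "chunks,", len(unique), "unique chunks after deduplication")
```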

V. COMPARISON OF METRICS ACROSS DEDUPLICATION METHODS

Table I shows the various metrics measured across the deduplication strategies (file level, fixed size and variable size): deduplication ratio, processing time and index overhead. The deduplication ratio is a value that indicates the amount of space saved through the deduplication process. It is the ratio of the total number of input bytes before deduplication to the total number of output bytes after deduplication [18].
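Written as a formula (our restatement of the definition just given, with an illustrative example in the comment):

```latex
\[
\text{Deduplication ratio} \;=\;
\frac{\text{total input bytes before deduplication}}
     {\text{total output bytes after deduplication}}
\]
% Illustrative example: if 500 GB of backup data reduces to 100 GB of stored
% data, the ratio is 500/100 = 5:1, i.e. a space saving of 1 - 1/5 = 80%.
```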

Fig. 7 shows the comparison results of the fixed-size and variable-size chunking methods, observed on the basis of the deduplication ratio. For a small-scale evaluation, files of 100 MB, 200 MB, 300 MB and 400 MB were taken to obtain the results, but deduplication works even better on large datasets.

Fig. 7. Comparison Analysis of Fixed and Variable Size Chunking

Table I. Comparison of performance metrics across file-level, fixed-size and variable-size deduplication methods

Metrics               File Level    Fixed Size    Variable Size
Deduplication Ratio   Less          Medium        High
Processing Time       Medium        Less          High
Index Overhead        Better        Worst         Worst

It has also been observed that file-level deduplication works better on the client side, while variable-size chunking works well on the server side [24].

VI. CONCLUSION

In this paper, we have surveyed the various deduplication techniques. Among them, it can be concluded that variable-size data deduplication performs well compared to the other strategies by comparing the hash of each and every chunk. Hence, this technique improves storage efficiency and thereby improves performance by enabling storage resources to transfer and handle more data. In future, more research work could focus on the variable-size chunking method to reduce its processing time and to optimize large-scale data storage, and on developing efficient methods to reduce fragmentation and obtain high write and read throughput.

REFERENCES

[1] Walid Mohamed Aly, Hany Atef Kelleny, "Adaptation of Cuckoo Search for Documents Clustering," International Journal of Computer Applications (0975-8887), Vol. 86, No. 1, 2014.
[2] John Gantz, David Reinsel (June 2011), "Extracting Value from Chaos," sponsored by EMC Corporation [Online]. Available: http://www.emc.com/
[3] Min Li, Shravan Gaonkar, Ali R. Butt, Deepak Kenchammana, and Kaladhar Voruganti, "Cooperative Storage-Level Deduplication for I/O Reduction in Virtualized Data Centers," IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, pp. 209-218, 2012.
[4] Andre Brinkmann, Sascha Effert, "Snapshots and Continuous Data Replication in Cluster Storage Environments," Fourth International Workshop on Storage Network Architecture and Parallel I/O, IEEE, 2008.
[5] George Crump (2011, September 30). Which Primary Storage Optimization is Best? [Online]. Available: http://www.storage-switzerland.com/
[6] Eunji Lee, Jee E. Jang, Taeseok Kim, Hyokyung Bahn, "On-Demand Snapshot: An Efficient Versioning File System for Phase-Change Memory," IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 12, December 2013.
[7] Kai Qian, Letian Yi, Jiwu Shu, "ThinStore: Out-of-Band Virtualization with Thin Provisioning," Sixth IEEE International Conference on Networking, Architecture, and Storage, IEEE, 2011.
[8] Philipp C. Heckel (2013, May 20). "Minimizing remote storage usage and synchronization time using deduplication and multichunking," [Online]. Available: http://blog.philippheckel.com/
[9] Q. He, Z. Li, X. Zhang, "Data deduplication techniques," Future Information Technology and Management Engineering (FITME), Vol. 1, pp. 430-433, 2010.
[10] Maddodi S., Attigeri G. V., Karunakar A. K., "Data Deduplication Techniques and Analysis," Emerging Trends in Engineering and Technology (ICETET), pp. 664-668, IEEE, 2010.
[11] Sandip Agarwala, Divyesh Jadav, Luis A. Bathen, "iCostale: Adaptive Cost Optimization for Storage Clouds," IEEE 4th International Conference on Cloud Computing, IEEE, 2011.
[12] Chris Poelker (Aug 20, 2013). Intelligent Storage Networking [Online]. Available: http://www.computerworld.com/
[13] Benjamin Zhu, Kai Li, and Hugo Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," Proc. of the USENIX File and Storage Technologies, 2008.
[14] D. T. Meyer, W. J. Bolosky (2012), "A Study of Practical Deduplication," [Online]. Available: http://static.usenix.org/
[15] Data Domain LLC. Data Domain Boost Software. [Online]. Available: http://www.datadomain.com/
[16] Symantec Corporation. Symantec NetBackup PureDisk. [Online]. Available: http://www.symantec.com/
[17] ExaGrid Systems. ExaGrid EX Series Product Line. [Online]. Available: http://www.exagrid.com/
[18] M. Dutch, "Understanding data deduplication ratios," SNIA Data Management Forum, 2008.
[19] K. Jin and E. L. Miller, "Deduplication on Virtual Machine Disk Images," Ph.D. thesis, University of California, Santa Cruz, 2010.
[20] Dave Cannon (March 2009), Data Deduplication and Tivoli Storage Manager. [Online]. Available: https://www.ibm.com
[21] N. Mandagere, P. Zhou, M. A. Smith, and S. Uttamchandani, "Demystifying data deduplication," Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion, pp. 12-17, ACM, 2008.
[22] Deepavali Bhagwat, Kave Eshghi, Darrell D. E. Long, and Mark Lillibridge, "Extreme Binning: Scalable, parallel deduplication for chunk-based file backup," MASCOTS, pp. 1-9, IEEE, 2009.
[23] Jin-Yong Ha, Young-Sik Lee, Jin-Soo Kim, "Deduplication with Block-Level Content-Aware Chunking for Solid State Drives (SSDs)," IEEE International Conference on High Performance Computing and Communications & IEEE International Conference on Embedded and Ubiquitous Computing (HPCC-EUC), pp. 1982-1989, 2013.
[24] Daehee Kim, Sejun Song, Baek-Young Choi, "SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems," pp. 130-137, IEEE, 2013.
