NetApp Deduplication

In 2007, NetApp introduced deduplication technology that significantly decreases storage capacity requirements. NetApp deduplication improves efficiency by locating identical blocks of data and replacing them with references to a single shared block after performing a byte-level verification check. This technique reduces storage capacity requirements by eliminating redundant blocks of data that reside in the same volume or LUN.

NetApp deduplication is an integral part of the NetApp Data ONTAP operating environment and the WAFL file system, which manages all data on NetApp storage systems. Deduplication works "behind the scenes," regardless of what applications you run or how you access data, and its overhead is low.

A common question is, "How much space can you save?" We'll come back to this question in more detail later, but, in general, it depends on the dataset and the amount of duplication it contains.

Significant advantages of NetApp deduplication:

• Operates on NetApp or third-party primary, secondary, and archive storage
• Application independent
• Protocol independent
• Minimal overhead
• Works on both NetApp FAS and V-Series storage systems
• Byte-by-byte validation
• Can be applied to new data or data previously stored in volumes and LUNs
• Can run during off-peak times
• Integrated with other NetApp storage efficiency technologies
• Savings due to deduplication can be inherited when using SnapMirror® or Flash Cache
• Simple evaluation tools
• Free of charge

How Deduplication Is Implemented in Data ONTAP

At its heart, NetApp deduplication relies on the time-honored computer science technique of reference counting. Previously, Data ONTAP kept track only of whether a block was free or in use. With deduplication, it also keeps track of how many uses there are. Using deduplication, a single block can be referenced up to 255 times for both NAS and SAN configurations. Files don't "know" that they are using shared blocks—bookkeeping within WAFL takes care of the details invisibly.
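To make the bookkeeping concrete, here is a minimal Python sketch of reference-counted block sharing with a 255-reference cap. The names (BlockMap, MAX_REFS) are invented for illustration; this is a conceptual model of the idea described above, not NetApp's WAFL code.

```python
# Minimal sketch of reference-counted block sharing (illustrative model,
# not NetApp source code). Each physical block tracks how many references
# point at it; a block can be shared up to MAX_REFS times.

MAX_REFS = 255  # per-block sharing limit described in the article


class BlockMap:
    def __init__(self):
        self.refcount = {}  # physical block number -> number of references

    def allocate(self, pbn):
        """Mark a block as in use by its first reference."""
        self.refcount[pbn] = 1

    def share(self, pbn):
        """Add another reference to an existing block if the cap allows.

        Returns True if the block could be shared, False if the caller
        must keep (or write) a separate physical copy instead.
        """
        if self.refcount.get(pbn, 0) == 0:
            return False                 # freed blocks cannot be shared
        if self.refcount[pbn] >= MAX_REFS:
            return False                 # cap reached: fall back to a new block
        self.refcount[pbn] += 1
        return True

    def release(self, pbn):
        """Drop one reference; the block is free once the count hits zero."""
        self.refcount[pbn] -= 1
        if self.refcount[pbn] == 0:
            del self.refcount[pbn]
```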

Figure 1) How NetApp deduplication works.

How does Data ONTAP decide that two blocks can be shared? The answer is that, for each block, it computes a "fingerprint," which is a hash of the block's data. Two blocks that have the same fingerprint are candidates for sharing.

To avoid slowing down ordinary operations, the search for duplicates is done as a separate batch process. When NetApp deduplication is enabled, it computes a database of fingerprints for all of the in-use blocks in the volume (a process known as "gathering"). Once this initial setup is finished, the data is ready for deduplication.

As data is written during normal use, WAFL creates a catalog of fingerprints for this data; the tight integration of NetApp deduplication with WAFL means that this change logging is an efficient operation. This catalog accumulates until deduplication is triggered by one of the following events, as determined by the storage system administrator:

• A deduplication "start" command is issued manually
• A scheduled deduplication process occurs
• 20% new data has been written to the volume
• A SnapVault® transfer is completed

Once the deduplication process is started, a sorting operation begins, using the fingerprints of the changed blocks as a key. This sorted list is merged with the fingerprint database file. Whenever the same fingerprint appears in both lists, there are possibly identical blocks that can be collapsed into one. In this case, Data ONTAP can discard one of the blocks and replace it with a reference to the other block. We can, of course, take this step only if both blocks are really still in use and contain the same data. To make sure that two blocks are really identical, we do a byte-by-byte comparison after the candidate blocks are identified.

The implementation of NetApp deduplication takes advantage of some special features of WAFL to minimize the cost of deduplication. For instance, every block of data on disk is protected with a checksum. NetApp uses this checksum as the basis for the fingerprint. Since we were going to compute it anyway, we get it "for free"—there is no additional load on the system. And since WAFL never overwrites a block of data that is in use, fingerprints remain valid until a block gets freed, even though the file system is changing all the time.

The upshot is that deduplication can be used with a wide range of workloads, not just for backups, as has been the case with other deduplication implementations.
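The batch process described above can be sketched in a few lines of Python. This is an illustrative model only: the function names, data structures, and the use of a short SHA-256 digest as the fingerprint are assumptions for the example. In Data ONTAP the fingerprint is the existing WAFL block checksum and sharing happens through WAFL's block references, so no separate hash computation or data copy is needed.

```python
# Minimal sketch of a fingerprint-based batch deduplication pass
# (an illustrative model of the process described above, not NetApp code).
# Dict entries stand in for disk blocks; the checksum-based fingerprint
# is approximated here with a short hash.

import hashlib


def fingerprint(data: bytes) -> bytes:
    # Stand-in for the per-block checksum the storage system already maintains.
    return hashlib.sha256(data).digest()[:8]


def dedupe_pass(fingerprint_db, change_log, blocks, refcount):
    """Merge the sorted change log against the fingerprint database.

    fingerprint_db: dict fingerprint -> block number already catalogued
    change_log:     list of (fingerprint, block number) for newly written blocks
    blocks:         dict block number -> bytes (simulated block contents)
    refcount:       dict block number -> reference count
    Returns the number of duplicate blocks eliminated.
    """
    freed = 0
    # Sort newly logged fingerprints so equal fingerprints are processed together.
    for fp, new_blk in sorted(change_log):
        candidate = fingerprint_db.get(fp)
        if candidate is None or candidate == new_blk:
            fingerprint_db[fp] = new_blk        # first time we see this fingerprint
            continue
        # Candidate match: both blocks must still be in use, and they must be
        # verified byte by byte before any sharing happens.
        if (refcount.get(candidate, 0) > 0
                and refcount.get(new_blk, 0) > 0
                and blocks[candidate] == blocks[new_blk]):
            # In the real system the file's block pointer is redirected to the
            # shared block; here we just move the reference count.
            refcount[candidate] += 1
            refcount[new_blk] -= 1
            if refcount[new_blk] == 0:
                del blocks[new_blk]             # duplicate block is freed
                freed += 1
        else:
            fingerprint_db[fp] = new_blk
    change_log.clear()                          # the log is consumed by the pass
    return freed
```

The byte-by-byte comparison before sharing is what makes a fingerprint collision harmless: a false match simply fails verification and both blocks are kept.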

Use Cases

NetApp has been measuring the benefits of deduplication in real-world environments since deduplication was introduced. The most popular use cases are VMware® and VDI, home directory data, and file services. Microsoft SharePoint® and Exchange 2010 are also rapidly gaining traction. Many articles have focused on the particular benefits of deduplication in VMware and VDI environments, which have an inherently high level of file duplication owing to the nearly identical operating system environments used by each virtual machine. The following table summarizes typical results for a variety of environments.

Table 1) Typical deduplication space savings.

In a typical VMware or VDI environment you may have numerous virtual machines (VMs) that are all installed with more or less the same operating system and applications, resulting in a lot of duplication. If you have 100 VMs running the same OS and each virtual machine requires 10GB to 20GB of storage, that's 1TB to 2TB of storage dedicated to almost identical copies. Applying NetApp deduplication can eliminate much of the inherent redundancy. In general terms, if you have X virtual machines assigned to a storage volume, after deduplication you will need approximately 1/X the amount of operating system storage you would require in a nondeduplicated environment. Obviously, the actual results you achieve will depend on how many VMs you have in a volume and how similar they are.

In practice, customers typically see space savings of 50% or more in ESX VI3 environments. This is for deduplication of the entire VMware storage environment, including application data—not just operating systems. In VDI environments, customers typically see even higher space savings, with some obtaining storage savings as high as 90%.
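As a quick sanity check of the 1/X estimate above, here is a small Python sketch. The helper name and the 1,000 GB-per-TB rounding are assumptions for illustration; actual savings depend on how similar the VMs really are.

```python
# Rough estimate of operating-system storage before and after deduplication
# for a set of near-identical virtual machines, per the 1/X rule of thumb above.

def estimate_os_storage_tb(num_vms, gb_per_vm):
    before_tb = num_vms * gb_per_vm / 1000      # X full OS images
    after_tb = before_tb / num_vms              # ~1/X once duplicates are shared
    return before_tb, after_tb

# Example from the article: 100 VMs at 10 GB to 20 GB each.
for gb in (10, 20):
    before, after = estimate_os_storage_tb(100, gb)
    print(f"{gb} GB/VM: {before:.1f} TB before, ~{after:.2f} TB after")
# -> roughly 1 TB to 2 TB before deduplication, on the order of 10-20 GB after
```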

NetApp has also been investigating the benefits of deduplication on the repositories of unstructured file data created by some popular engineering and scientific applications, including Siemens Teamcenter PLM software, IBM Rational ClearCase SCM software, and Schlumberger Petrel software for seismic data analysis.

Teamcenter utilizes a relatively small metadata database combined with a large "vault," where engineering design files are stored. Every time an engineer saves a design within Teamcenter, a complete copy of that design file is saved in the vault, even if the change made to the design is minor. As users create, distribute, and archive data, duplicate data objects are stored across multiple storage devices. NetApp worked closely with Siemens PLM to assess the value of deduplication in a Teamcenter environment using Siemens's performance and scalability benchmarking tool, which simulates the creation of multiple revisions of many design files, as would occur during normal use. Deduplication of the resulting vault yielded a 57% space savings. Results in the real world might be even higher than this, since in many cases the number of file revisions is likely to be higher than that which we simulated. (Of course, in general you have to be careful when using simulators to measure the potential savings of deduplication. Simulated data in many cases will create artificially high amounts of duplicate data, since the focus is typically performance, not data patterns.)

Similar to Teamcenter, IBM Rational ClearCase—a leading software configuration management solution—consists of a metadata database in combination with a large "versioned object base," or VOB, in which files are stored. Deduplication is most useful with ClearCase in situations in which a copy of a VOB needs to be made. In addition, preliminary results in a laboratory environment suggest space savings of 40% or more using deduplication in a ClearCase environment when whole files are stored.

Schlumberger Petrel is used for seismic data interpretation, reservoir visualization, and simulation workflows. It creates project directories that contain huge numbers of files. NetApp observed space savings of approximately 48% by applying deduplication to such project directories.

Using NetApp Deduplication

The basic requirements for running NetApp deduplication are summarized in Table 2.

Table 2) Basic requirements for NetApp deduplication.

In addition to these requirements, attention to a few best practices can help enable success with deduplication, especially for applications that have not been previously tested by NetApp. Some of the important best practices are summarized in this section, along with information about using deduplication in conjunction with other popular NetApp technologies.

• The space savings and the performance impact depend on the application and the data contents, such as those summarized in Table 1.
• Deduplication consumes system resources and can alter the data layout on disk. Due to the application's I/O pattern and the effect of deduplication on the data layout, the read and write I/O performance can change.
• You should measure the performance impact of deduplication, as well as the sizing requirements, in a test environment before deployment.
• If your application creates only a small amount of new data, run deduplication infrequently, because there is very little benefit in running it frequently in such a case.
• The frequency for running deduplication depends on the rate of change of the data in the flexible volume. The more concurrent deduplication scanner processes you run, the more system resources are consumed. The best option may be to do one of the following:
  - Stagger the deduplication schedule for the flexible volumes so that deduplication processes run on alternate days, reducing the possibility of running too many concurrent processes.
  - Run deduplication nightly to minimize the amount of new data to be deduplicated, so it requires less time to complete.
  - Use Auto mode so that deduplication runs only when significant additional data has been written to each flexible volume. (This tends to naturally stagger out when deduplication runs in smaller environments.)
  - Run deduplication manually.
• If a Snapshot copy is created before the deduplication process has completed, it is likely to result in lower space savings. Run the deduplication process before creating Snapshot copies if possible, and make sure that the deduplication process has completed before creating a Snapshot copy.
• For deduplication to run properly, you need to leave some free space for the deduplication metadata. For Data ONTAP versions prior to 7.3, each flexible volume should have 6% of the total data's worth of free space. For Data ONTAP versions 7.3 or later, the aggregate should have 4% (fingerprint + change logs) of the total data's worth of free space for all deduplicated flexible volumes, and each flexible volume should have 2% of the total data's worth of free space. (The sketch following this list works through this arithmetic.)
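Here is a minimal Python sketch of the free-space arithmetic in the last bullet. The helper name and return structure are invented for illustration; the 6%, 4%, and 2% figures are the ones stated above.

```python
# Free space to reserve for deduplication metadata, per the guidance above.
# Helper names and structure are illustrative, not a NetApp tool.

def dedup_metadata_free_space_gb(ontap_version, volume_data_gb):
    """Return the free space (GB) to leave for fingerprint/change-log metadata.

    ontap_version:  version tuple, e.g. (7, 3) for Data ONTAP 7.3
    volume_data_gb: list of total data per deduplicated flexible volume, in GB
    """
    if ontap_version >= (7, 3):
        # 7.3 and later: 4% of all deduplicated data at the aggregate level
        # (fingerprint + change logs) plus 2% inside each flexible volume.
        return {
            "aggregate": 0.04 * sum(volume_data_gb),
            "per_volume": [0.02 * gb for gb in volume_data_gb],
        }
    # Prior to 7.3: all metadata lives in the volume; leave 6% per volume.
    return {
        "aggregate": 0.0,
        "per_volume": [0.06 * gb for gb in volume_data_gb],
    }

# Example: two deduplicated volumes with 500 GB and 1,000 GB of data on 7.3.
print(dedup_metadata_free_space_gb((7, 3), [500, 1000]))
# -> {'aggregate': 60.0, 'per_volume': [10.0, 20.0]}
```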

Deduplication and Other NetApp Technologies

Deduplication is designed to work with other NetApp technologies. In many cases, those technologies get an additional benefit:

• Flash Cache: Provides intelligent caching that accelerates I/O operations and is particularly beneficial with server and desktop virtualization. NetApp deduplication increases the likelihood of a cache hit. When a deduplicated block is in Flash Cache, the probability that it will be requested again is much higher. This effect is referred to as cache amplification.
• Volume SnapMirror: When you replicate a deduplicated volume with volume SnapMirror, the target volume automatically inherits the deduplicated state from the source volume.
• FlexClone: FlexClone® technology instantly replicates data volumes and datasets as transparent, virtual copies. When a FlexClone volume is created:
  - If the parent FlexClone volume has deduplication enabled, the new volume inherits the savings.
  - The cloned volume inherits the deduplication configuration of the parent volume, such as the deduplication schedule.
  - Starting with Data ONTAP 7.3, the deduplication metadata files (the fingerprint database and the change log files) do not get cloned, because they are located outside the volume in the aggregate. Deduplication must be started on the cloned volume for full deduplication to continue.
• Thin provisioning: Deduplication also works in conjunction with NetApp thin provisioning to maximize space savings. For NAS volumes, configuration is straightforward. For LUNs, the recommended settings are: LUN space reservation value = off; volume fractional reserve value = any value from 0–100; volume guarantee = none; snap reserve = 0%; autodelete = on; autosize = on; try_first = volume_grow.
• Compression: NetApp data compression, introduced in Data ONTAP 8.0.1, is highly complementary to deduplication.

Conclusion

Deduplication is an important storage efficiency tool that can be used alone or in conjunction with other storage efficiency solutions such as NetApp thin provisioning, FlexClone, and others. Related topics include:

• Configuration and operation
• Sizing for performance and space efficiency
• Use with other NetApp technologies
• Best practices, including best practices for use with specific applications such as VMware, Microsoft Exchange, Oracle, SQL Server, Lotus Domino, SharePoint, and others
• Troubleshooting