
Technical Report:

NetApp Deduplication for FAS

Deployment and Implementation Guide


Network Appliance, Inc. | Bill May, Data Protection and Retention Technical Marketing | 16 April 2008 | TR-3505

4th Revision


NetApp, a pioneer and industry leader in data storage technology, helps
organizations understand and meet complex technical challenges with
advanced storage solutions and global data management strategies.

Abstract
This guide introduces the NetApp deduplication for FAS technology and describes in detail how to
implement and utilize it.
It should prove useful for customers requiring assistance in understanding and architecting solutions
with deduplication for FAS and NetApp storage systems.


NetApp, Inc.
Technical Report: 16 April 2008
NetApp Deduplication for FAS TR-3505
Deployment and Implementation Guide 4th Revision

Table of Contents

1 Introduction
1.1 Intended Audience
1.2 Purpose
1.3 Prerequisites and Assumptions
1.4 Document Conventions
2 Overview
2.1 NetApp Deduplication Technologies
2.1.1 SnapVault for NetBackup™
2.1.2 NetApp Deduplication for FAS
2.2 Dense Volumes
2.3 Deduplication Features and Functions
2.3.1 General Deduplication Operational Considerations
3 Configuration and Operation
3.1 Requirements Overview
3.2 Installing and Licensing Deduplication
3.2.1 Deduplication Licensing in a Clustered Environment
3.3 Command Summary
3.4 Deduplication Quick Start Guide
3.5 Monitoring Deduplication Status
3.6 End-to-End Deduplication Configuration Example
3.7 Configuring Deduplication Schedules
4 Operating Characteristics
4.1 Deduplication Target Environment
4.2 Deduplication Performance
4.3 Deduplication Storage Savings
4.4 Additional Deduplication Considerations
4.4.1 Number of Deduplication Processes
4.4.2 Deduplication and Active/Active Configuration
4.4.3 Deduplication and Space Savings on Existing Data
4.4.4 Deduplication Best Practices
5 Common Problems and Troubleshooting
5.1 Licensing
5.2 Volume Sizes
5.3 Logs and Error Messages
5.4 Other Issues
5.5 Not Seeing Space Savings
5.6 Undeduplicating a Flexible Volume
5.7 Additional Reporting with “sis stat -l”
5.8 Deduplication and Reboots
6 Deduplication and Replication
6.1 Replicating a Deduplicated Flexible Volume for DR
6.2 Replicating Primary Data to a Deduplicated Flexible Volume


1 Introduction

1.1 Intended Audience


This technical report is designed for customers who seek education on the NetApp deduplication for
FAS capability introduced in Data ONTAP® 7.2L1, with the current minimum requirement of Data
ONTAP 7.2.4.
It will be most beneficial to those who are already familiar with NetApp hardware and software.

1.2 Purpose
The purpose of this paper is to present a guide for implementing NetApp deduplication for FAS. It will
address step-by-step configuration examples, introduce known caveats and recommendations to
assist the reader in designing optimal solutions, and prepare the audience for performing
deployments of the technology in customer environments.
Its use is threefold:
• Provide detailed information to all interested parties.
• Educate prior to performing deployments.
• Serve as a reference for resolving issues that could arise.
This document is not:
• A sales guide (although some high-level thoughts are covered in the “Overview” section)
• A competitive comparison
• A complete product design document

1.3 Prerequisites and Assumptions


For various details and procedures described in this document to be most useful to the reader, the
following assumptions are made:
• The reader has general knowledge of NetApp platforms and products, particularly in the area
of data protection.
• The reader has general knowledge of backup protection, data retention, and disaster
recovery solutions.

1.4 Document Conventions


• While “NetApp deduplication for FAS” is the official solution name, for brevity’s sake in this
document it will typically be simply called “deduplication.”
• Note that the original name was “Advanced Single Instance Storage (A-SIS),” so when
referring to other documents that may still use that name, be aware that it is synonymous
with NetApp deduplication for FAS.


2 Overview
This section provides a quick overview of deduplication in general and then introduces what NetApp
deduplication for FAS is and how it works at a high level.

2.1 NetApp Deduplication Technologies


Since its beginning NetApp has been an innovator in delivering storage solutions and continues to
invent new capacity optimizing technologies that reduce the cost of data storage. The following are
some of the basic products/features that deliver the value:
• Snapshot™ for disk- and network-efficient recovery copies
• SnapVault® for disk- and network-efficient backups
• FlexVol® for space-efficient volume provisioning
• FlexClone® for space-efficient test and development copies

While all these technologies offer the benefit of reducing the amount of required storage, in the
marketplace they are often not considered “deduplication” technologies when compared to solutions
offered by other vendors. That sentiment, while not entirely accurate, is understood, and NetApp
continues to expand its portfolio with several technologies for further deduplication of data. The
following subsections cover two of the solutions that are available as of the writing of this paper;
additional deduplication technologies are coming in both the short term and the more distant future.
Before delving into technical solutions, it makes sense to understand the value of deduplication to
customers. The primary advantage of data deduplication is that it conserves physical disk space
when storing data on disk. The average UNIX® or Windows® disk volume contains thousands of
duplicate data strings. Traditionally, when copies of these volumes are created, every duplicate data
string is also copied, resulting in an inefficient use of secondary storage. Deduplication helps to
remove this inefficiency and yields a more effective cost per gigabyte in the data center.

Figure 1) Reduced storage costs with deduplication.


2.1.1 SnapVault for NetBackup™


The first industry-recognized deduplication technology from NetApp was SnapVault for NetBackup,
which provides space savings similar to those provided by SnapVault to traditional NetBackup
environments. This solution integrates the NetApp secondary storage system as an optimized backup
repository for heterogeneous (not NetApp) primary storage. Its value is based on the assumption that
a file in the same data set and path but in different backups is likely to have a lot of blocks in
common.
Backups written to a NetApp storage unit utilize less disk space when compared to traditional disk
storage units. After an initial client backup is performed, the Network Appliance™ Write Anywhere
File Layout (WAFL®) file system saves only changed blocks when subsequent backups are
performed for the same client, providing single-instance storage (SIS) deduplication of the additional
backup images.
To NetBackup, the backup on the NetApp system looks like a standard NetBackup TAR image
backup, allowing most normal NetBackup operations (duplication, synthetics, vaulting, and so on) to
be performed. To end users, the backup on the NetApp system looks like a standard WAFL file
system, accessible through NFS and CIFS.
SnapVault for NetBackup (SV-NBU) was released as a joint solution as part of Data ONTAP 7.1 and
NetBackup 6.0, with a focus on data protection.

2.1.2 NetApp Deduplication for FAS


Unlike SV-NBU, which performs block-level deduplication only for the same client/policy/directory/file,
and only for use with NetBackup, NetApp deduplication for FAS deduplicates blocks anywhere in the
active file system within the entire flexible volume, regardless of how the data got there.
In its initial release, deduplication primarily had a focus on data retention/archiving of file system data
on secondary storage NetApp systems. Substantial storage savings can be achieved with
deduplication in some tier 2 primary storage environments as well.
While NetApp deduplication for FAS is really part of a suite of deduplication technologies offered by
NetApp, it is the sole focus of the remainder of this paper.

2.2 Dense Volumes


Despite the introduction of less expensive ATA disk drives, one of the biggest challenges for disk-
based backup today continues to be the storage cost. There is a desire to reduce storage
consumption (and therefore storage cost per megabyte) by eliminating duplicated data through
sharing across files.
The core NetApp technology to accomplish this goal is the dense volume, a flexible volume that
contains shared data blocks. The NetApp Data ONTAP file system, WAFL, is a file system structure
that supports shared blocks in order to optimize storage space consumption. Basically, within one file
system tree there is the ability to have multiple references to the same data block, as shown in Figure
2.


Figure 2) Dense volumes.

To keep track of the many indirect blocks (“IND” in Figure 2) that are pointing to it, each data block
has a block count reference kept in the volume metadata. As additional indirect blocks point to it or
existing ones stop pointing to it, this value is incremented or decremented accordingly. When no
indirect blocks point to a data block, it is released.
Deduplication uses dense volume technology to allow duplicate blocks anywhere in the flexible
volume to be deleted.
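
The reference counting described above can be illustrated with a small toy model. This is only a sketch of the idea, not Data ONTAP code; the class and method names (DenseVolume, write_block, add_reference, drop_reference) are hypothetical and chosen for illustration.

```python
class DenseVolume:
    """Toy model of a volume whose data blocks can be shared (hypothetical names)."""

    def __init__(self):
        self.blocks = {}     # block id -> block contents
        self.refcount = {}   # block id -> number of indirect blocks pointing at it

    def write_block(self, block_id, data):
        # A newly written data block starts with a single reference.
        self.blocks[block_id] = data
        self.refcount[block_id] = 1

    def add_reference(self, block_id):
        # Another indirect block now points at this data block.
        self.refcount[block_id] += 1

    def drop_reference(self, block_id):
        # An indirect block stopped pointing at this data block.
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            # No indirect blocks point to the data block, so it is released.
            del self.blocks[block_id]
            del self.refcount[block_id]


vol = DenseVolume()
vol.write_block(7, b"shared data")
vol.add_reference(7)               # a second file now shares block 7
vol.drop_reference(7)              # one file goes away; the block must stay
still_present = 7 in vol.blocks
vol.drop_reference(7)              # the last reference goes away
released = 7 not in vol.blocks
```

In this model a shared block is released only when its count drops to zero, which is what allows duplicate blocks to be removed without affecting the files that still refer to the surviving copy.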

2.3 Deduplication Features and Functions


Deduplication provides block-level deduplication within the entire flexible volume on NetApp
NearStore® storage systems. Figure 3 shows how this works at the highest level.

Figure 3) How NetApp deduplication for FAS works.


Essentially, deduplication only stores unique blocks in the flexible volume and creates a small amount
of additional metadata in the process. Notable features of NetApp deduplication for FAS include:
• Works with a high degree of granularity, at the block level.
• Operates on the active file system of the flexible volume. Snapshot copies created after
running deduplication enjoy the same storage savings benefits.
• Is a background process that can be configured to run automatically, on a schedule, or
manually through the command-line interface.
• Is application transparent and therefore can be used for deduplication of data originating from
anywhere in the data center.
• Is enabled and managed using a simple command-line interface.
• Can be enabled on flexible volumes that already contain data, and can deduplicate that
existing data.

The remainder of this document goes into great detail on the operation of deduplication, but in
general the following occurs:
Newly saved data on the NearStore is stored in blocks as usual by Data ONTAP. Each block
of data has a digital fingerprint, which is compared to all other fingerprints in the flexible
volume. If two fingerprints are found to be the same, a byte-for-byte comparison is done of all
bytes in the block, and, if there is an exact match between the new block and the existing
block on the flexible volume, the duplicate block is discarded and its disk space is reclaimed.
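
The fingerprint-then-verify flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Data ONTAP implementation: SHA-256 stands in for the fingerprint (the report does not name the algorithm), the 4096-byte block size is illustrative, the function name deduplicate is hypothetical, and the sketch processes blocks as they arrive rather than as a background process.

```python
import hashlib


def deduplicate(blocks):
    """Return (unique_blocks, pointers) for a list of byte blocks.

    pointers[i] is the index in unique_blocks that logical block i refers to.
    """
    by_fingerprint = {}   # fingerprint -> indexes into unique_blocks
    unique_blocks = []
    pointers = []
    for data in blocks:
        # Assumption: SHA-256 is used here only as a stand-in fingerprint.
        fingerprint = hashlib.sha256(data).digest()
        match = None
        for idx in by_fingerprint.get(fingerprint, []):
            # A fingerprint match is only a candidate; confirm byte for byte.
            if unique_blocks[idx] == data:
                match = idx
                break
        if match is None:
            # No verified duplicate exists, so keep this block as a unique one.
            unique_blocks.append(data)
            match = len(unique_blocks) - 1
            by_fingerprint.setdefault(fingerprint, []).append(match)
        pointers.append(match)
    return unique_blocks, pointers


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
unique_blocks, pointers = deduplicate(blocks)
```

Here four logical blocks are stored as two unique blocks. The byte-for-byte check is what makes the result safe even if two different blocks were ever to produce the same fingerprint.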

2.3.1 General Deduplication Operational Considerations


Deduplication is enabled on a per-flexible-volume basis and can be enabled on any number of
flexible volumes.
It can be run in one of three ways:
• Scheduled on specific days and at specific times
• Manually via the command line
• Automatically, when 20% new data has been written to the volume
Only one deduplication process runs on a flexible volume at a time.
Up to eight deduplication processes can run concurrently on the same NetApp storage array.
Deduplication is supported in an active/active clustered failover configuration. For clarifying details,
see the “Deduplication and Active/Active Configuration” section.
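
The two concurrency limits above (one process per flexible volume, eight per storage system) can be modeled with a short sketch. The function name schedule is hypothetical, and treating requests beyond the limit as a simple pending list is an assumption made for illustration, loosely based on the Pending status that sis status reports; it is not a description of the actual scheduler.

```python
MAX_CONCURRENT = 8   # from the text: up to eight concurrent deduplication processes


def schedule(requests, max_concurrent=MAX_CONCURRENT):
    """Split requested volumes into (active, pending) lists.

    A volume that is already active or pending is not queued again, because
    only one deduplication process runs on a flexible volume at a time.
    """
    active, pending = [], []
    for vol in requests:
        if vol in active or vol in pending:
            continue
        if len(active) < max_concurrent:
            active.append(vol)
        else:
            # Assumption: extra requests wait until a running process finishes.
            pending.append(vol)
    return active, pending


# Ten distinct volumes plus one repeated request for /vol/dvol_1.
volumes = [f"/vol/dvol_{i}" for i in range(1, 11)] + ["/vol/dvol_1"]
active, pending = schedule(volumes)
```

With ten distinct volumes requested, eight become active and the remaining two wait; the repeated request is ignored.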


3 Configuration and Operation


This section discusses what is required to install deduplication, how to configure it, and various
aspects of managing it. Although it covers some basics, in general it assumes both that the
NetApp storage system is already installed and running and that the reader is familiar with
basic NetApp Data ONTAP administration.

3.1 Requirements Overview


Table 1 specifies the requirements for deduplication.

Table 1) Deduplication requirements overview.


Hardware              NearStore R200
                      FAS2020, FAS2050
                      FAS3020, FAS3040, FAS3050, FAS3070
                      FAS6030, FAS6040, FAS6070, FAS6080
                      IBM: N5200, N5300, N5500, N5600, N7600, N7800
Data ONTAP            Data ONTAP 7.2.4 or later
Software              nearstore_option license (for all platforms except R200)
                      a_sis license
Maximum Flexible      FAS6070, FAS6080, N7800: 16TB
Volume Size           FAS6030, FAS6040, N7600: 10TB
                      FAS3070, N5600: 6TB
                      NearStore R200: 4TB
                      FAS3040, N5300: 3TB
                      FAS3050, N5500: 2TB
                      FAS3020, N5200: 1TB
                      FAS2050: 1TB
                      FAS2020: 0.5TB
Protocols             All file-based and block-based protocols supported by Data ONTAP
Applications          Refer to the “Deduplication Target Environment” section
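
When planning volumes, the maximum sizes in Table 1 can be kept in a small lookup. The sketch below simply transcribes the table; the names MAX_DEDUP_VOLUME_TB and fits_dedup_limit are hypothetical, and the values apply only to the Data ONTAP release this report covers.

```python
# Maximum flexible volume size for deduplication, in TB, transcribed from Table 1.
MAX_DEDUP_VOLUME_TB = {
    "FAS6070": 16, "FAS6080": 16, "N7800": 16,
    "FAS6030": 10, "FAS6040": 10, "N7600": 10,
    "FAS3070": 6, "N5600": 6,
    "R200": 4,
    "FAS3040": 3, "N5300": 3,
    "FAS3050": 2, "N5500": 2,
    "FAS3020": 1, "N5200": 1,
    "FAS2050": 1,
    "FAS2020": 0.5,
}


def fits_dedup_limit(platform, volume_size_tb):
    """Return True if a flexible volume of this size is within the Table 1 limit."""
    return volume_size_tb <= MAX_DEDUP_VOLUME_TB[platform]


ok_on_r200 = fits_dedup_limit("R200", 0.2)      # a volume of roughly 200GB
too_big_on_2020 = fits_dedup_limit("FAS2020", 1)
```

A platform that is not in the table raises KeyError, which is a deliberate choice here: an unlisted platform should be checked against the requirements rather than silently accepted.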

3.2 Installing and Licensing Deduplication


Deduplication is included in Data ONTAP and only needs to be licensed. Add the deduplication
license using the following command, where <a_sis> stands for the a_sis license code:
license add <a_sis>
To run deduplication on any of the FAS platforms, you also need to add the nearstore_option
license, where <nearstore_option> stands for its license code:
license add <nearstore_option>


3.2.1 Deduplication Licensing in a Clustered Environment


Deduplication is a licensed option behind the NearStore option license. Hence, in a clustered
environment, both nodes must have the NearStore option and deduplication licensed.

3.3 Command Summary


Table 2 provides a description of all deduplication (related) commands. Cells that are shaded indicate
those commands that are only available via “priv set diag”.

Table 2) Deduplication command summary.


Command                       Summary
sis on <vol>                  Activates deduplication on the flexible volume specified.
sis start -s <vol>            Begins deduplication process on the flexible volume specified.
                              Using the -s option tells the deduplication operation to scan
                              the flexible volume specified and process existing data.
                              This option should only be used upon initial configuration
                              and deduplication on a flexible volume.
sis start <vol>               Begins deduplication process on the flexible volume specified.
sis status [-l] <vol>         Returns current status of deduplication for the specified
                              flexible volume.
                              The -l option causes a long listing to be displayed.
df -s <vol>                   Returns the value of deduplication space savings in the
                              active file system for the specified flexible volume.
sis config [-s sched] <vol>   Creates an automated deduplication sched(ule).
                              The syntax follows the SnapVault syntax model.
                              When deduplication is first enabled on a flexible volume, a
                              default schedule is configured, running it each day of the
                              week at midnight.
sis stop <vol>                Suspends the deduplication process (if one is running) on
                              the flexible volume specified.
sis off <vol>                 Deactivates deduplication on the flexible volume specified.
                              This means there will be no more change logging or
                              deduplication operations, but the flexible volume will
                              remain a dense volume, and the storage savings will be kept.
                              If this command is used, and then deduplication is turned
                              back on for this flexible volume, the flexible volume will
                              need to be rescanned with the “sis start -s” command.
sis check <vol>               Verifies and updates the fingerprint database for the
                              specified flexible volume and includes purging stale
                              fingerprints.
sis stat <vol>                Displays the statistics of flexible volumes that have
                              deduplication enabled.
sis undo <vol>                Converts a deduplication-enabled flexible volume to a
                              normal flexible volume.

3.4 Deduplication Quick Start Guide


This section provides a quick run-through of the steps to configure and manage deduplication.

Table 3) Deduplication quick overview.


Step                        New Flexible Volume       Flexible Volume with Existing Data
Flexible Volume             Create flexible volume.
Configuration
Enable Deduplication on     sis on <vol>
Flexible Volume
Initial Scan                Not applicable.           Scan/deduplicate the existing data.
                                                      sis start -s <vol>
Create, Modify, Delete      Delete or modify the default deduplication schedule that was
Schedules (if not doing     configured when deduplication was first enabled on the flexible
manually)                   volume or create the desired schedule.
                            sis config [-s sched] <vol>
Manually Run                sis start <vol>
Deduplication (if not
using schedules)
Monitor Status of           sis status <vol>
Deduplication
Monitor Space Savings       df -s <vol>
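
The order of operations in Table 3 can be written down as a small helper that builds the command list for one flexible volume. This is only a sketch: the function name quick_start_commands and its parameters are hypothetical, and the helper merely assembles the command strings shown in the table; it does not connect to or run anything on a storage system.

```python
def quick_start_commands(vol, has_existing_data, schedule=None):
    """Return the Table 3 commands, in order, for one flexible volume."""
    commands = [f"sis on {vol}"]
    if has_existing_data:
        # Initial scan, needed only when the volume already holds data.
        commands.append(f"sis start -s {vol}")
    if schedule is not None:
        # Replace the default schedule with the desired one.
        commands.append(f"sis config -s {schedule} {vol}")
    else:
        # No schedule given, so deduplication is run manually.
        commands.append(f"sis start {vol}")
    commands.append(f"sis status {vol}")   # monitor status
    commands.append(f"df -s {vol}")        # monitor space savings
    return commands


new_volume = quick_start_commands("/vol/VolPST", has_existing_data=False)
existing_volume = quick_start_commands("/vol/dvol_2", True, schedule="23@sun-fri")
```

For the new volume this yields sis on, sis start, sis status, and df -s; for the volume with existing data it adds the sis start -s scan and sets a schedule instead of a manual run.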

3.5 Monitoring Deduplication Status


This section describes the meaning of various status messages about deduplication. The “sis
status” command is the primary command used to report on the status of deduplication for a
specific flexible volume or all the flexible volumes.


Below, from the sis man page, you see the various State, Status, and Progress messages that can
be returned when running sis status. Note that if you don’t provide a flexible volume name, the
status for all flexible volumes that have deduplication enabled will be displayed.
toaster> sis status
Path          State     Status   Progress
/vol/dvol_1   Enabled   Idle     Idle for 10:45:23
/vol/dvol_2   Enabled   Pending  Idle for 15:23:41
/vol/dvol_3   Disabled  Idle     Idle for 37:12:34
/vol/dvol_4   Enabled   Active   25 GB Scanned
/vol/dvol_5   Enabled   Active   25 MB Searched
/vol/dvol_6   Enabled   Active   40 MB (20%) Done
/vol/dvol_7   Enabled   Active   30 MB Verified
/vol/dvol_8   Enabled   Active   10% Merged

And following is a textual description of the meaning for each flexible volume:
• dvol_1 is Idle. The last deduplication operation on the flexible volume finished 10:45:23
ago.
• dvol_2 is Pending because of a resource limitation. The deduplication operation on the
flexible volume will become Active when the resource is available.
• dvol_3 is Idle because the deduplication operation is disabled on the flexible volume.
• dvol_4 is Active. The deduplication operation is scanning the whole flexible volume
(initiated with “sis start -s”). So far, it has scanned 25GB of data.
• dvol_5 is Active. The operation is searching for duplicate data, and 25MB of data has already
been searched.
• dvol_6 is also Active. The operation has saved 40MB of data. This is 20% of the total
duplicate data found in the searching stage.
• dvol_7 is Active. It is verifying the metadata of processed data blocks. This process will
remove unused metadata.
• dvol_8 is Active. Verified metadata are being merged. This process merges all verified
metadata of processed data blocks into an internal format that supports fast sis
operation.
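
When the status listing is captured as text (for example, in a log), the four columns can be separated with a few lines of Python. The sketch below assumes the whitespace-separated layout shown in the sample output above; the function name parse_sis_status is hypothetical, and the sample is a subset of that listing.

```python
SAMPLE = """\
Path State Status Progress
/vol/dvol_1 Enabled Idle Idle for 10:45:23
/vol/dvol_2 Enabled Pending Idle for 15:23:41
/vol/dvol_4 Enabled Active 25 GB Scanned
/vol/dvol_6 Enabled Active 40 MB (20%) Done
"""


def parse_sis_status(text):
    """Parse captured `sis status` text into a list of dicts, one per volume."""
    rows = []
    for line in text.splitlines():
        # Split into path, state, status, and the free-form progress text.
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].startswith("/vol/"):
            continue   # skip the header line, prompts, and blank lines
        path, state, status, progress = fields
        rows.append({"path": path, "state": state,
                     "status": status, "progress": progress})
    return rows


rows = parse_sis_status(SAMPLE)
active = [row["path"] for row in rows if row["status"] == "Active"]
```

Limiting the split to three separators keeps multiword progress text such as “40 MB (20%) Done” intact in a single field.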

The general flow of the phases deduplication goes through and the correlating sis status
messages when actively running on a flexible volume are shown in Figure 4.


Figure 4) Deduplication status messages and their correlation to deduplication phases.

For additional information, the -l option will display detailed status, as shown below.
toaster> sis status -l /vol/dvol_6
Path: /vol/dvol_6
State: Enabled
Status: Active
Progress: 41020 KB (20%) Done
Type: Regular
Schedule: sun-sat@0
Last Operation Begin: Thu Mar 24 13:30:00 PST 2005
Last Operation End: Fri Mar 25 00:34:16 PST 2005
Last Operation Size: 4732932 KB
Last Operation Error: -
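
The long listing is a series of “name: value” lines, so captured output can be turned into a dictionary by splitting each line at its first colon. This is a sketch that assumes the layout shown above; the function name parse_long_status is hypothetical.

```python
SAMPLE_LONG = """\
Path: /vol/dvol_6
State: Enabled
Status: Active
Progress: 41020 KB (20%) Done
Type: Regular
Schedule: sun-sat@0
Last Operation Begin: Thu Mar 24 13:30:00 PST 2005
Last Operation End: Fri Mar 25 00:34:16 PST 2005
Last Operation Size: 4732932 KB
Last Operation Error: -
"""


def parse_long_status(text):
    """Parse captured `sis status -l` text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


info = parse_long_status(SAMPLE_LONG)
schedule = info["Schedule"]
last_begin = info["Last Operation Begin"]
```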

3.6 End-to-End Deduplication Configuration Example


This section steps through the entire typical process of creating a flexible volume and configuring,
running, and monitoring deduplication on it. (Note that steps are spelled out in detail, so it appears a
lot lengthier than it would be in the real world.)
In this example we want a place to archive a number of large PST files various users have created
and are maintaining. The destination NetApp storage system is called r200-rtp01, and it is assumed
that deduplication has been licensed on this machine. As NetApp storage arrays are multiprotocol
boxes, in this example we’ll actually be using a UNIX server to copy the PST data.


1. Begin by creating a flexible volume (keeping in mind the maximum allowable volume size for the
platform, as specified in the requirements table at the beginning of this section).
r200-rtp01*> vol create VolPST aggr0 200g
Creation of volume 'VolPST' with size 200g on containing aggregate
'aggr0' has completed.

2. Now, as a best practice, we’ll disable scheduled Snapshot copies. An alternative to what’s shown
below would be to use the command “snap sched VolPST 0 0 0”.
r200-rtp01*> vol status VolPST
Volume   State    Status          Options
VolPST   online   raid_dp, flex
Containing aggregate: 'aggr0'
r200-rtp01*> vol options VolPST nosnap true
r200-rtp01*> vol status VolPST
Volume   State    Status          Options
VolPST   online   raid_dp, flex   nosnap=on
Containing aggregate: 'aggr0'
3. Now we’ll enable deduplication on the flexible volume and verify that it’s turned on. The vol
status command will show a sis attribute for flexible volumes that have deduplication turned
on. (It can be a bit confusing, since sis is also indicated for those flexible volumes that have
been written to by SnapVault for NetBackup.)

Note that there needs to be space available in the flexible volume for the sis on command to
complete successfully. That is, if the sis on command were attempted on a flexible volume that
already had data and was completely full, it would fail (since there is no room to create the
required metadata).

Note that after turning deduplication on, Data ONTAP lets you know that if this were an existing
flexible volume that already contained data prior to deduplication being enabled, you would want
to run sis start -s; in this example it’s a brand-new flexible volume, so that’s not necessary.

r200-rtp01*> sis on /vol/VolPST
SIS for "/vol/VolPST" is enabled.
Already existing data could be processed by running "sis start -s /vol/VolPST".
r200-rtp01*> vol status VolPST
Volume   State    Status          Options
VolPST   online   raid_dp, flex   nosnap=on
                  sis
Containing aggregate: 'aggr0'


4. Another way to verify that deduplication is enabled on the flexible volume is to just check the
output from running sis status on the flexible volume.
r200-rtp01*> sis status /vol/VolPST
Path          State    Status  Progress
/vol/VolPST   Enabled  Idle    Idle for 00:00:20

5. Next we’ll turn off the default deduplication schedule. Since in this example the administrators will
be moving large quantities of PST files in as time permits, we’ll want to let them run deduplication
manually at opportune times.
r200-rtp01*> sis config /vol/VolPST
Path          Schedule
/vol/VolPST   sun-sat@0
r200-rtp01*> sis config -s - /vol/VolPST
r200-rtp01*> sis config /vol/VolPST
Path          Schedule
/vol/VolPST   -

At this point, in our example, the administrator NFS-mounted the flexible volume to /testPSTs on a
Solaris™ host, sunv240-rtp01, and copied lots of PST files from their users’ directories into our
new PST archive directory flexible volume. The result from the host perspective is shown below.
(Obviously the same sort of thing could be accomplished by mapping a CIFS share to a Windows
host.)
root@sunv240-rtp01 # pwd
/testPSTs
root@sunv240-rtp01 # df -k .
Filesystem               kbytes     used      avail      capacity  Mounted on
r200-rtp01:/vol/VolPST
                         167772160  33388384  134383776  20%       /testPSTs
The example continues with examining the flexible volume, running deduplication, and monitoring the
status.

6. Use df -s to examine the storage consumed and the space savings provided. Note that no space
savings have been achieved by simply copying data to the flexible volume even though
deduplication is turned on. What has happened is that all the blocks that have been written to this
flexible volume since deduplication was turned on have had their fingerprints written to the
change log file.
r200-rtp01*> df -s /vol/VolPST
Filesystem     used      saved  %saved
/vol/VolPST/   33388384  0      0%

7. Start deduplication running on the flexible volume. This causes the change log to be processed,
fingerprints to be sorted and merged, and duplicate blocks to be found.
r200-rtp01*> sis start /vol/VolPST
The SIS operation for "/vol/VolPST" is started.

8. Use sis status to monitor the progress of deduplication.
r200-rtp01*> sis status /vol/VolPST
Path          State    Status  Progress
/vol/VolPST   Enabled  Active  9211 MB Searched

r200-rtp01*> sis status /vol/VolPST
Path          State    Status  Progress
/vol/VolPST   Enabled  Active  11 MB (0%) Done

r200-rtp01*> sis status /vol/VolPST
Path          State    Status  Progress
/vol/VolPST   Enabled  Active  1692 MB (14%) Done

r200-rtp01*> sis status /vol/VolPST
Path          State    Status  Progress
/vol/VolPST   Enabled  Active  10 GB (90%) Done

r200-rtp01*> sis status /vol/VolPST
Path          State    Status  Progress
/vol/VolPST   Enabled  Active  11 GB (99%) Done

r200-rtp01*> sis status /vol/VolPST
Path          State    Status  Progress
/vol/VolPST   Enabled  Idle    Idle for 00:00:07

9. Once sis status indicates the flexible volume is once again in the Idle state, deduplication has
finished running, and we can now check the space savings it provided in the flexible volume.
r200-rtp01*> df -s /vol/VolPST
Filesystem     used      saved    %saved
/vol/VolPST/   24072140  9316052  28%

That’s all there is to it.
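
As a quick check of the numbers in this example, the %saved value reported by df -s is consistent with dividing the saved space by the sum of used and saved space. The sketch below works under that assumption, which is inferred from the sample output rather than taken from a formula stated in this report; the function name percent_saved is hypothetical.

```python
def percent_saved(used_kb, saved_kb):
    """Return saved space as a whole-number percentage of used + saved.

    Assumption: this matches how the %saved column is derived; it is
    inferred from the sample output, not from a documented formula.
    """
    return round(100 * saved_kb / (used_kb + saved_kb))


before = percent_saved(33388384, 0)        # right after copying the PST files
after = percent_saved(24072140, 9316052)   # after deduplication has finished
```

With the figures from step 9, 9316052 KB saved out of 33388192 KB (used plus saved) is about 27.9%, which rounds to the 28% shown.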


3.7 Configuring Deduplication Schedules


This section provides some specifics about configuring schedules with deduplication.
The sis config command is used to configure and view deduplication schedules for flexible
volumes. Usage syntax is shown below.
r200-rtp01*> sis help config
sis config [ [ -s schedule ] <path> | <path> ... ]
- Sets up, modifies, and retrieves the schedule of SIS
volumes.

Run with no arguments, sis config will return the schedules for all flexible volumes that have
deduplication enabled. The example below shows the four different formats the reported schedules
can have.
toaster> sis config
Path          Schedule
/vol/dvol_1   -
/vol/dvol_2   23@sun-fri
/vol/dvol_3   auto
/vol/dvol_4   sat@6

The meaning of each of these schedule types is as follows.


• On flexible volume dvol_1 deduplication is not scheduled to run.
• On flexible volume dvol_2 deduplication is scheduled to run every day from Sunday to Friday
at 11 p.m.
• On flexible volume dvol_3 deduplication is set to auto schedule. This means deduplication
will be triggered by the amount of new data written to the flexible volume, specifically when
there are 20% new fingerprints in the change log.
• On flexible volume dvol_4 deduplication is scheduled to run at 6 a.m. on Saturday.

When the -s option is specified, the command will set up or modify the schedule on the specified
flexible volume. The schedule parameter can be specified in one of four ways:
[day_list][@hour_list]
[hour_list][@day_list]
-
auto

The day_list specifies which days of the week deduplication should run. It is a comma-separated
list of the first three letters of the day: sun, mon, tue, wed, thu, fri, sat. The names are not case
sensitive. Day ranges such as mon-fri can also be given. The default day_list is sun-sat.
The hour_list specifies which hours of the day deduplication should run on each scheduled day.
The hour_list is a comma-separated list of the integers from 0 to 23. Hour ranges such as 8-17
are allowed.


Step values can be used in conjunction with ranges. For example, 0-23/2 means "every two hours."
The default hour_list is 0 (that is, midnight on the morning of each scheduled day).
If "-" is specified, there won't be a scheduled deduplication operation on the flexible volume.
The “auto” schedule causes deduplication to run on that flexible volume whenever there are 20% new
fingerprints in the change log. This check is done in a background process and occurs every minute.
When deduplication is enabled on a flexible volume the first time, an initial schedule is assigned to
the flexible volume. This initial schedule is sun-sat@0, which means "once every day at midnight."
To configure the schedules shown earlier in this section, the following commands would be issued:
toaster> sis config -s - /vol/dvol_1
toaster> sis config -s 23@sun-fri /vol/dvol_2
toaster> sis config -s auto /vol/dvol_3
toaster> sis config -s sat@6 /vol/dvol_4
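
The schedule grammar described in this section can be illustrated with a small parser. This is a sketch under stated assumptions, not the logic Data ONTAP uses: the function names are hypothetical, a list is treated as a day_list if it starts with a letter and as an hour_list otherwise, step values are accepted on any range, and descending ranges such as fri-mon are rejected because this guide does not define them.

```python
DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]


def _day(token):
    """Map a three-letter day name (case-insensitive) to its index."""
    return DAYS.index(token.lower())  # raises ValueError for unknown names


def _hour(token):
    """Convert an hour token to an integer in the documented 0-23 range."""
    value = int(token)
    if not 0 <= value <= 23:
        raise ValueError(f"hour out of range: {token}")
    return value


def _expand(item, convert):
    """Expand 'mon', 'mon-fri', '8-17', or '0-23/2' into a list of indexes."""
    body, _, step = item.partition("/")
    first, _, last = body.partition("-")
    low = convert(first)
    high = convert(last) if last else low
    if low > high:
        raise ValueError(f"descending range not handled in this sketch: {item}")
    return list(range(low, high + 1, int(step) if step else 1))


def parse_schedule(text):
    """Return None for '-', 'auto' for auto, else (day_names, hours)."""
    if text == "-":
        return None
    if text == "auto":
        return "auto"
    days, hours = list(range(7)), [0]  # documented defaults: sun-sat and 0
    for part in text.split("@"):
        if not part:
            continue
        items = part.split(",")
        if part[0].isalpha():
            days = sorted({d for item in items for d in _expand(item, _day)})
        else:
            hours = sorted({h for item in items for h in _expand(item, _hour)})
    return [DAYS[d] for d in days], hours


for example in ["-", "23@sun-fri", "auto", "sat@6", "mon-fri@0-23/2"]:
    print(example, "->", parse_schedule(example))
```

The sketch performs no validation beyond what is shown; for instance, it does not reject a schedule that contains two day lists.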


4 Operating Characteristics
This section discusses where deduplication makes sense and the behavior that you can expect.

4.1 Deduplication Target Environment


This section discusses where deduplication is a good fit.
Deduplication supports flexible volumes that have data written to them using CIFS or NFS, or as
LUNs accessed using FCP/iSCSI. In other words, deduplication works regardless of how the data was
written to the NetApp storage system.
Deduplication was initially targeted to data retention/archival environments in its first release (Data
ONTAP 7.2L1), focusing on archives of file data: for example, home directories, engineering
development, Microsoft® Office, e-mail archive, SharePoint, technical and general publications, and
so on.
Substantial benefit can be achieved in some tier 2 primary storage environments as well. Home
directory and VMware environments are typically especially well suited.
Deduplication is supported in disaster recovery configurations where SnapMirror® is used; see the
“Replication and SnapMirror” section for specific details.

4.2 Deduplication Performance


Deduplication is tightly integrated with Data ONTAP and the WAFL file structure. Because of this,
deduplication is performed with extreme efficiency. Complex hashing algorithms and look-up tables
are not required. Instead, deduplication is able to leverage the internal characteristics of Data ONTAP
to create and compare digital fingerprints, redirect data pointers, and free up redundant data areas,
all with a minimal amount of performance impact.

4.3 Deduplication Storage Savings


While deduplication can deduplicate any blocks in a flexible volume of the NetApp storage system,
the storage savings achieved can vary based on the data set.
A single deduplication run on a single data set can yield storage savings anywhere from 10% to
90%, with 30% to 50% being typical.
In cases where customers back up or archive the same data repeatedly, the realized storage
savings improve over time, reaching 20:1 (95%) and higher.
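
The ratio and percentage figures quoted above express the same quantity in two ways. The following minimal sketch shows the conversion, assuming that an N:1 ratio means N units of logical data are stored in 1 unit of physical space. The function names are hypothetical.

```python
def ratio_to_percent(ratio: float) -> float:
    """Convert an N:1 deduplication ratio to percent space saved."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 100.0 * (ratio - 1) / ratio


def percent_to_ratio(percent: float) -> float:
    """Convert percent space saved to the N in an N:1 ratio."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 100.0 / (100.0 - percent)


print(ratio_to_percent(20))   # 20:1, as quoted for repeated backups
print(percent_to_ratio(50))   # upper end of the typical single-run range
```

By this conversion a 20:1 ratio corresponds to 95% savings, and 50% savings corresponds to a 2:1 ratio.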

4.4 Additional Deduplication Considerations


This section provides some discussion on other deduplication-related topics. Some of this information
may be covered elsewhere, but it bears reiterating here.
First, refer to the deduplication requirements table (Table 1) in the beginning of section 3 for specific
supported hardware and software and necessary licenses.


4.4.1 Number of Deduplication Processes


A maximum of eight deduplication processes can be run at the same time on the same NearStore
device.
• If another flexible volume is scheduled to have deduplication run while eight deduplication
  processes are already running, deduplication for this additional flexible volume will be
  queued. For example, say a user sets a default schedule (sun-sat@0) for 10 deduplication
  volumes. Eight will run at midnight, and the remaining two will be queued.
    ◦ As soon as one of the eight current deduplication processes completes, one of the
      queued ones will start, and when another deduplication process completes, the
      second queued one will start.
    ◦ Next time deduplication is scheduled to run on these same 10 flexible volumes, a
      round-robin paradigm will be used so the same ones aren’t always the first ones run.
• For manually triggered deduplication runs, if eight deduplication processes are already
  running when a command is issued to start another one, the request will fail, and the
  operation will not be queued.
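
The queueing behavior described above can be modeled in a few lines. This is an illustrative sketch only: the class and method names are hypothetical, queued volumes are assumed to start in first-in, first-out order, and the round-robin rotation between scheduled runs is not modeled.

```python
from collections import deque

MAX_ACTIVE = 8  # concurrent deduplication limit stated in this section


class DedupScheduler:
    """Toy model of scheduled versus manual deduplication requests."""

    def __init__(self):
        self.active = set()    # volumes with deduplication running
        self.queued = deque()  # scheduled volumes waiting for a free slot

    def start_scheduled(self, volume):
        """Scheduled runs are queued when all slots are busy."""
        if len(self.active) < MAX_ACTIVE:
            self.active.add(volume)
            return "running"
        self.queued.append(volume)
        return "queued"

    def start_manual(self, volume):
        """Manual runs fail, and are not queued, when all slots are busy."""
        if len(self.active) < MAX_ACTIVE:
            self.active.add(volume)
            return "running"
        return "failed"

    def complete(self, volume):
        """Free a slot and start the next queued volume, if any (assumed FIFO)."""
        self.active.remove(volume)
        if self.queued:
            self.active.add(self.queued.popleft())


scheduler = DedupScheduler()
states = [scheduler.start_scheduled(f"/vol/dvol_{i}") for i in range(1, 11)]
print(states.count("running"), "running,", states.count("queued"), "queued")
print("manual request:", scheduler.start_manual("/vol/extra"))
```

With 10 volumes on the default schedule, the model shows eight running and two queued, and a manual request issued at that point fails.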

4.4.2 Deduplication and Active/Active Configuration


Deduplication is supported with NetApp cluster services. Upon failover to the partner node, the
following behavior applies:
• Writes to the flexible volume will have fingerprints written to the change log.
• No sis administration operations or deduplication will function.
• Upon failback, normal deduplication operations can continue and the updated change log
  is processed.
Deduplication is a licensed option behind the NearStore option license. Our best practice
recommendation is to have both nodes in an active/active configuration licensed with the NearStore
option and deduplication.

4.4.3 Deduplication and Space Savings on Existing Data


A major benefit of deduplication is that it can be used to deduplicate existing data on previously used
flexible volumes (after upgrading to Data ONTAP 7.4). It is completely realistic to assume that
Snapshot copies may exist. What happens when you run deduplication in this case?
When you first run deduplication on this flexible volume, the storage savings will probably be rather
small or even nonexistent because existing Snapshot copies are not deduplicated.
• Previous Snapshot copies will expire, and as they do some small savings will be realized, but
  they too will probably be pretty low.
• During this period of old Snapshot copies expiring, it is fair to assume new data is being
  created on the flexible volume and Snapshot copies being created.
• Thus the storage savings may stay rather flat (that is, very low).
• When the last Snapshot copy that was created before deduplication was run is deleted, the
  storage savings should increase noticeably.


4.4.4 Deduplication Best Practices


This section contains general rules of thumb that might not have been covered elsewhere in this
document.

• If there is very little new data, run deduplication infrequently, because it doesn't make sense
  to unnecessarily consume CPU resources. How often you run it will depend on the change
  rate of the data in the flexible volume.
• The best options are:
    ◦ Use the auto mode so that deduplication only runs when significant additional data
      has been written to each particular flexible volume (this will tend to naturally spread
      out when deduplication runs).
    ◦ Stagger deduplication schedules for the flexible volumes so that they run on alternate
      days.
    ◦ Run deduplication manually.
• Run deduplication before creating Snapshot copies, as this will ensure no undeduplicated
  data gets locked in Snapshot copies. If a Snapshot copy is created on a flexible volume
  before deduplication has a chance to run or complete on that flexible volume, this could
  result in lower space savings.
• The Snapshot reserve should be greater than 0 if Snapshot copies are to be used. (An
  exception to this might be in a SAN environment, where often it is set to zero for thin
  provisioning of LUNs.)
• There must be some free space in the flexible volume to allow deduplication to operate and
  create the metadata it requires. As necessary, flexible volumes can be resized, with no
  impact to data access, to accommodate this.
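
One way to stagger schedules is to alternate two day lists across the flexible volumes. The sketch below generates the corresponding sis config commands using the schedule syntax from section 3.7. The function name, the particular split of days, the 11 p.m. start hour, and the volume paths are illustrative assumptions, not recommendations from this guide.

```python
def staggered_commands(volumes, hour=23):
    """Return one 'sis config -s' command per volume, alternating day lists.

    Assumption: splitting the week into two alternating day lists is
    enough staggering; real deployments may need a finer split.
    """
    day_lists = ["sun,tue,thu,sat", "mon,wed,fri"]
    commands = []
    for index, path in enumerate(volumes):
        days = day_lists[index % len(day_lists)]
        commands.append(f"sis config -s {days}@{hour} {path}")
    return commands


# Hypothetical volume paths, reused from the examples in section 3.7.
for command in staggered_commands(["/vol/dvol_1", "/vol/dvol_2", "/vol/dvol_3"]):
    print(command)
```

The generated commands are meant to be reviewed and entered at the storage system console; the sketch does not contact a storage system.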


5 Common Problems and Troubleshooting


This section covers issues that have been known to come up when configuring and running
deduplication.

5.1 Licensing
Make sure deduplication is properly licensed and, if the platform is not an R200, make sure the
NearStore option is also properly licensed:
fas3070-rtp01*> license

a_sis <license>
nearstore_option <license>

5.2 Volume Sizes


Adhere to the deduplication volume size limits presented in the “Requirements Overview” section. If
you exceed them you will not be able to enable deduplication on that volume.
Below is an example of the message displayed if the volume is too large to enable deduplication.
london-fs3> sis on /vol/projects

Volume or maxfiles exceeded max allowed for SIS: /vol/projects

Also note that there needs to be free space available in the flexible volume for the “sis on”
command to complete successfully. If a flexible volume is full, deduplication will not run. However, as
noted earlier, flexible volumes can be resized with no impact to data access to accommodate this.

5.3 Logs and Error Messages


New error log: /etc/log/sis
New error messages:
• Registry errors: Check if vol0 is full.
• Metafile op errors: Check if the deduplication flexible volume is full.
• License errors: Check if license is installed.
• Change log full error: Perform a “sis start” operation that will empty the change log
  metafile when finished.

5.4 Other Issues


Refer to the Data ONTAP 7.4 release notes for complete information.


5.5 Not Seeing Space Savings


If you’ve run deduplication on a flexible volume that you’re confident contains data that should
deduplicate well, yet you are not seeing any space savings, there’s a good chance a number of
Snapshot copies exist and are locking a lot of data. This especially tends to happen when people run
deduplication on existing flexible volumes of data.
Use the “snap list” command to see what Snapshot copies exist and the “snap delete”
command to remove them. Alternatively, wait for the Snapshot copies to expire, and the space
savings will appear.

5.6 Undeduplicating a Flexible Volume


It is possible, and easy, to “undeduplicate” a flexible volume that has deduplication enabled, by
backing out deduplication and turning it back into a “regular” (non-dense) flexible volume. This can be
done while the flexible volume is online and is accomplished as described below.
Turn deduplication off on the flexible volume. (Note that this command stops fingerprints from being
written to the change log as new data is written to the flexible volume. If this command is used, and
then deduplication is turned back on for this flexible volume, the flexible volume will need to be
rescanned with the “sis start -s” command.)
sis off <flexvol>
Use the following command (see footnote 1) to recreate the duplicate blocks in the flexible volume.
sis undo <flexvol>
When this command completes, it will delete the fingerprint file and the change log files.
Below is an example of undeduplicating a flexible volume.
r200-rtp01*> df -s /vol/VolReallyBig2
/vol/VolReallyBig2/ 20568276 3768732 15%
r200-rtp01*> sis status /vol/VolReallyBig2
Path State Status Progress
/vol/VolReallyBig2 Enabled Idle Idle for 11:11:13
r200-rtp01*> sis off /vol/VolReallyBig2
SIS for "/vol/VolReallyBig2" is disabled.
r200-rtp01*> sis status /vol/VolReallyBig2
Path State Status Progress
/vol/VolReallyBig2 Disabled Idle Idle for 11:11:34
r200-rtp01*> sis undo /vol/VolReallyBig2
Wed Feb 7 11:13:15 EST [wafl.scan.start:info]: Starting SIS volume scan on
volume VolReallyBig2.
r200-rtp01*> sis status /vol/VolReallyBig2
Path State Status Progress
/vol/VolReallyBig2 Disabled Undoing 424 MB Processed
r200-rtp01*> sis status /vol/VolReallyBig2
No status entry found.
r200-rtp01*> df -s /vol/VolReallyBig2
Filesystem used saved %saved
/vol/VolReallyBig2/ 24149560 0 0%

Footnote 1: Note that the undo option of the sis command is only available in the diag mode,
accessed using the command “priv set diag”.

Note that if sis undo starts processing and there is not enough space to undeduplicate, it will
stop, report a message about insufficient space, and leave the flexible volume dense. All data
is still accessible, but some block sharing is still occurring. Use “df -s” to understand how much free
space you really have and then either grow the flexible volume or delete data or Snapshot copies to
provide the needed free space.
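
Before running sis undo, a rough space check can help avoid the insufficient-space condition described above. The sketch below assumes that undeduplicating needs about as much additional free space as the saved column of df -s reports. That is an estimate suggested by the example in this section, not a documented requirement, and the function name is hypothetical. Both arguments must be in the same units.

```python
def undo_space_shortfall(saved: int, available: int) -> int:
    """Return the estimated extra free space needed before sis undo.

    Assumption: undoing deduplication consumes roughly the amount shown
    in the 'saved' column of df -s. Returns 0 if 'available' already
    covers that estimate.
    """
    return max(0, saved - available)


# 'saved' is taken from the df -s output for /vol/VolReallyBig2 above;
# the available figure is a made-up value for illustration.
print(undo_space_shortfall(saved=3768732, available=1000000))
```

A nonzero result suggests growing the flexible volume, or deleting data or Snapshot copies, by at least that amount before starting the undo.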

5.7 Additional Reporting with “sis stat -l”


For additional status information, enter diag mode with “priv set diag” and use the “sis stat -l”
command for a long, detailed listing.

5.8 Deduplication and Reboots


If a NetApp storage system is rebooted while deduplication is running, deduplication will be in the
“Idle” state for that flexible volume after the reboot. When the next deduplication processing for
that flexible volume starts, it will clean up any remaining intermediate metadata that was created
by the previous deduplication operation.


6 Deduplication and Replication


Although there are substantial benefits to be achieved with deduplication, a complete solution will
most likely also require mirroring the deduplicated data to another location for disaster recovery
purposes.
Replication of the deduplication-enabled flexible volume is supported using NetApp SnapMirror in two
ways, as discussed in the next two subsections.

6.1 Replicating a Deduplicated Flexible Volume for DR


A deduplicated flexible volume can be replicated to a secondary storage system (destination) using
Volume SnapMirror (VSM) as shown in Figure 5.

Figure 5) VSM of a deduplicated flexible volume for disaster recovery.

Key points in this scenario are:


• The nearstore_option must be licensed on both the source and destination.
• Deduplication must be licensed at the primary location (source).
• Deduplication does not need to be licensed at the destination. However, if there is a situation
  in which the primary site is down and the secondary location becomes the new primary,
  deduplication needs to be licensed for continued deduplication to occur. Thus, the best
  practice is to have deduplication licensed at both locations.
• Deduplication is only enabled, run, and managed from the primary location.
• The flexible volume at the secondary location will “inherit” all the deduplication attributes and
  storage savings through SnapMirror.
• Only unique blocks are transferred, so deduplication reduces network bandwidth usage too.


6.2 Replicating Primary Data to a Deduplicated Flexible Volume

A production primary flexible volume can be replicated to a deduplication-enabled flexible volume on
a secondary storage system using Qtree SnapMirror (QSM), as shown in Figure 6.

Figure 6) QSM of production data to a deduplicated flexible volume.

Key points in this scenario are:


• The nearstore_option must be licensed on the destination.
• Deduplication is only licensed at the secondary location (destination).
• Deduplication is enabled, run, and managed on a flexible volume at the secondary location.
• Deduplication doesn’t yield any network bandwidth savings as QSM works at the logical
  layer.
• Storage savings benefit at the QSM destination is achieved by running deduplication on the
  destination after QSM has finished transferring the data.


© 2008 NetApp, Inc. All rights reserved. Specifications subject to change without notice. NetApp, the NetApp logo, Data ONTAP, FlexClone, FlexVol,
NearStore, SnapMirror, SnapVault, and WAFL are registered trademarks and NetApp and Snapshot are trademarks of NetApp, Inc. in the U.S. and
other countries. Solaris is a trademark of Sun Microsystems, Inc. Windows and Microsoft are registered trademarks of Microsoft Corporation. UNIX is
a registered trademark of The Open Group. NetBackup is a trademark of Symantec Corporation or its affiliates in the U.S. and other countries. All
other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.
