Professional Documents
Culture Documents
Deduplication Option
White Paper: Backup Exec 2010: Deduplication Option
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
What is Deduplication?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Client Deduplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Appliance Deduplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Backup Exec 2010: Deduplication Option
Introduction
Customers of all sizes and needs are seeking new ways to tackle their data protection challenges. While the challenges of
data growth are not new, the pace of growth has become more rapid, the location of data more dispersed, and linkages
between data sets more complex. Data deduplication offers companies the opportunity to dramatically reduce the amount
of storage required for backups and to more efficiently centralize backup data from multiple sites for assured disaster
recovery. Symantec Backup Exec 2010 now includes integrated deduplication capabilities that can help with the task of
managing these data protection challenges. Deduplication capabilities are available when customers purchase the
Deduplication Option for Backup Exec 2010.
The Deduplication Option enables several capabilities that can greatly benefit Administrators looking to control storage
growth. There are three different methods of deduplication that are available with the Deduplication Option. The first is
Client Deduplication, where data is deduplicated at the source and sent to the media server in deduplicated form. The
second is Media Server Deduplication, where data is deduplicated in-line, or as it arrives, at the media server. The third
Deduplication, where a 3rd-party deduplication device (for example, an ExaGrid EX series or a Quantum DXi
is Appliance Deduplication
7500 ) handles all aspects of deduplication. These devices, with manufacturer-supplied Symantec OpenStorage (OST)
plugins can realize enhanced performance and manageability when used in conjunction with Backup Exec. The next
sections will go into detail about each of these deduplication types.
What is Deduplication?
What is deduplication? At the core, deduplication is a process that breaks down files and data into “segments” and uses a
tracking database to ensure the Media Server only stores a single copy of that segment across all client backup data
stored to that media server. For subsequent backups of any client, the tracking database knows what segments have been
protected and only transfers and stores the segments that are new or unique – file segments that are not currently stored
by that Media Server. For example, if five different client systems are backing up data to a Media Server and a file
segment is found that exists on all five of those client systems, only a single copy of the segment will actually be stored by
the Media Server. This tracking database ensures that these segments are kept until any existing disk-based backup no
longer references them. Because only a fraction of the original data is eligible to be stored by the media server, this leads
to significant reduction in disk space needed for backups.
For recovery, the requested file is reconstructed based on the information contained in the tracking database, and then
sent on to the destination.
The deduplication domain is across all backups done with Backup Exec’s deduplication technology. The benefit of this
methodology is that all of the deduplication segment information mentioned above is shared with all other backups
configured to use deduplication for a specific media server. For example, if two Windows 2008 R2 servers are protected
using either Client or Media Server deduplication, only deduplication segments that are unique to either of those servers
will be stored. This helps significantly reduce backup disk utilization across all local and remote servers protected with
deduplication.
1
Backup Exec 2010: Deduplication Option
Regardless of the methodology used for deduplication – Client, Media Server, or Appliance – the end result is the same:
storage is optimized by only storing unique parts of a particular file or data stream, and using some form of database to
associate segments to each other and to the machines where they were backed up from.
With the Backup Exec 2010 Deduplication Option, Administrators have the ability to choose when and where
deduplication takes place. Administrators can mix and match deduplication types to fit their unique needs; for example, a
single media server licensed with the Deduplication Option can simultaneously use Client Deduplication for some jobs,
Media Server deduplication for others, and use Appliance Deduplication for yet another set of jobs.
• Client Deduplication is a software-driven process, where deduplication takes place at the SOURCE of data and
is sent over the network in deduplicated form
• Media Server Deduplication is a software-driven process, where deduplication takes place JUST BEFORE THE
DATA IS STORED TO DISK (also known as Inline Deduplication)
• Appliance Deduplication is a hardware-driven process takes place ON THE DEDUPLICATION APPLIANCE (Can
be In-Line or Post-Process Deduplication, ex. ExaGrid or Quantum)
Each approach has its benefits and will be detailed in turn in the following sections.
Client Deduplication
With Backup Exec 2010, exciting new possibilities for Remote Office protection have been introduced. The concept of
Client Deduplication – where the remote system is responsible for deduplication calculations and where backup data is
sent over the network in its deduplicated form – can make the process of protecting remote offices a much more
streamlined experience. Remote offices can be challenging to protect effectively; WAN environments can be unreliable
and can only utilize a fraction of the bandwidth available to a LAN backup. Backups over the WAN a challenge to set up,
2
Backup Exec 2010: Deduplication Option
much less complete. Other customers have Media Servers which are not as powerful as the application servers they are
protecting – often, the VMware server or the Exchange server in the environment is the most powerful machine available
in terms of processor speed or disk throughput. Where appropriate, why not leverage some of this remote computing
power to achieve faster backups? Both of these situations are problems where Client Deduplication can offer a
comprehensive solution to the data protection challenges brought on by the environment.
Generally, remote offices backup strategies have two basic architectures. First, there are Remote Offices which do not
have local storage, and where backup data is sent directly over the LAN or WAN to the data center for protection. Second,
there are Remote Offices that employ local storage and then “forward” that locally stored backup data to the data center
for protection. Both of these configurations can use Backup Exec 2010’s Deduplication Option to streamline and improve
backup and recovery for Remote Offices.
Client deduplication is the act of deduplicating data at the backup source – that is, on the machine that is being backed
up. Data from the client system is refined into smaller deduplication segments, and only the unique data (i.e. data the
media server doesn’t yet contain) is sent in deduplicated form to the Media Server’s Deduplication Storage Folder. A
Deduplication Storage Folder is special type of Backup to Disk folder where all deduplication segments are stored,
regardless of whether Client or Media Server deduplication was used for a specific backup. With this method, the majority
of the processing necessary for deduplication is done on the remote system rather than as the data arrives at the media
server. Client deduplication is the default deduplication method Symantec recommends for several reasons:
• Client deduplication enables greater scalability: Client deduplication spreads processor usage out across all
clients running backups, enabling the Media Server to process more concurrent backups.
• Client deduplication minimizes network data transfers: Data is deduplicated at the client and sent across
the network in deduplicated form. In this way, only the unique data is sent to the media server, rather than the
entire backup stream. Most environments – be it a LAN or a WAN environment – can benefit from less data
being sent across the network. This is especially useful for WAN or Remote Office protection. Often, the data
sent over the wire is only 1/10th or less the original data size, reducing network traffic and contention for
network resources.
Over time, Client Deduplication can provide overall deduplication rates of 9:1 or higher for file system backups. This
equates to a 90% (or more) saving in terms of disk spaced consumed by backup data. Depending on the type of data
protected (e.g. Office docs vs. Exchange databases, for example) the overall deduplication ratio may change somewhat,
but a good baseline is that 9:1 ratio.
Each Backup Exec Agent for Windows Systems has the built-in capability to do client deduplication, and many of the
Application Agents (Backup Exec Agent for Exchange, Backup Exec Agent for SQL Server, etc.) can also utilize client
deduplication for backup. Note that all deduplication operations require the Deduplication Option to be licensed on the
Media Server. See Table 1 for details.
3
Backup Exec 2010: Deduplication Option
*Note – for Client deduplication with Backup Exec Agent for VMWare (AVVI) and Hyper-V, the Backup Exec Remote Agent
must be installed in the Guest Operating System. In this circumstance, the Remote Agent is responsible for performing
the backup, and not the Agent for VMWare or the Agent for Hyper-V. Symantec recommends that for optimal
deduplication, a Remote Agent also be installed in Hyper-V guests.
Media Server deduplication performs the deduplication processes against data when it arrives at the Media Server – that
is, just before the data is laid down on disk. Data is sent over the network in its whole, un-deduplicated form, and then
decomposed into deduplication segments in-line by the Media Server. Only the unique data (i.e. data the media server
doesn’t yet contain) is written to the Deduplicated Storage Folder. A deduplicated storage folder is special type of backup
to disk folder where all deduplication segments are stored, regardless of whether Client or Media Server deduplication was
used for a specific backup. The Media Server does the bulk of the processing necessary for deduplication. Media Server
deduplication is optimal for situations where:
• The remote system’s processor is fully utilized: If the remote system has no processor cycles to spare for
deduplication calculations, Media Server deduplication can take the load and still perform deduplication.
• The remote system is NetWare, Linux, or Solaris: The Remote Agents for non-Windows platforms do not
have the ability to do client deduplication. These systems can only take advantage of Media Server
deduplication to greatly reduce the amount of storage necessary for backups.
4
Backup Exec 2010: Deduplication Option
• Remote Office Protection over a WAN: With Media Server Deduplication, the Media Server receives the entire
data set before deduplication takes place. This is not a WAN-friendly method of deduplication. Generally,
Remote Office protection without local storage should use Client Deduplication.
Over time, Media Server Deduplication can provide the same level of Deduplication as Client Deduplication, and achieve
deduplication of 9:1 or more for file system backups. This equates to a 90% (or more) saving in terms of disk spaced
consumed by backup data. Depending on the type of data protected (e.g. Office docs vs. Exchange databases, for
example) the overall deduplication ratio may change slightly, but a good baseline is that 9:1 ratio.
Any Backup Exec Media Server that has the Deduplication Option licensed can utilize Media Server deduplication. Most
Agents and backup types supported by Backup Exec can take advantage of the space savings inherent with Media Server
Deduplication.
*Hyper-V backups will experience higher levels of deduplication when a Remote Agent is installed in the virtual Guest
operating system. In this case, the backup is done via the Remote Agent installed in the Virtual Guest. The Agent for
Hyper-V is not involved in Client deduplication. More effective Media-Server based Hyper-V and VHD deduplication will be
delivered in a future Backup Exec release or service pack.
5
Backup Exec 2010: Deduplication Option
and restores when Appliance Deduplication is used. Appliance Deduplication stores all backup data on the specific
appliance.
The Deduplicated Storage Folder contains two distinct items. The first is a file storage location, where the physical
deduplication segments are stored. The second is a PostgreSQL database, where the deduplication segments are tracked
and maintained. By default, the file storage location and the Postgres SQL database are installed to the same location.
While many customers will be able to use these defaults, some customers will need to change them for performance
reasons.
Symantec recommends that customers who use Media Server Deduplication should use separate physical volumes for the
File Storage Location and the Database location. Because the media server is responsible for all aspects of deduplication
with Media Server deduplication, this allows higher scalability for Media Server deduplication jobs by spreading disk read
and writes across multiple physical disks. This same recommendation stands for highly scaled Backup Exec
implementations with 10’s or 100’s of concurrent Client Deduplication backups as well.
Due to periodic weekly database maintenance routines, the Backup Exec Media Server requires double the database size
available on disk. This is because automated database maintenance routines involve making a backup copy of the
database. In the example above, where the deduplication database is 50 GB, the volume holding the deduplication
database needs to be at least 100 GB in size to account for maintenance activities alongside normal operation. For data
that gets manually removed, space reclamation is automatically queued for processing twice a day.
Client Deduplication performs the bulk of deduplication calculation on the client (or source) system. The client
deduplication process will consume as much of one (1) core of one processor as it can on that client system. While the
actual amount of processor utilization will depend on the amount of data to be deduplicated and the speed of the
processor, expect to see at least 75% processor utilization for that processor core for the duration of the Client
Deduplication backup.
Media Server deduplication performs the bulk of the deduplication calculations on the Media Server system. Similar to
client deduplication, the Media Server deduplication process will consume as much of one (1) core of one processor as it
can on that Media Server system. While the actual amount of processor utilization will depend on the amount of data to
be deduplicated and the speed of the processor, expect to see at least 75% utilization for that processor core for the
duration of any Media Server deduplication backup job. For both Client and Media Server deduplication, initial backup
6
Backup Exec 2010: Deduplication Option
jobs will be the slowest. As time goes on, backup speeds increase as more database fingerprints are created within the
database.
For Remote Agents that cannot use Client Deduplication (Remote Agent for Linux and Unix [RALUS], Remote Agent for Mac
Server [RAMS], Remote Agent for NetWare [RANW], etc.) there is no change to system requirements as outlined in the
Backup Exec 2010 Administration Guide. This also holds true for Windows Remote Agents that do not choose to use Client
Deduplication.
Memory requirements for clients using Client deduplication are not very stringent. Symantec requires 1 GB of physical
memory on each individual client that uses Client Deduplication.
Disaster Recovery with Backup Exec Deduplication Option (Client and Media Server
Deduplication)
Backup Exec offers two convenient disaster recovery options for customers who use the Deduplication Option and Client
or Media Server deduplication:
The built-in writer is available in the Backup Exec product at no additional charge. The writer is used to quiesce the
Backup Exec Deduplication Storage Folder and the associated PostgreSQL database so a consistent backup can be taken
to tape or another disk location. This process copies the entire Deduplication Storage Folder and associated databases to
tape in its entirety and not only useful for disaster recovery of the entire Backup Exec Media Server system. At no other
point does deduplicated data get stored on tape. Refer to the Backup Exec Administration guide for additional
configuration details. Symantec recommends periodic protection of the Deduplicated Storage Folder. While many
customers will be able to use the built-in writer for backups to tape, larger customers should examine the Deduplicated
Backup Set Copy process detailed below.
The ability to perform Deduplicated Backup Set Copies is also built into the product and is available at no additional cost.
The data is replicated in a deduplicated format and requires a second Deduplication Option license, along with the Central
Administration Server Option (CASO), to be purchased for the second media server. Backup Exec will be fully aware of the
secondary copy which allows for easy recovery of data in case of disasters. The Deduplicated Backup Set Copy process
allows complete independence from tape if the customer chooses to do so.
7
Backup Exec 2010: Deduplication Option
• 64-bit version of Windows: The Deduplication Option requires that the media server runs a 64-bit version of
Windows.
◦ Windows 2003 x64 (all supported versions including R2), Windows 2008 x64 (all supported versions),
Windows 2008 R2 x64 (all supported versions).
• Physical Memory: The Deduplication Option requires 1 GB of physical system memory per Terabyte of
Deduplicated storage for Client or Media Server deduplication operations, in addition to Backup Exec’s
minimum memory requirements
◦ For example, if 5 TB of deduplicated data is stored on a media server, the media server must have at
least 5.5 GB of physical memory.
◦ Customers who use Appliance Deduplication *only* are exempt from this physical memory
requirement. These customers need to have at least 1 GB of physical memory installed on the Media
Server.
• Processor: The Deduplication Option requires at least one dual-core processor for the media server. Two dual-
core or one or more quad-core processors are recommended.
• Processor: For Client Deduplication, a Windows Remote Agent must have at least one dual-core processor
• Physical Memory: For Client Deduplication, a Windows Remote Agent must have at least 1 GB of physical
system memory.
Note that Remote Agents can be either 64-bit or 32-bit versions of the platforms that Backup Exec supports.
The Media Server has other detailed requirements listed in the Backup Exec 2010 Administration Guide. Be sure to refer
to the Administration Guide before configuring Deduplicated Storage Folders.
Appliance Deduplication
Many Backup Exec customer environments have an existing investment in deduplication appliances for onsite backup,
offsite storage (disaster recovery), and remote office protection. If you already have an investment in these devices, or
plan to do so because of those devices’ scalability, speed, or feature sets, then Appliance Deduplication is an excellent fit
for your environment.
8
Backup Exec 2010: Deduplication Option
Several valued Symantec partners produce deduplication appliances. These devices are purpose-built hardware and
software devices that excel at doing a specific set of tasks in the most efficient and effective manner possible.
Deduplication is one such task that these appliances do very well. Many of these appliances also include a built-in
replication capability, where data can be backed up onsite then replicated in an efficient manner to another appliance in
an offsite location. This replication is another area where appliances can excel. Symantec has taken an embrace and
extend strategy by partnering with hardware deduplication appliance vendors to add value. Symantec’s Appliance
Deduplication feature allows Backup Exec to be aware of, and control, both the Deduplication and Replication functions of
the 3rd-party hardware appliance.
Appliance Deduplication uses Symantec’s OpenStorage (OST) technology, in conjunction with both a 3rd-party
deduplication appliance and a manufacturer-developed OST Plugin, to achieve the following:
• Faster backups: Generally, the OST interface, due to its stream-based nature, is faster than typical NFS and
CIFS-based backup targets. Additionally, deduplication appliances are tuned specifically for fast deduplication
calculations and may be faster than software-based solutions. Some manufacturers have seen 30% to 50%
improvement in backup speed when OST-based transfers are used instead of native CIFS/NFS transfers.
• Optimized Duplication: Many 3rd party deduplication appliances include a replication feature, where backed-
up data is efficiently moved from one device to another downstream device. Optimized Duplication is a feature
that allows the Backup Exec Media Server to track backups regardless of where the deduplication appliance may
replicate them, so any Backup Exec media servers that share catalogs through the Central Admin Server Option
(CASO) can be aware of data that lives in several places. Without Optimized Duplication support, Backup Exec
would not be aware of replicated copies of data without a manual inventory and cataloging operation. Backup
Exec’s Granular Recovery Technology (GRT) can also be used in conjunction with Optimized Duplication. This
patent-pending technology allows administrators to restore granular items in Exchange, SharePoint or Active
Directory with a single pass backup. However, the Optimized Deduplication (downstream) copy of the backup is
not in traditional GRT format. The backup set itself is GRT-enabled. In order to perform a GRT recovery from an
Optimized Deduplication copy of a GRT backup, the Optimized Deduplication copy must be staged to a disk
location before recovery of individual items can take place. Each copy of data, whether onsite or offsite, can
have different retention periods.
Refer to Figure 2 below for a graphical view of the Optimized Duplication process.
9
Backup Exec 2010: Deduplication Option
The primary benefits for Optimized Duplication in conjunction with Appliance Deduplication are straightforward.
Operational efficiencies are improved because the hardware device does the actual deduplication calculations, off-loading
these processes from the Media Server. Deduplication performance can also be increased, as specific deduplication
hardware devices usually perform deduplication calculations faster than a similar task done in. Lastly, and perhaps most
importantly, the Backup Exec Media Server has cataloged and is aware of backup sets that have been replicated to another
deduplication appliance. This allows customers to set separate retention periods for the various backup sets allowing for
different retention periods for different sites Symantec’s OST technology and Optimized Duplication has received positive
reviews in the Storage industry; W. Curtis Preston of the technology blog Backupcentral.com has a good write-up
regarding OST here: http://www.backupcentral.com/content/view/198/47/.
Appliance Deduplication requires that the Backup Exec Media Server be paired with one or more supported OST-based
deduplication appliances. At Backup Exec 2010’s launch, both ExaGrid and Quantum have (see the Backup Exec 2010
Hardware Compatibility List at http://support.veritas.com/docs/329256) devices that support basic backup and recovery
through OST. ExaGrid EX devices support Optimized Duplication. Symantec Backup Exec is committed to expanding the
breadth and depth of OST partners certified to work with Backup Exec, so additional OST devices are being certified and
supported as they complete Backup Exec’s internal qualification processes.
10
Backup Exec 2010: Deduplication Option
breaks out possible use cases and environments and Symantec’s recommendations for deduplication methodologies.
Symantec recommends that Deduplication be used as follows:
Remote Linux, Solaris, or Mac Servers Media Server Deduplication or Appliance Deduplication
Applications on Linux (SAP, Oracle, etc) Media Server Deduplication or Appliance Deduplication
NetBackup PureDisk 6.6 is a robust and scalable solution for customers that need to store more than 16TB of
Deduplicated data in a media server. Currently a single NetBackup PureDisk 6.6 server can support 32TB of Deduplicated
data, and additional PureDisk servers can be joined into a pool of PureDisk storage, to exponentially grow capacity to
hundreds of TB’s of Deduplicated data, while also improving throughput, and adding redundancy to your backup
environment.
The ability to do Client Deduplication with a combination of Backup Exec and PureDisk is possible because the Backup
Exec Agents contain much of the same logic that is contained by a stand-alone PureDisk client. This has some side effects
that customers will need to be aware of – namely, that the PureDisk client and a Backup Exec Remote Agent cannot exist
on the same system.
11
Backup Exec 2010: Deduplication Option
This architecture opens up several interesting possibilities with both PureDisk and Backup Exec in the same environment.
Since PureDisk can accept backup data directly from BE Remote Agents or from data backed up through the Media Server,
customers who need the performance and scalability offered by PureDisk can combine the ease-of-use of Backup Exec
with the robust PureDisk back end. In fact, a mix of NetBackup clients, PureDisk clients, and Backup Exec Remote Agents
can all backup data to a single PureDisk data store.
Additionally, through OST integration, Backup Exec can control and manage PureDisk’s replication of its backup data
between PureDisk content routers and Storage Pools. This function is nearly identical to the Optimized Duplication
feature described above. This way, multiple Backup Exec servers with the CASO can be aware of backup data contained in
PureDisk Storage Pools regardless of where customers need to replicate data around their environment. See Figure 3
above for a description of Optimized Duplication and the benefits of using Optimized Duplication technology.
These Set Copy jobs are separate jobs, but aside from the typical job configuration information needed to create a job,
Backup Exec handles the details of copying deduplicated data to a tape location. It’s really as simple as creating a Set
Copy job with a tape device as the destination, and run it.
For data that was backed up using Client or Media Server deduplication, the Media Server will be responsible for
recreating whole files from deduplicated data before transferring to tape, so there will be some impact to processor and
memory usage during the Set Copy operation. While resource consumption varies based on data set, at most the Set Copy
process will use 100% of one processor core while performing the Set Copy.
For data that was protected to a PureDisk Storage Pool or to a Deduplication Appliance, that device will be responsible for
rehydrating the data prior to it being sent to tape.
Conclusion
Backup Exec 2010 introduces some excellent new capabilities around storage management through Deduplication. The
Deduplication Option provides customers with the ability to reduce backup storage by 90% or more, improve backup
windows, and facilitate better remote office protection. The Backup Exec Deduplication Option gives administrators the
flexibility to choose what type of deduplication to use – whether it is software-based deduplication with Client or Media
Server deduplication, or hardware-based deduplication with devices from OST partners like ExaGrid and Quantum.
Symantec has taken an embrace and extend strategy by partnering with hardware deduplication appliance vendors to add
value. This is accomplished by integrating management of appliance deduplication and replication operations, and
Backup Exec 2010’s Deduplication Option can leverage these capabilities for advanced multi-site and multi-appliance
recovery operations. The Backup Exec Deduplication Option allows customers complete freedom to use whatever type of
12
Backup Exec 2010: Deduplication Option
deduplication they desire, whether it’s deduplication through a 3rd-party appliance, or a mix of client and media server
deduplication.
13
About Symantec
Symantec is a global leader in providing security,
storage and systems management solutions to help
consumers and organizations secure and manage
their information-driven world. Our software and
services protect against more risks at more points,
more completely and efficiently, enabling
confidence wherever information is used or stored.