
Tom Chandler, IBM ProtecTIER Solution Architect - EMEA, Storage Technology 2010 - June 2010

IBM Deduplication

2010 IBM Corporation

Storage Technology 2010

Deduplication is key to using disk more cost-effectively!


IBM Deduplication IBM Delivers Deduplication Solutions


[Diagram: where IBM deduplication is available along the backup path. Labels: TSM 6.2 client dedup (desktop clients); FastBack dedup over the WAN; productive server and LAN-free backup client on the LAN; backup server running Storage Manager 6 (TSM 6.1 dedup) with a disk buffer; over the SAN, ProtecTIER TS7650 (VTL) and tape as backup/recovery storage; SVC, XIV, DS8000, DS 3/4/5* and N series disk as primary storage.]

IBM Data De-Duplication Options


IBM N series
  VMware environments, user file spaces
IBM Tivoli Storage Manager
  Data reduction with Incremental Forever
  Data de-duplication for disk pools
  Future: client de-duplication!
IBM TS7600 family
  Virtual tape library with ProtecTIER software
  HyperFactor: industry-leading data de-duplication solution for Open Systems (TS7650) and System z (TS7680)
Storage Hierarchy

Represented Disk Capacity buffer

Physical Capacity


Backup with TSM Disk Buffer


Client
Data reduction before / during backup
Incremental forever
LAN

LAN-free Client

Backup Server

TSM disk pools can be de-duplicated
More space for critical data
Smaller disk pool

SAN

Disk
Represented Capacity Disk-buffer Physical Capacity

Disk

De-duplicated disk buffer Disk Tape



Storage consumption with N series data de-duplication


VI3 Server

N series de-duplication removes redundant VMware data


VMDK VMDK

VMDK

VMDK

Datastore A

Reduce OS & applications to a single copy

Duplicate data removed


VMDK VMDK VMDK VMDK

VMs only consume storage for their unique data
Reduce Storage Costs with Virtualization

Flexvol IBM N series System

N series de-duplication provides the same benefits as VMware's shared memory functionality

Backup Solution with IBM TS7600


LAN Client

LAN

LAN-free Client

Possible reduction of required disk capacity: 1:5 to 1:25
Strong dependency on:
  Backup process used
  Type of data
Bandwidth vs. cost: compare with using multiple physical drives
Backup Server

Tape

SAN
Disk-buffer

Virtualization

Represented Disk Capacity buffer Disk

Physical Capacity

Disk Tape

TS7650G ProtecTIER VTL may be best when

Greater than 6 TB of data backed up nightly
Many backup servers, both TSM and other
Large disk caching requirements for secondary storage
Heterogeneous tape management
Larger or dedicated storage management staff

Tivoli Storage Manager may be best when

Prefer an integrated software solution with no specific hardware dependencies
TSM manages majority of data on tape
Data backed up nightly 6 TB or less
Spare resources are available to dedicate to dedup processing
Moderately sized TSM server installation

De-duplication Topologies
1. Post Processing & Inline
2. Hash Based Approach
3. IBM HyperFactor
4. Case Studies

Two Basic Implementations with Deduplication

#1 Inline
As data is received by the target device, it is deduplicated in real time; it is not temporarily stored on disk
Data written to the disk storage is de-duplicated

#2 Post Processing
As data is received by the target device it is temporarily stored on disk storage

Data is subsequently read back in to be processed by a de-duplication engine
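
The difference between the two implementations can be sketched in a few lines. This is a minimal illustration only, not any vendor's implementation: the function names, the in-memory dict standing in for disk storage, and the use of SHA-1 over whole chunks are all assumptions made for the example.

```python
import hashlib

def dedupe_chunks(chunks, store):
    """Store each distinct chunk once, keyed by its SHA-1 digest; return the recipe."""
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)  # written only if not already present
        recipe.append(digest)
    return recipe

def inline_backup(stream, store):
    # Inline: chunks are deduplicated as they arrive; no raw copy is kept.
    return dedupe_chunks(stream, store), 0  # 0 bytes of landing zone used

def post_process_backup(stream, store):
    # Post-processing: land the raw stream first, then read it back to dedupe.
    landing_zone = list(stream)
    landed_bytes = sum(len(c) for c in landing_zone)
    return dedupe_chunks(landing_zone, store), landed_bytes

backup = [b"A" * 8, b"B" * 8, b"A" * 8]
inline_store, post_store = {}, {}
inline_recipe, inline_landed = inline_backup(iter(backup), inline_store)
post_recipe, post_landed = post_process_backup(iter(backup), post_store)
```

Both paths end with the same de-duplicated store; the post-processing path additionally needs temporary space for the full raw backup and a second pass over it.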



The Advantages of (Async Dedupe) Postprocess are:

No concerns about slowing down incoming backup speed
Allows for staggered implementation of dedupe
Allows you to copy last night's backups in their original format

The Disadvantages of (Async Dedupe) Postprocess are:

It has a lot more I/O work to do
It requires the landing zone disk
It requires more configuration than an inline approach
It allows the vendor to advertise numbers that aren't quite real
It will delay replication to a remote site by minutes or hours, depending on which product we're talking about

Source: http://www.backupcentral.com/content/view/134/47

The Advantages of (Sync Dedupe) Inline are:

Less I/O work to perform
When you're done, you're done
Data can be replicated the second it shows up
Simpler configuration
No landing zone required

The Disadvantages of (Sync Dedupe) Inline are:

May slow down the incoming backup speed

Source: http://www.backupcentral.com/content/view/134/47

Hash-Based Approach


Hash Based Approach


1. Slice data into chunks (fixed or variable): A B C D E
2. Generate a hash per chunk and save it: Ah Bh Ch Dh Eh
3. Slice the next data into chunks and compare hashes with the table: A B C D E
4. Reference data previously stored
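
The four steps can be sketched as follows. This is a toy example under stated assumptions: fixed-size chunking with a 4-byte chunk, SHA-1 as the hash, and a Python dict as the hash table; the class and function names are hypothetical and chosen only for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed chunk size, for illustration only

def chunk(data: bytes, size: int = CHUNK_SIZE):
    """Step 1: slice data into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

class HashDedupStore:
    def __init__(self):
        self.chunks = {}  # hash table: digest -> chunk bytes

    def write(self, data: bytes):
        recipe = []
        for piece in chunk(data):
            digest = hashlib.sha1(piece).digest()  # step 2: hash per chunk
            if digest not in self.chunks:          # step 3: compare with table
                self.chunks[digest] = piece        # new chunk: store it
            recipe.append(digest)                  # step 4: reference stored data
        return recipe

    def read(self, recipe):
        return b"".join(self.chunks[d] for d in recipe)

store = HashDedupStore()
first = store.write(b"AAAABBBBCCCCDDDDEEEE")
second = store.write(b"AAAABBBBXXXXDDDDEEEE")  # only the XXXX chunk is new
```

After both writes the store holds six distinct chunks instead of ten, and each backup is represented only by its list of digests.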



Assessment for Hash-Based Approach


Applicable for all chunking methods
Hash collisions must be handled
  Must be prevented through secondary comparison (additional metadata, second hash method, binary comparison)
Requires a hash table to store the hash of all chunks
  Hash table will grow with data volume
  Hash table must be quickly searchable and accessible
Growing hash table may become a performance bottleneck (doesn't fit into RAM)
  Scalability issues

Hashing
MD5: 128 bits
SHA-0: 160 bits
SHA-1: 160 bits
...
SHA-384, SHA-512
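
The digest lengths can be checked with Python's standard hashlib module. A small sketch; SHA-0 is not available in hashlib, so it is left out here.

```python
import hashlib

# Digest size in bits for the algorithms named on the slide (SHA-0 is not in hashlib).
digest_bits = {
    name: hashlib.new(name).digest_size * 8
    for name in ("md5", "sha1", "sha384", "sha512")
}

for name, bits in digest_bits.items():
    print(f"{name}: {bits} bits")
```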


Hash Collision
Hash collision (n.): a term in computer programming for a situation that occurs when two distinct inputs into a hash function produce identical outputs. The possibility of a hash collision (two chunks of different data assigned the same hash) is not zero. A 10 TB repository has 1.25 billion 8 KB blocks; even with a low probability, when you are managing that many blocks for a long time, the likelihood increases.
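
How the likelihood grows with the number of blocks can be estimated with the standard birthday approximation. This is a sketch under the assumption that hash outputs are uniformly distributed; the function name is hypothetical, and the result is an estimate rather than a statement about any particular product.

```python
import math

def collision_probability(n_blocks: int, hash_bits: int) -> float:
    """Birthday approximation: P ~ 1 - exp(-n(n-1) / (2 * 2^bits))."""
    pairs = n_blocks * (n_blocks - 1) // 2
    return -math.expm1(-pairs / 2 ** hash_bits)

# 10 TB of 8 KB blocks, in the decimal units the slide uses.
n_blocks = 10 * 10**12 // 8000

p_sha1 = collision_probability(n_blocks, 160)  # 160-bit hash
p_md5 = collision_probability(n_blocks, 128)   # 128-bit hash

print(f"blocks: {n_blocks:,}")
print(f"160-bit hash: {p_sha1:.3e}")
print(f"128-bit hash: {p_md5:.3e}")
```

The probability is small but non-zero, and it grows roughly with the square of the block count, which is why hash-based designs add a secondary comparison.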


Hash Collision

Hash Index Size

The Index is Everything


Understanding the Knowledge Base
A key metric is the means used to map the user content
Balancing performance vs. capacity
With hash schemes, the hash for a chunk is remembered in an index
For example purposes, imagine a chunk size of 8 KB
  A 1 TByte repository has ~134,000,000 8 KB chunks
  Each hash (signature) is 20 bytes long
  A pointer scheme is needed to reference inside the 1 TByte
  The hashes require 2.9 GBytes of memory: no issue
With a 100 TByte repository
  ~300 GBytes of memory is required
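
The index-size arithmetic can be reproduced as a short calculation. This sketch counts the 20-byte hashes only and assumes binary units (1 TByte = 2^40 bytes); that gives about 2.7 GB per TByte, so the slide's 2.9 GB presumably also includes the pointer overhead, which is an assumption here since the pointer size is not stated.

```python
CHUNK_BYTES = 8 * 1024  # 8 KB chunk size from the example
HASH_BYTES = 20         # 20-byte hash (signature) per chunk
TIB = 2 ** 40           # 1 TByte in binary units (assumption)

def chunk_count(repo_bytes: int) -> int:
    return repo_bytes // CHUNK_BYTES

def hash_bytes_needed(repo_bytes: int) -> int:
    """Memory for the hashes alone, without pointers."""
    return chunk_count(repo_bytes) * HASH_BYTES

one_tb = hash_bytes_needed(TIB)
hundred_tb = hash_bytes_needed(100 * TIB)

print(f"chunks per TByte: {chunk_count(TIB):,}")
print(f"hashes for 1 TByte:   {one_tb / 1e9:.2f} GB")
print(f"hashes for 100 TByte: {hundred_tb / 1e9:.0f} GB")
```

The index grows linearly with the repository, which is the scaling problem the following slides describe.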



The Index is Everything


Last but not least - performance
If the index gets too big, it must be paged to disk; performance crashes to 40-70 MB/sec

A quote from our customer!


"The performance of the hash based solution is good until the storage hits a size around 28-30TB. At that point the index needs to be stored on disk because it won't fit in memory. When that happens, our performance tanks to 60 MB/s or less. We end up ordering an extra VTL appliance to make up for the loss in performance. This means more administration overhead due to configuration of the extra repository and service costs."

The Index is Everything


Last but not least - performance
If the index gets too big, it must be paged to disk; performance crashes to 40-70 MB/sec

A quote from a vendor!

"Over a year ago, we (NEC) actually conducted tests with a hash-based vendor and verified that their maximum stated numbers were indeed *only* applicable for 100% duplicate data. For 0% duplicate data (i.e. all data actually written to disk), the max throughput numbers we measured with the same system were 45% lower."

Sizing Requirements for Hash Based Solution


ProtecTIER versus Hash Based Real World Example


The following is an example of a real customer that needed deduplication to improve their backup and recovery environment, and how both the hash-based vendor and IBM proposed to solve their problems.

Big financial company's environment and issues
  NetBackup
  Weekly 100TB fulls and 20TB incrementals nightly
  60-day retention
  Data type: file system, database and Exchange
Pain points
  Backup window too short and shrinking
  Recovery from tape too slow; wanted to use disk for short-term recoveries
  Tried using straight disk, but it was too expensive and managing file systems was painful
  Tried NBU PureDisk, but it slammed performance of backup servers

Review the following actual proposals and you'll see why this customer chose IBM's enterprise-class solution over the hash-based approach.

Hash-Based Vendor's Proposed Solution:

Four separate appliances
No global deduplication
No centralized management
4 head units and 24 shelves
No failover, no clustering

[Diagram: Clients and Backup Servers connected to four separate appliances of 35TB each]

IBM's Proposed ProtecTIER Solution:

1000MB/s
Global deduplication
Centralized management
Semi-automated failover
Future scalability to 1PB

Dual node TS7650G Cluster with an IBM Storage Array


Backup Servers

Clients

Actual Results:
Full 100TB backup: 36 hours, 809MB/s
Incr 20TB backup: 7 hours, 832MB/s
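
The throughput figures follow directly from the backup size and duration. A quick check, assuming binary units (1 TB = 1024 x 1024 MB), which is the convention that reproduces the numbers on the slide; the function name is hypothetical.

```python
def throughput_mb_per_s(terabytes: float, hours: float) -> float:
    """Average throughput, with 1 TB taken as 1024 * 1024 MB (assumption)."""
    megabytes = terabytes * 1024 * 1024
    return megabytes / (hours * 3600)

full = throughput_mb_per_s(100, 36)  # full 100TB backup in 36 hours
incr = throughput_mb_per_s(20, 7)    # incremental 20TB backup in 7 hours

print(f"full: {full:.0f} MB/s, incremental: {incr:.0f} MB/s")
```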


IBM's ProtecTIER Large Repositories: > 100 TB

AT&T - 900TB repository
Humana - 200TB repository
JCP - 300TB repository
MetLife - 480TB
Pepsi - 400TB
Thomson Reuters - 405TB
UPMC - 200TB
Vanguard - 250TB
Lloyd's - 256TB
Mapfre - 128TB
ABSA - 250TB
NAB (National Australia Bank) - 750TB
Dual node TS7650G Cluster with an IBM Storage Array
Backup Servers


Clients


VTL-Dedupe Hybrid (Differential) Approach (Inline)

TS7650 (Open Systems & System i)

TS7680 (z-OS)


HyperFactor Approach
1. Locate data in the backup stream that is similar to content stored in the repository
2. After locating similar content, retrieve the existing content from the repository and run a byte-level check between existing and incoming data
3. Matches are factored out; unique data is added to the repository

[Diagram labels: New Data Stream; Element A, Element B, Element C]
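
The general idea of similarity detection followed by a byte-level comparison can be illustrated with a toy example. This is emphatically not the HyperFactor algorithm, which the slides do not describe in detail: the shingle-overlap similarity measure, the use of Python's difflib for the byte-level comparison, and every function name below are assumptions swapped in for illustration.

```python
import hashlib
from difflib import SequenceMatcher

def shingles(data: bytes, k: int = 8) -> set:
    """All k-byte substrings; a crude stand-in for a similarity signature."""
    return {data[i:i + k] for i in range(len(data) - k + 1)}

def most_similar(repository: list, incoming: bytes) -> bytes:
    """Step 1: pick the stored element sharing the most shingles with the input."""
    incoming_shingles = shingles(incoming)
    return max(repository, key=lambda e: len(shingles(e) & incoming_shingles))

def make_delta(existing: bytes, incoming: bytes) -> list:
    """Step 2: byte-level comparison; keep references for matches, bytes for new data."""
    ops = []
    matcher = SequenceMatcher(None, existing, incoming, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # already in the repository
        elif j2 > j1:                               # "replace" or "insert"
            ops.append(("data", incoming[j1:j2]))   # step 3: unique data to store
    return ops

def apply_delta(existing: bytes, ops: list) -> bytes:
    out = bytearray()
    for op in ops:
        out += existing[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

def pseudo_random(seed: str, blocks: int = 32) -> bytes:
    """Deterministic filler data (1 KB by default) for the demo."""
    return b"".join(hashlib.sha256(f"{seed}{i}".encode()).digest() for i in range(blocks))

repository = [pseudo_random("a"), pseudo_random("b")]
incoming = repository[1][:300] + b"xyz" + repository[1][300:]

base = most_similar(repository, incoming)
delta = make_delta(base, incoming)
new_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
```

Here the incoming stream differs from a stored element by a small insertion, so only a few new bytes need to be stored, and `apply_delta(base, delta)` rebuilds the incoming data exactly.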


HyperFactor Approach
HyperFactor has two indexes

HyperFactor index, used for backup
  Fixed size of 4 GB, stored in memory
  Contains most similar data elements
  Used to filter out similar elements from data stream
Restore index, used for restore
  Dynamic index, growing
  Includes reference of de-duped objects
  Stored on disk system

Assessment for HyperFactor


No hash table required
No scalability issues
4 GB index references 1 PB of physical data
No dependency on data format and application
Very flexible
HyperFactor index always fits in memory
Enables in-band de-duplication
Eliminates the phenomenon of missed factoring opportunities
Looks for similarity between data, not for exact chunk matches

Inside ProtecTIER TS7650G


New Data Stream

HyperFactor

Repository

Memory Resident Index


Disk Arrays

FC Switch

TS7650G

Existing Data

Filtered data Backup Servers



ProtecTIER Conceptual Flow Chart

1. New backup data arrives
2. Locate pattern (RAM search of the memory index)
3. Read similar pattern from repository (repository data)
4. Compute delta
5. LZH compression (2:1)
6. Store new backup delta

ProtecTIER Conceptual Flow Chart


When data is written from the backup application to the VTL, the VTL receives the write command and the data, in this case LER001. The data is stored on the disk in segments and the VTL uses a metadata file, or database, to keep track of each segment of data. When the VTL receives a read command from the backup application, it uses the metadata or restore index to retrieve the data and present it to the backup application as sequential data.
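
The role of the metadata can be shown with a toy model. This is a sketch under heavy assumptions: segments are fixed-size and matched exactly (a stand-in, not how ProtecTIER finds duplicates), everything lives in memory, and the class and attribute names are hypothetical.

```python
class ToyVTL:
    """Keeps segments once and a per-cartridge list of segment positions."""

    def __init__(self, segment_size: int = 4):
        self.segment_size = segment_size
        self.segments = []     # segment bytes as laid out "on disk"
        self.segment_ids = {}  # segment bytes -> position in self.segments
        self.metadata = {}     # cartridge barcode -> ordered segment positions

    def write(self, barcode: str, data: bytes) -> None:
        refs = self.metadata.setdefault(barcode, [])
        for i in range(0, len(data), self.segment_size):
            segment = data[i:i + self.segment_size]
            if segment not in self.segment_ids:
                self.segment_ids[segment] = len(self.segments)
                self.segments.append(segment)
            refs.append(self.segment_ids[segment])

    def read(self, barcode: str) -> bytes:
        # The metadata turns scattered segments back into sequential data.
        return b"".join(self.segments[pos] for pos in self.metadata[barcode])

vtl = ToyVTL()
vtl.write("LER001", b"AAAABBBBAAAACCCC")
restored = vtl.read("LER001")
```

The backup application only ever sees the cartridge as a sequential stream; the segment layout and the metadata lookup stay internal to the VTL.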


HyperFactor Approach IBM TS7650


Most de-duplication products are based on a simple standard cryptographic hash algorithm
Most use SHA-1 (160 bit) hashing algorithms
A hash is generated for each chunk of data
For each TB of repository, the index grows by approximately 3 GB
If the index gets too big, it must be paged to disk; performance crashes to 40-60 MB/sec
IBM Diligent first tried hash technology and discovered its limitations
HyperFactor was developed by IBM Diligent engineers and Israeli mathematicians
A bit-for-bit comparison (differential) is used to guarantee unique data is not discarded
Only 4 GB is needed for a petabyte repository
Small index remains in memory
Steady non-degrading performance of 600 MB/s or more; a cluster solution delivers 1000 MB/s
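
The two index-size figures on this slide can be put side by side. A small sketch using only the slide's numbers (about 3 GB of hash index per TB, a fixed 4 GB HyperFactor index); the 64 GB of server memory is a hypothetical value chosen purely for illustration.

```python
HASH_INDEX_GB_PER_TB = 3  # from the slide: ~3 GB of index per TB of repository
HYPERFACTOR_INDEX_GB = 4  # from the slide: fixed 4 GB index, up to 1 PB
RAM_GB = 64               # hypothetical server memory, for illustration only

def hash_index_gb(repo_tb: int) -> int:
    return repo_tb * HASH_INDEX_GB_PER_TB

def fits_in_ram(index_gb: int, ram_gb: int = RAM_GB) -> bool:
    return index_gb <= ram_gb

# (repository TB, hash index GB, hash index fits in RAM, fixed index fits in RAM)
rows = [
    (tb, hash_index_gb(tb), fits_in_ram(hash_index_gb(tb)), fits_in_ram(HYPERFACTOR_INDEX_GB))
    for tb in (1, 10, 100, 1000)
]

for tb, index_gb, hash_fits, fixed_fits in rows:
    print(f"{tb:>5} TB: hash index {index_gb:>5} GB (fits: {hash_fits}), fixed index fits: {fixed_fits}")
```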

ProtecTIER Clustering Overview


Active-Active 2-node cluster (architecture will allow for increasing node count over time)
Full repository sharing among nodes
  Writing data to the repository
  Reading data from the repository (restore and read reference)
  Access to all virtual devices
No degradation of HyperFactor efficiency (regardless of the node through which the data is received)
Minimum cluster down-time

System Overview: Clustered ProtecTIER Nodes


PT Manager Media Server
Cluster: MyCluster Node A Node B

Single Virtual Library Image

Network

Repository
CFS Metadata files STU data files
11-Jun-10

Hardware Deployment Diagram TS7650G


Two Node Cluster Configuration
Power Switch ProtecTIER Gateway 1 ProtecTIER Gateway 2

Cluster Internal Network Switch 1

Cluster Internal Network Switch 2

Storage Fabric

Disk Arrays


Control Path Failover in Dual-Node Configuration


ALL PATHS TO NODE LOST WITH CPF

HOST

CPF
ProtecTIER Server ProtecTIER Server

Virtual Accessor Virtual Drives



Unavailable Unavailable

Active Available


Data Storage Repository

Metadata (RAID-10)
  HyperFactor Index
  Virtual Volume files
  Library Configuration Data
  Storage Management Data

User Data (RAID-5)
  User Data from Backup Application

ESG Lab Validation Tests - 2009

IBM TS7650 ProtecTIER Deduplication Family


TS7650G Gateways (flexible storage)
  Active-Active Cluster: up to 1000 MB/sec, 1 PB useable (highest performance, largest capacity, high availability)
  Single Node: up to 500 MB/sec, 1 PB useable (high performance, high capacity)

TS7650 Appliance
  Active-Active Cluster: up to 500 MB/sec, 36 TB useable (highest performance, largest capacity, high availability)
  Single Node: up to 500 MB/sec, 36 TB useable (highest performance, largest capacity)
  Single Node: up to 250 MB/sec, 18 TB useable (better performance, larger capacity, scalable)
  Single Node: up to 100 MB/sec, 7 TB useable (good performance, highly scalable, low cost)

Scalable capacity and performance

IBM System Storage TS7650G ProtecTIER Deduplication Gateway


Inline performance - sustainable at 1000 MB/sec
Up to 25:1 data reduction
Scalable to 1PB physical capacity (20:1 = 20PB nominal)
Single node and cluster configuration
IBM HyperFactor: industry-leading inline deduplication
Enterprise-class data integrity
LTO drive emulation
Designed for performance scaling
High performance
IBM & non-IBM disk support: DS3XXX, DS4XXX, DS5XXX, DS8XXX, XIV, SVC, N-Series, EMC CX, HDS AMS1000 / USP, HP EVA
Only Inline High Availability Solution in the Market Today
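
The relation between physical and nominal capacity is a simple multiplication by the deduplication ratio. A minimal sketch with hypothetical function names; the ratios are the ones quoted on the slide, and actual ratios depend on the data and the backup process.

```python
def nominal_capacity(physical: float, ratio: float) -> float:
    """Nominal (represented) capacity for a given physical capacity and dedup ratio."""
    return physical * ratio

def physical_needed(nominal: float, ratio: float) -> float:
    """Physical capacity needed to represent a given nominal capacity."""
    return nominal / ratio

gateway_nominal_pb = nominal_capacity(1, 20)  # 1PB physical at 20:1
best_case_pb = nominal_capacity(1, 25)        # 1PB physical at the quoted 25:1 maximum

print(f"20:1 -> {gateway_nominal_pb} PB nominal, 25:1 -> {best_case_pb} PB nominal")
```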

TS7650G Gateway

3Q08

Clustered TS7650G Gateway

Data Deduplication

End to End Data Deduplication for System z


Supports standard Tape Applications

Emulates an IBM Tape Library

Deduplicates with ProtecTIER

Stores data on a variety of disk storage

TS7680

Disk Cache

FICON FICON Switch/Director Switch/Director

TS3500

VTL Deduplication System z Tape Storage (Less active data)

Comprehensive solution builds on IBM z/OS, tape, tape virtualization and ProtecTIER deduplication

Replication with ProtecTIER


Backup Server Represented capacity

Primary Site
ProtecTIER Gateway

Physical capacity Backup Server PT-server based replication

Significant bandwidth reduction

Represented capacity

Secondary Site

Backup Server

ProtecTIER Gateway

Physical capacity


ProtecTIER Many-to-one Replication Overview


Up to 12 Branch Offices (spokes): Gateways and/or Appliances
1 target (hub): Appliance, Gateway, single or two-node cluster

IP based NR links

Hub repository includes local backups and remote DR copies
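
The many-to-one layout can be modelled as a hub with a bounded list of spokes. This is only an illustrative data model, not ProtecTIER configuration syntax; the class, field and system names are hypothetical, and the limit of 12 is the figure given on the slide.

```python
from dataclasses import dataclass, field

MAX_SPOKES = 12  # from the slide: up to 12 branch offices per hub

@dataclass
class Hub:
    name: str
    spokes: list = field(default_factory=list)

    def add_spoke(self, spoke_name: str) -> None:
        """Register a branch office that replicates to this hub."""
        if len(self.spokes) >= MAX_SPOKES:
            raise ValueError(f"{self.name} already has {MAX_SPOKES} spokes")
        self.spokes.append(spoke_name)

hub = Hub("central-dr-site")
for i in range(MAX_SPOKES):
    hub.add_spoke(f"branch-{i + 1}")
```

A thirteenth `add_spoke` call raises `ValueError`, mirroring the stated limit.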

Backup Server

ProtecTIER Gateway

Physical capacity

Virtual cartridges can be cloned to tape by the Main-Site B/U server
Tape library

Central / DR Site

Protect More. Store Less.


IBM Data Deduplication Case Studies


Lloyds Banking Group formerly Halifax Bank of Scotland (HBOS)


Business challenge
Lloyds created one of the largest SANs in Europe. Nightly backup volumes exceed 1,000 TB and they have more than 5 PB of centrally managed storage. Shrinking backup windows, backup failures and data growth at 55% CAGR led them to evaluate a new disk-based backup and recovery infrastructure. This client chose IBM's TS7650G data deduplication solution over final contender Data Domain. This deal is now IBM's largest ProtecTIER installation across Europe.

Solution
10 TS7650G ProtecTIER Deduplication Gateways

Benefits
Executes backups to disk with a retention of 180 days, providing faster backups and even quicker restores
Saved over 100 square meters of floor space by eliminating tape libraries through this implementation
Off-site backups are no longer needed; data is electronically copied and replicated safely and efficiently
Enables customer to re-use existing disk infrastructure

IBM's TS7650G ProtecTIER seamlessly integrated into an existing backup environment using TSM, removed the complexity of failed backups and restores, and will help them contain the growth rate of their data sets


SEB Bank
Business challenge
SEB wants to be tapeless, with incremental data totalling 4.2 PB in 2-3 years. SEB had already moved to a disk-based backup and recovery solution using a hash-based VTL solution, but was hampered by its inability to provide the performance, scalability, and capacity needed to meet their backup and recovery requirements. As new datasets were added and their environment continued to grow, performance and capacity suffered. With the current VTL appliances, their only choice was to keep adding appliances to try to solve the problem. SEB decided not to invest any more time and money and opted for IBM's TS7650G deduplication solution in a clustered configuration to have a more robust, dependable solution that could guarantee performance and scalability.

Solution
IBM's ProtecTIER TS7650G (6 in a clustered configuration)
IBM DS8000 disk arrays (2)

Benefits

Provides industry-leading performance, scalability and availability with true global deduplication technology
IBM provided SEB a solution of 6 TS7650 Gateways vs. the hash-based vendor's 34 appliances to handle the same amount of data
Enables SEB to manage their environment holistically and will enable SEB to meet their goal of going tapeless in the next 2-3 years

With industry-leading performance, scalability and capacity, ProtecTIER continues to exceed expectations on meeting customer requirements of all sizes

TS7650G Dual Cluster Nodes / Standby SEB Bank Stockholm, Sweden


[Diagram: two sites, Rissne and Grytet, with the Rissne SAN and Grytet SAN linked by dark fibre (>20km). Labels: Legato Networker 7.5 on Solaris 10 (1500 clients); Legato Networker 7.5 standby (1000 clients); PT test server; TS7650G nodes Rissne Node 1, Rissne Node 2, Grytet Node 1 and Grytet Node 2, plus a standby TS7650G at each site; a DS8700 at each site holding a 60 TB repository (60TB prime, 60TB mirror, and 1TB P / 1TB M volumes) with bi-directional MetroMirror.]

Some customer experiences with IBM TS7650


Helios
Dedupe rate 12:1
8 TB usable, 96TB nominal
Databases, files, emails
Backup server: Backup Exec
Backup/Restore Requirements: 300 MB/s

Hilti
Dedupe rate 16:1
30% databases (SQL, Oracle and SAP), 70% files
Retention time: 21 days = files, 3 months = DB
Backup server: NetBackup
Backup/Restore Requirements: 400 MB/s

Ekom21
Dedupe rate 8:1
OS, files (Incremental Forever), emails, MySQL, Oracle, Informix
Daily full backups/incrementals
Backup software: TSM 6.1
Backup/Restore Requirements: 300 MB/s
Cartridge level IP replication

Key Facts IBM ProtecTIER VTL


Launched in Q4 2005 - First VTL with Deduplication
Installed in all major industries
Over 1000 systems in production
In EMEA, current install base (May 2010):
  Shipped 205 clusters
    Average cluster repository 58.3TB
  Shipped 120 single nodes
    Average single node repository 39TB
  Shipped 60 appliances
Open Systems, AS/400, z/OS support
Disk-based and IP-based replication support

Questions?

Thank you!

Notices and Disclaimers


Copyright 2008 by International Business Machines Corporation. All rights reserved. No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation. Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. This document could include technical inaccuracies or typographical errors. IBM may make improvements and/or changes in the product(s) and/or program(s) described herein at any time without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Any reference to an IBM Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program that does not infringe IBM's intellectual property rights may be used instead. THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT. IBM shall have no responsibility to update this information. IBM products are warranted, if at all, according to the terms and conditions of the agreements (e.g., IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources.
IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. IBM makes no representations or warranties, expressed or implied, regarding non-IBM products and services. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A.
