Deduplication, Incremental Forever, and the Olsen Twins
DIFATOTWP-20110921-01
What do deduplication, incremental forever, and the Olsen twins have to do with each other? It's all about duplicate data. Mary-Kate and Ashley Olsen are fraternal twins. Identical twins share 100% of their DNA. Fraternal twins share about 50% of their DNA - the same as any other sibling.

If we created a DNA database for a set of identical twins, we would only need to store that DNA information once, since we know the DNA is identical. However, if we created a DNA database for the Olsen twins, we would either need to create two completely unique sets of data, or we would need techniques for understanding and categorizing which data is unique and which data is identical. These techniques for understanding and categorizing are what data deduplication is.

File-level deduplication, block-level deduplication, byte-level deduplication, and incremental forever are all techniques that eliminate duplicate data. Theoretically, the data reduction achieved is identical for each technique when 100% of the data is duplicated. Practically, there are significant differences in the time and computational resources each technique requires. It's when only some of the data is duplicated, as in the case of the DNA of fraternal twins, that the data reduction varies. The time and computational resources required by each technique also vary in this case.

In this paper, we'll compare and contrast the advantages and disadvantages of each of these techniques and explain why incremental forever, when combined with byte-level deduplication, is the superior methodology for reducing redundant data in the most efficient manner possible. Further, we'll discuss the advantages and disadvantages of both physical and virtual backup appliances versus dedicated deduplication devices.
The purpose of deduplication is to reduce the amount of data storage necessary for a given data set. Deduplication is a wide-ranging topic and is covered at length in our adaptive deduplication whitepaper. In this chapter, we're going to explore the basic types of storage-oriented deduplication. However, we're going to start off by discussing a key concept in storage-oriented deduplication - content awareness.
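Before diving into content awareness, it may help to see the core idea of storage-oriented deduplication in miniature. The sketch below is a generic illustration of fixed-size block-level deduplication, not the implementation of any particular product discussed in this paper: data is split into blocks, each block is identified by a cryptographic hash, and a block whose hash has already been seen is stored only once. The function names and the 4 KB block size are illustrative choices, not anything defined by the paper.

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks, storing each unique block once.

    Returns (store, recipe): `store` maps a block's SHA-256 digest to the
    block's bytes; `recipe` is the ordered list of digests needed to
    reconstruct `data`.
    """
    store, recipe = {}, []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # a duplicate block is stored only once
        recipe.append(digest)
    return store, recipe

def rebuild(store: dict, recipe: list) -> bytes:
    """Reassemble the original data from the block store and the recipe."""
    return b"".join(store[d] for d in recipe)
```

For "identical twin" data - three blocks of which two are byte-for-byte duplicates - the store holds only two blocks while the recipe still describes all three, which is the whole data-reduction trick in one line of bookkeeping.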
Content-aware deduplication simply means that the deduplication algorithms understand the content - sometimes called the "semantics" - of the data being deduplicated. Why is this important? The reason has to do with efficiency - not the efficiency of data reduction, but the resources (e.g., processor, memory, I/O) that are required for a given level of data reduction.