
New Options for Compression
Reducing data volume puts the squeeze on system cost.
by Steve Long

The Petabyte Era has arrived. A digital record of everything imaginable is continuously streaming into a data warehouse somewhere. If one of those data warehouses is yours, you are most likely looking for ways to reduce the cost of storing all of this data, as well as trying to figure out new ways to optimize access to it.

The Benefits

A tried-and-true method for saving space is data compression, which has been around since the middle of the last century, giving it plenty of time to improve and mature. The major benefits of compression are that it:
> Decreases the cost of storage by requiring less media, and it frees up space for storing even more data
> Improves system I/O performance by minimizing data movement between storage and memory
> Reduces I/O by enabling the placement of more data in cache memory
While the storage capacity of disks has grown rapidly, data transfer rates from disks are still limited by mechanical constraints. This, in addition to ever-increasing CPU power, results in many data warehouse workloads becoming I/O-bound. The idea behind data compression is to pay an assumed cost in increased CPU utilization for the benefit of smaller databases and reduced I/O to improve throughput.

Options Available

Teradata 13.10 includes a flexible set of compression mechanisms to meet a variety of circumstances and goals within your data warehouse environment. Compression rates of up to five times can be achieved. In other words, data can be reduced to as little as 20% of its original size.
The compression options included with Teradata 13.10 are:

> Multi-Value Compression
Teradata has included multi-value compression (MVC) in its databases for several years.



How It Works
Most data compression techniques are based on one of two models:

Statistical. Each distinct character of data is encoded, with the code assignment based on the probability of the character's appearance in the data. The common method of compressing data with this model is to use a compression algorithm that accesses uncompressed data, examines its characteristics, and looks for and eliminates repeating patterns. This reduces the actual quantity of data by minimizing redundancy.

Dictionary-based. A list (a dictionary) of commonly occurring values in the data and their corresponding codes is maintained. Because these values usually comprise the highest percentage of all values in a column, this type of compression condenses them, thereby optimizing disk space.
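
To make the dictionary model concrete, here is a minimal sketch in SQL. The table and column names are hypothetical, and MVC maintains its dictionary internally and transparently; the sketch only illustrates the idea of storing a compact code in place of a repeated value:

-- Hypothetical dictionary of common values and their compact codes.
-- (MVC does this internally; these tables are for illustration only.)
CREATE TABLE city_dict
  (city_code BYTEINT
  ,city_name VARCHAR(30));

-- The fact table stores the one-byte code instead of the repeated string.
CREATE TABLE orders
  (order_id INTEGER
  ,city_code BYTEINT);

-- On access, the code is used to look up the original value.
SELECT o.order_id, d.city_name
FROM orders o
JOIN city_dict d ON o.city_code = d.city_code;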

MVC is best used for data with repeated values. It has a side benefit of improving performance while saving disk space and I/O, with virtually no CPU cycles required for compressing or decompressing.
MVC is a dictionary-based compression that replaces values specified by the user with a compact bit pattern. When the data is accessed, the bit pattern is used to look up the original value in the list (or dictionary).
This type of compression does not apply to all data and requires an understanding and analysis of the data to find the list of common values to compress. Although it's fairly easy to apply via the CREATE TABLE statement, it does require up-front data analysis to be effective. The benefit is that it can attain very favorable compression rates with virtually no CPU cost.
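
One way to perform that up-front analysis is a simple frequency query. This is a generic sketch, not a Teradata-prescribed procedure; it uses the Customer table defined in the example below, and the 0.5% cutoff is an arbitrary assumed threshold you would tune for your data:

-- Rank the most frequent values in a column to pick a COMPRESS list.
-- The 0.5% threshold is an assumption; adjust it for your data.
SELECT Customer_Name
     , COUNT(*) AS occurrences
FROM Customer
GROUP BY Customer_Name
HAVING COUNT(*) > 0.005 * (SELECT COUNT(*) FROM Customer)
ORDER BY occurrences DESC;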
For the character data type, before Teradata 13.10, MVC was restricted to fixed-length character columns (CHAR data type) up to 255 characters wide. In Teradata 13.10, MVC was enhanced so it can now be used with variable-length character columns (VARCHAR data type) and columns up to 510 characters wide. It can support any numeric type, all character data, GRAPHIC, VARGRAPHIC, BYTE and VARBYTE.
The following shows the creation of a table called Customer and gives instructions to compress common values in the customer name column using MVC.

CREATE TABLE Customer
  (Customer_Account_Number INTEGER
  ,Customer_Name VARCHAR(50) COMPRESS ('Joe','Mary')
  ,Customer_Address CHAR(200));

With MVC, the common American names Joe and Mary (known to repeat many times in this particular table) are each assigned a small bit pattern, thereby liberating the space they would have otherwise consumed. The bit pattern is then used for access by queries or applications.
> Algorithmic Compression
Algorithmic compression (ALC), a new compression mechanism for the Teradata Database, is a good choice for compressing data with well-known attributes. It applies a compression algorithm to a column of data for compression and a matching decompression algorithm when the data is used. Exceptional results can be achieved on data with a well-understood structure, which can be compressed using an algorithm of your choice. For example, an algorithm built into Teradata 13.10 can compress Unicode data (two-byte characters) by up to a factor of two when much of the data may require only one byte.
Among the ALC algorithms packaged with Teradata 13.10, the one mentioned in the previous paragraph is designed for Unicode data columns. A second algorithm is included for Unicode and Latin data. Furthermore, the ALC infrastructure is open, so you can add algorithms of your choice. If you have data that compresses well with a particular algorithm, you can install the algorithm just as you would a user-defined function (UDF) and apply it to the data columns by including the UDF name in the CREATE TABLE statement when defining the table.
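
As a rough sketch of what that might look like, the following applies a user-installed UDF pair to a column. The names SensorLog, MyCompress and MyDecompress are hypothetical placeholders, not packaged Teradata objects; only the COMPRESS USING/DECOMPRESS USING clauses are the documented mechanism:

-- Hypothetical user-installed UDF pair applied via ALC.
-- MyCompress/MyDecompress are placeholder names, not shipped algorithms.
CREATE TABLE SensorLog
  (Reading_Id INTEGER
  ,Reading_Payload VARBYTE(2000)
    COMPRESS USING MyCompress
    DECOMPRESS USING MyDecompress);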
Another benefit is that if ALC and MVC are defined on the same data column, they will work together. ALC compresses those values not compressed by MVC, which may result in better compression rates than either method would achieve by itself.
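
A minimal sketch of the combination, assuming the packaged Unicode algorithms described below and the same Joe/Mary value list from the earlier MVC example; MVC catches the most frequent names, and ALC compresses everything else:

-- MVC handles 'Joe' and 'Mary'; ALC compresses the remaining values.
-- Assumes a Unicode column, as required by TransUnicodeToUTF8.
CREATE TABLE Customer
  (Customer_Account_Number INTEGER
  ,Customer_Name VARCHAR(50) CHARACTER SET UNICODE
    COMPRESS ('Joe','Mary')
    COMPRESS USING TransUnicodeToUTF8
    DECOMPRESS USING TransUTF8ToUnicode);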
ALC does have a CPU trade-off: CPU resources are used to execute the compression and decompression algorithms. The amount of CPU used depends on the algorithm being employed and the frequency with which the data is accessed. Therefore, unless the system has excess (i.e., unused) CPU capacity, ALC will impact system performance. The code below illustrates the creation of a table called Customer and gives instructions to compress the customer address column using the packaged TransUnicodeToUTF8 algorithm.
CREATE TABLE Customer
  (Customer_Account_Number INTEGER
  ,Customer_Name VARCHAR(50)
  ,Customer_Address CHAR(200) CHARACTER SET UNICODE
    COMPRESS USING TransUnicodeToUTF8
    DECOMPRESS USING TransUTF8ToUnicode);

Table: Characteristics of Compression Options
(MVC = multi-value compression; ALC = algorithmic compression; BLC = block-level compression)

Analysis required
  MVC: Need analysis of data for repeat values
  ALC: Can use packaged or build user-defined compression algorithms to match unique data patterns
  BLC: Turn on for all data on the system or apply selectively on a per-table basis

Ease of definition
  MVC: Easy to apply to well-understood data columns
  ALC: Easy to apply with CREATE TABLE
  BLC: Set once and forget

Flexibility
  MVC: Works for a variety of data and situations
  ALC: Automatically invoked for values not replaced by MVC
  BLC: Automatically combined with other compression mechanisms

Performance impact
  MVC: No CPU usage
  ALC: Depends on compression algorithm used
  BLC: Reduced I/O due to compression of data blocks; CPU cycles used for compression/decompression

Breadth of applicability
  MVC: Replaces common values
  ALC: CHAR, VARCHAR, BYTE, VARBYTE
  BLC: All data, block at a time, for specified tables

> Block-Level Compression
This compression mechanism operates on all types of data. It compresses all of the data in a data block before it's stored on disk. It can be applied to all tables in the system or on a table-by-table basis, but it cannot be applied to only select columns in a table (as ALC and MVC can).
Block-level compression (BLC) can generally achieve the highest compression rates: up to five times, which is a reduction of the data to as little as 20% of its original size. It can also yield a significant overall savings in kilobytes transferred per I/O. But it can have a more significant trade-off in CPU utilization than the other compression methods. BLC uses significant CPU on data-load operations and queries. To understand the magnitude of the potential impact, expect about 80 CPU seconds per gigabyte for compressing data and about 10 CPU seconds per gigabyte for uncompressing data; at those rates, loading 100GB would cost roughly 8,000 CPU seconds of compression work. So if your system is CPU-bound, use BLC only on data that will not be accessed during critical periods of the day, and make sure the session accessing these compressed objects runs at a low priority.
In the latest Teradata platforms, BLC makes excellent use of the more powerful processors. It also makes very effective use of any excess CPU power that might be available. Data with a low frequency of access (cold data) is an ideal candidate for BLC; it can reduce storage costs with little CPU utilization impact because of the infrequent access.
You compress data with BLC by using either of two methods:

> Query Band
When loading data that you want to compress into an empty table, use a query band option to indicate that the data blocks should be compressed.
To load compressed data into a table:

-- Turn BLC on
SET QUERY_BAND = 'BLOCKCOMPRESSION=YES;' FOR SESSION;

-- Insert into the empty table
INSERT INTO target_table
SELECT * FROM source_table;

-- Turn BLC off
SET QUERY_BAND = 'BLOCKCOMPRESSION=NO;' FOR SESSION;
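
As an optional sanity check before the load, you can confirm that the session's query band is set. GetQueryBand() is a standard Teradata function, though using it here is a suggested verification step rather than part of the documented procedure:

-- Returns the query band string for the current session.
SELECT GetQueryBand();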

> FERRET Tool
Use the file system utility FERRET to compress or uncompress existing data blocks in an already loaded table. If you have an existing uncompressed table that you want to compress, you can issue the designated FERRET commands to compress that table.
The FERRET syntax used to compress or uncompress a table is:

COMPRESS full-table-name
UNCOMPRESS full-table-name

where full-table-name takes the form databasename.tablename. For example:

COMPRESS TPCD.LineItem

System Growth

All of these compression mechanisms can work together to provide choices for a variety of situations. They range from compressing a single column of data to powerful, across-the-board compression. Additionally, all of these compression mechanisms are functionally transparent to SQL and applications.
By carefully selecting and using these techniques, you can continue to grow your system while maximizing storage space utilization and optimizing performance.
Steve Long is the product manager for
Teradata 13.10 and has been involved in
data warehousing and related technology
for 15 years.
