Σ KEYSUM

© 1997 Data Management & Warehousing

In addition to this the result can incorporate the length of the original string which improves the uniqueness of lower order algorithm results considerably. HOW DOES THIS HELP? The table key is now an integer, the optimal format on which to index. The user now calls a function to convert the required string into the checksum and uses the index to look up the appropriate row. On very large tables this is considerably faster than conventional string look-up. Furthermore the data can be validated, as, if the current checksum differs from the stored checksum then the data has changed. This also works when reloading data, as any existing data will still be able to reference the old key. It should also be noted that when a field within the key is altered the key also needs to be re-generated. If this technique is used in contexts such as trend analysis within a Data Warehouse it is also possible that the occasional mis-match because of a duplicate checksum will not be statistically significant and therefore the key can be considered unique. WHAT ARE THE ISSUES? No checksum is guarantied to be unique. It is therefore possible that two different records can return the same value. If the length is included in the checksum it is still not guarantied but it further reduces the risk. When choosing a checksum algorithm it is important to consider the amount of records for which the checksum will provide a key. If you have a table with 500,000 rows (such as a table that contains addresses) then CRC32 will have an 8500:1 chance of duplicates without considering the length of the original string.

INTRODUCTION Keysum is a new and interesting technique (not a product) in the generation of keys within a database. It has particular application within Data Warehouses where keys are often made up of de-normalised alphanumeric data. THE PROBLEMS Data that has been de-normalised often has a primary key that is made up of a single string, a series of concatenated strings, or other data types that can be converted to strings. The key is traditionally costly in terms of storage requirements and access speed when used in an index. It is, however, vital to the usability of the data. The second issue is that in a data warehousing environment data may be loaded and assigned an arbitrary unique number as a key. If the data needs to be re-loaded at a later date, possibly with additions, then it is impossible to guarantee that the same arbitrary key will be assigned to the same row. THE SOLUTION The solution is simplicity itself. The generated key of the row should be the checksum of the string that makes up the unique key. This will, depending on the checksum algorithm chosen, generate a large integer that will be nearly unique within the scope of the data. For example using the industry standard CRC32 algorithm will generate a number in the range 0 to 4294967296, whilst using the Message Digest algorithm MD5 will generate a number between 0 and 3.4 * 1038.

MD5 on the other has the remote 6.8*1032:1 chance of generating a duplicate checksum. This is because it uses 128 bits rather than CRC32 which uses only 32 bits. When implementing the algorithm it is important to note that checksums normally return unsigned integers as their result. Your database and routines that access the checksum must all be able to handle the size of the result and ensure that they deal with the issue of signed versus unsigned variables. IS THIS FEATURE AVAILABLE NOW? There is no direct implementation of a checksum within the SQL Dialects of the major vendors currently available, however it can be implemented via an external procedure call. The author has implemented this technique within an Oracle7™ database. A daemon was created that took as its input the string and returned two values, the checksum and the length. This was connected to the database via a ‘Database Pipe’. When a checksum was required a PL/SQL stored procedure was called that placed the string into the database pipe and received the two values, the checksum and the length, back. The daemon was also implemented as a shared library so that it could be accessed from the command line and from other utilities that could call a shared ‘C’ library.

An optional parameter was included to allow the use of different algorithms in different contexts. For example where only a small data set needs a checksum key then CRC32 may be suitable, whilst MD5 is used only for the largest data sets. WHERE
DO ALGORITHM?

I

GET

A

CHECKSUM

The inevitably answer to this question is ‘From the Internet’. Any site that distributes the source for FreeBSD includes an implementation of CRC32. MD5 is also widely available. THE FUTURE DIRECTION The author hope that in the future that Database vendors such as Oracle will add the checksum function to their SQL dialects. Once available as a inbuilt function the need to implement checksums via external procedure calls will disappear and performance will be improved even more. It will also allow some standardisation is the choice and handling of the checksum algorithms. Data Management & Warehousing is the trading name of David M Walker, a
freelance Data Warehousing consultant.
Address: 138, Finchampstead Road, Wokingham, Berkshire, RG41 2NU, United Kingdom. WWW: http://www.datamgmt.com Telephone: +44 (0) 7050 028 911 Fax: +44 (0) 7050 028 912 Copyright © 1997 All rights reserved. All Copyrights and Trademarks respected
MD5 Copyright © 1991-2, RSA Data Security, Inc. Oracle7™ is a trademark of Oracle Corporation

WHAT IS THE MD5 MESSAGE-DIGEST ALGORITHM? MD5 is a message-digest algorithm. The algorithm takes as input a message of arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given pre-specified target message digest. The MD5 algorithm is designed to be quite fast on 32-bit machines. In addition, the MD5 algorithm does not require any large substitution tables; the algorithm can be coded quite compactly.
Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All rights reserved.