You are on page 1of 2

akeaway: Removing duplicate records from tables is sometimes an arduous task, especially if the source table does not

contain a primary key field. Heres how you can take advantage of a new SQL Server 2005 feature to remove duplicates quickly and efficiently.

Duplicate Records
Duplicate records can occur numerous ways, such as loading source files too many times, keying the same data more than once, or from just bad database coding. Having a primary key on your table (and you always should have one) can will in the removal of the duplicate records, but even w/ a primary key it is never a fun task to have handed to you to complete. In todays example I will demonstrate how you can use a common-table expression (CTE) in 2005 to easily remove duplicate entries from a table, even when the table does not contain a primary key field. The script below creates the SalesHistory table, and loads 10 records into the table. Two of these records are duplicate entries, which I am defining as having the same values for the Product, SaleDate, and SalePrice fields. In the real world, these may not be duplicate records at all, but for our example well assume that they are.
IF OBJECT_ID('SalesHistory') IS NOT NULL DROP TABLE SalesHistory CREATE TABLE [dbo].[SalesHistory] ( [Product] [varchar](10) NULL, [SaleDate] [datetime] NULL, [SalePrice] [money] NULL ) GO INSERT INTO SalesHistory(Product, SaleDate, SalePrice) SELECT 'Computer','1919-03-18 00:00:00.000',1008.00 UNION ALL SELECT 'BigScreen','1927-03-18 00:00:00.000',91.00 UNION ALL SELECT 'PoolTable','1927-04-01 00:00:00.000',139.00 UNION ALL SELECT 'Computer','1919-03-18 00:00:00.000',1008.00 UNION ALL SELECT 'BigScreen','1927-03-25 00:00:00.000',92.00 UNION ALL SELECT 'PoolTable','1927-03-25 00:00:00.000',108.00 UNION ALL SELECT 'Computer','1919-04-01 00:00:00.000',150.00 UNION ALL SELECT 'BigScreen','1927-04-01 00:00:00.000', 123.00 UNION ALL SELECT 'PoolTable','1927-04-01 00:00:00.000', 139.00 UNION ALL SELECT 'Computer','1919-04-08 00:00:00.000', 168.00

Now that I have some duplicate records loaded up, I can start writing my routine to remove them. The trick with removing duplicates is to find a way to delete them from your source

table, rather than grouping all of the records in the source table together and creating a new table w/ the new entries. Deleting from the source usually requires a way to uniquely identify one of the unique records and then remove it. With some crafty TSQL, this is a relatively easy task to do when a primary key defined on the table. Luckily, the new CTE feature in SQL Server 2005 makes it very easy to remove these duplicates, with or without a primary key. The script below defines my CTE. I am using a windowing function named DENSE_RANK to group the records together based on the Product, SaleDate, and SalePrice fields, and assign them a sequential value randomly. This means that if I have two records with the exact same Product, SaleDate, and SalePrice values, the first record will be ranked as 1, the second as 2, and so on.
;WITH SalesCTE(Product, SaleDate, SalePrice, Ranking) AS ( SELECT Product, SaleDate, SalePrice, Ranking = DENSE_RANK() OVER(PARTITION BY Product, SaleDate, SalePrice ORDER BY NEWID() ASC) FROM SalesHistory ) DELETE FROM SalesCTE WHERE Ranking > 1

Because a CTE acts as a virtual table, I am able to process data modification statements against it, and the underlying table will be affected. In this case, I am removing any record from the SalesCTE that is ranked higher than 1. This will remove all of my duplicate records. To verify my duplicates have been removed, I can review the data in the table, which should now contain 8 records, rather than the previous 10.
SELECT * FROM SalesHistory

Be Creative
Common-table expressions are a very useful new feature in SQL Server 2005. You can use them for recursive queries, removing duplicates, and even simple looping procedures. Explore this neat new feature to find some creative ways to solve some of your everyday database problems.

You might also like