
Delete Duplicate Rows

-- MySQL
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER(
        PARTITION BY firstname, lastname, email
        ORDER BY age DESC
    ) AS rownum
    FROM contacts
)
DELETE contacts
FROM contacts
JOIN duplicates USING(id)
WHERE duplicates.rownum > 1;

-- PostgreSQL
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER(
        PARTITION BY firstname, lastname, email
        ORDER BY age DESC
    ) AS rownum
    FROM contacts
)
DELETE FROM contacts
USING duplicates
WHERE contacts.id = duplicates.id AND duplicates.rownum > 1;

After some time, most applications accumulate duplicated rows, resulting in a bad user
experience, higher storage requirements and reduced database performance. The cleaning
process is usually implemented in application code with complex chunking behavior because
the data does not fit into memory entirely. With a Common Table Expression (CTE), the
duplicate rows can be identified and ranked by which ones are most important to keep. A
single delete query can then remove all duplicates except a specific number of rows to keep
per group. The formerly complex logic is done by one simple SQL query.
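Before running the delete, the same CTE can be reused with a SELECT to preview which
rows would be removed. This is only a sketch using the columns from the example above; it
works on both MySQL and PostgreSQL:

-- Preview the duplicates that the DELETE would remove
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER(
        PARTITION BY firstname, lastname, email
        ORDER BY age DESC
    ) AS rownum
    FROM contacts
)
SELECT contacts.*
FROM contacts
JOIN duplicates USING(id)
WHERE duplicates.rownum > 1;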

Notice: I have written a more extensive text about this topic on my database-focused
website SqlForDevs.com: Delete Duplicate Rows

Table Maintenance After Bulk Modifications

-- MySQL
ANALYZE TABLE users;

-- PostgreSQL
ANALYZE (SKIP_LOCKED) users;

The database needs up-to-date statistics about your tables, like the approximate number of
rows, the data distribution of values and more, to calculate the most efficient way to execute
your query. Unlike indexes, which are automatically updated whenever a row affecting their
data is created, updated or deleted, the statistics are not refreshed on every change. A
recalculation is only triggered when a threshold of changes to a table is crossed.

Whenever you change a big part of a table, the number of affected rows may still be below
the statistics recalculation threshold but significant enough to make the statistics incorrect.
Some queries may become very slow because the database picks a query plan based on the
now-incorrect information about the table. Therefore, you should analyze a table after every
significant change to trigger the statistics recalculation and ensure fast queries.
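
For example, a bulk cleanup job could trigger the recalculation right after the modification
finishes. The users table comes from the snippet above, while the retention condition on
last_login is hypothetical:

-- PostgreSQL: refresh the statistics right after a bulk deletion
DELETE FROM users WHERE last_login < now() - interval '5 years';
ANALYZE (SKIP_LOCKED) users;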

Querying Data

Most of the SQL queries you write and execute will be the ones reading data from the
database. They are the cornerstone of your application because not showing any data would
make it useless. But reading data is also the best opportunity to remove a lot of application
boilerplate by using more advanced querying approaches. In many use cases, those
approaches also improve performance because you process the data where it lives instead of
transferring it all to your application.

The querying chapter will show you exceptional features like for-each loops within SQL,
some null-handling tricks, pagination mistakes you probably make and many more. You should
read the tip about data refinement with common table expressions very closely; once you
understand it, you will use it very often. Trust me.

Reduce The Number Of Group By Columns

SELECT actors.firstname, actors.lastname, COUNT(*) AS count
FROM actors
JOIN actors_movies USING(actor_id)
GROUP BY actors.id;

You probably learned long ago that when grouping on some columns, you must add all
non-aggregated SELECT columns to the GROUP BY clause. However, when you group on a
primary key, all other columns of the same table can be omitted because they are functionally
dependent on it; the database fills them in for you automatically. Your query will be shorter
and therefore easier to read and understand.
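
For comparison, this is the equivalent verbose form that the shortcut saves you from
writing, using the same schema as above:

-- Equivalent query with every selected column repeated in GROUP BY
SELECT actors.firstname, actors.lastname, COUNT(*) AS count
FROM actors
JOIN actors_movies USING(actor_id)
GROUP BY actors.id, actors.firstname, actors.lastname;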
