You are on page 1of 6

Table structure to be used for the examples

Table "enterprisedb.update_target"
Column
|
Type
| Modifiers
-------------------+------------------------+----------column_id
| integer
| not null
columne_to_update | character varying(100) |
Indexes:
"update_target_pkey" PRIMARY KEY, btree (column_id)
Table "enterprisedb.update_source"
Column
|
Type
| Modifiers
-------------------+------------------------+----------column_id
| integer
| not null
source_for_update | character varying(100) |
Indexes:
"update_source_pkey" PRIMARY KEY, btree (column_id)
count_update_target | count_update_source
---------------------+--------------------1310720 |
10240
(1 row)
edb=# select * from update_source t1 where not exists (select 1 from update_targ
et t2 where t2.column_id=t1.column_id);
column_id |
source_for_update
-----------+---------------------------------2621436 | this is an new value for 2621437
2621438 | this is an new value for 2621439
2621440 | this is an new value for 2621441
select count(*) from update_target where columne_to_update is null;
count
------0
(1 row)
Now suppose I want to update all those rows, in update_target which has a matchi
ng ID (column_id) in update_source. For this simple table (where I have to updat
e only one column) the statement may be as simple as below.
update update_target t1 set columne_to_update =(select source_for_update from up
date_source t2 where t1.column_id=t2.column_id);
This could be a bit more complex, with multiple select statements (one for each
column to be updated).
Now let's see what the plan looks like for this simple case:
edb=# explain analyze update update_target t1 set columne_to_update =(select sou
rce_for_update from update_source t2 where t1.column_id=t2.column_id);
QUERY PL
AN
--------------------------------------------------------------------------------------------------------------------------------------------------------Update on update_target t1 (cost=0.00..4392654.13 rows=1341439 width=14) (actu
al time=18187.897..18187.897 rows=0 loops=1)

-> Seq Scan on update_target t1 (cost=0.00..4392654.13 rows=1341439 width=1


4) (actual time=0.058..5259.089 rows=1310720 loops=1)
SubPlan 1
-> Index Scan using update_source_pkey on update_source t2 (cost=0.
00..3.27 rows=1 width=30) (actual time=0.002..0.002 rows=0 loops=1310720)
Index Cond: (t1.column_id = column_id)
Total runtime: 18188.005 ms
Keep in mind that the complexity of the plan would had been double if there were
two columns to be updated.
So let me add an additional column here:
edb=# alter table update_target add column column_2_for_update integer;
ALTER TABLE
edb=# alter table update_source add column column_2_source integer;
ALTER TABLE
edb=# update update_source set column_2_source=column_id;
UPDATE 10240
Do you notice an additional Subplan in the plan below?
edb=# explain analyze update update_target t1
set columne_to_update =(select source_for_update from update_source t2
where t1.column_id=t2.column_id),
column_2_for_update=(select column_2_source from update_source t2
where t1.column_id=t2.column_id);
QUERY PL
AN
--------------------------------------------------------------------------------------------------------------------------------------------------------Update on update_target t1 (cost=0.00..8573070.33 rows=1310720 width=10) (actu
al time=22256.890..22256.890 rows=0 loops=1)
-> Seq Scan on update_target t1 (cost=0.00..8573070.33 rows=1310720 width=1
0) (actual time=0.096..9104.759 rows=1310720 loops=1)
SubPlan 1
-> Index Scan using update_source_pkey on update_source t2 (cost=0.
00..3.27 rows=1 width=30) (actual time=0.002..0.002 rows=0 loops=1310720)
Index Cond: (t1.column_id = column_id)
SubPlan 2
-> Index Scan using update_source_pkey on update_source t2 (cost=0.
00..3.27 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1310720)
Index Cond: (t1.column_id = column_id)
Total runtime: 22256.999 ms
(9 rows)
That's a difference of 22%.
So let's fire the query and see it's effect.
edb=# update update_target t1
set columne_to_update =(select source_for_update from update_source t2
where t1.column_id=t2.column_id),
column_2_for_update=(select column_2_source from update_source t
where t1.column_id=t2.column_id);
But what will happen, if a row in update_target is not able to find a match in u
pdate_source? Well that will be set to null. Let's confirm

edb# select count(*) from update_target where columne_to_update is null;


count
--------1300483
(1 row)
Actually the above UPDATE statement is supposed to update everything at update_t
arget by fetching data from update_source. If something is not found then it is
unknown i.e. NULL. So we need to restrict the rows which needs to be updated, wi
th a filter which says "update only those rows for which a match exists in updat
e_source". So the query will look something like this:
update update_target t1
set columne_to_update =(select source_for_update from update_source t2
where t1.column_id=t2.column_id),
column_2_for_update=(select column_2_source from update_source t2
where t1.column_id=t2.column_id)
where exists (select 1 from update_source t3 where t3.column_id=t1.column_id);
Let's see how this one performs.
edb=# explain analyze update update_target t1
set columne_to_update =(select source_for_update from update_source t2
where t1.column_id=t2.column_id),
column_2_for_update=(select column_2_source from update_source t2
where t1.column_id=t2.column_id)
where exists (select 1 from update_source t3 where t3.column_id=t1.colum
n_id);
QUERY PLA
N
------------------------------------------------------------------------------------------------------------------------------------------------------Update on update_target t1 (cost=14461.94..81679.34 rows=10240 width=16) (actu
al time=1454.999..1454.999 rows=0 loops=1)
-> Hash Join (cost=14461.94..81679.34 rows=10240 width=16) (actual time=117
9.335..1331.664 rows=10237 loops=1)
Hash Cond: (t3.column_id = t1.column_id)
-> Seq Scan on update_source t3 (cost=0.00..172.02 rows=10240 width=1
0) (actual time=0.013..7.162 rows=10240 loops=1)
-> Hash (cost=11054.07..11054.07 rows=1310720 width=10) (actual time=
1178.699..1178.699 rows=1310720 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 53760kB
-> Seq Scan on update_target t1 (cost=0.00..11054.07 rows=13107
20 width=10) (actual time=0.011..710.234 rows=1310720 loops=1)
SubPlan 1
-> Index Scan using update_source_pkey on update_source t2 (cost=0.
00..3.27 rows=1 width=30) (actual time=0.004..0.004 rows=1 loops=10237)
Index Cond: (t1.column_id = column_id)
SubPlan 2
-> Index Scan using update_source_pkey on update_source t2 (cost=0.
00..3.27 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=10237)
Index Cond: (t1.column_id = column_id)
Total runtime: 1468.033 ms
(14 rows)
The plan is a bit simpler if we had to update just one column.

edb=# explain analyze update update_target t1


set columne_to_update =(select source_for_update from update_source t2
where t1.column_id=t2.column_id)
where exists (select 1 from update_source t3 where t3.column_id=t1.colum
n_id);
QUERY PLA
N
-----------------------------------------------------------------------------------------------------------------------------------------------------Update on update_target t1 (cost=20556.94..54328.97 rows=10240 width=20) (actu
al time=1222.157..1222.157 rows=0 loops=1)
-> Hash Join (cost=20556.94..54328.97 rows=10240 width=20) (actual time=100
2.328..1105.557 rows=10237 loops=1)
Hash Cond: (t3.column_id = t1.column_id)
-> Seq Scan on update_source t3 (cost=0.00..172.02 rows=10240 width=1
0) (actual time=0.013..7.244 rows=10240 loops=1)
-> Hash (cost=17149.07..17149.07 rows=1310720 width=14) (actual time=
1001.109..1001.109 rows=1310720 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 53800kB
-> Seq Scan on update_target t1 (cost=0.00..17149.07 rows=13107
20 width=14) (actual time=0.011..521.482 rows=1310720 loops=1)
SubPlan 1
-> Index Scan using update_source_pkey on update_source t2 (cost=0.
00..3.27 rows=1 width=30) (actual time=0.004..0.004 rows=1 loops=10237
)
Index Cond: (t1.column_id = column_id)
Total runtime: 1237.864 ms
(11 rows)
The difference in response time is of around 12.5%.
Let's check if this update goes as we expect.
edb=# update update_target t1
set columne_to_update =(select source_for_update from update_source t2
where t1.column_id=t2.column_id),
column_2_for_update=(select column_2_source from update_source t2
where t1.column_id=t2.column_id)
where exists (select 1 from update_source t3 where t3.column_id=t1.colum
n_id);
UPDATE 10237
Let's check now if any row has got NULL values?
edb=# select count(*) from update_target where columne_to_update is null;
count
------0
(1 row)
Well a much more better and cleaner code is as below:
update update_target t1
set columne_to_update=t2.source_for_update ,

t1.column_2_for_update=t2.column_2_source
from update_source t2 where t1.column_id=t2.column_id;
or
update update_target t1
set (columne_to_update,t1.column_2_for_update)=(t2.source_for_update ,t2
.column_2_source)
from update_source t2 where t1.column_id=t2.column_id;
The plan for above option is as below.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------Update on update_target t1 (cost=14688.35..14950.99 rows=10240 width=50) (actu
al time=1095.327..1095.327 rows=0 loops=1)
-> Hash Join (cost=14688.35..14950.99 rows=10240 width=50) (actual time=945
.722..973.844 rows=10237 loops=1)
Hash Cond: (t2.column_id = t1.column_id)
-> Seq Scan on update_source t2 (cost=0.00..172.02 rows=10240 width=4
4) (actual time=0.012..7.936 rows=10240 loops=1)
-> Hash (cost=11227.12..11227.12 rows=1331239 width=10) (actual time=
944.189..944.189 rows=1310720 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 53760kB
-> Seq Scan on update_target t1 (cost=0.00..11227.12 rows=13312
39 width=10) (actual time=0.009..490.419 rows=1310720 loops=1)
Total runtime: 1111.271 ms
(8 rows)
Let's see how much of an overhead is caused cause of multiple columns. We test i
t by generating a plan for update of only one column.
edb=# explain analyze update update_target t1 set columne_to_update=t2.source_fo
r_update from update_source t2 where t1.column_id=t2.column_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------Update on update_target t1 (cost=15719.74..15982.38 rows=10240 width=46) (actu
al time=1124.990..1124.990 rows=0 loops=1)
-> Hash Join (cost=15719.74..15982.38 rows=10240 width=46) (actual time=977
.541..1005.278 rows=10237 loops=1)
Hash Cond: (t2.column_id = t1.column_id)
-> Seq Scan on update_source t2 (cost=0.00..172.02 rows=10240 width=4
0) (actual time=0.236..7.509 rows=10240 loops=1)
-> Hash (cost=12015.47..12015.47 rows=1424717 width=10) (actual time=
976.187..976.187 rows=1310720 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 53760kB
-> Seq Scan on update_target t1 (cost=0.00..12015.47 rows=14247
17 width=10) (actual time=0.012..524.587 rows=1310720 loops=1)
Total runtime: 1141.219 ms
(8 rows)
Surprisingly there is hardly any difference in response time and cost as well ha
s not sufferred much.

There are some RDBMS implementations which has a dedicated statement (MERGE) for
achieving UPSERT (Update/Insert based on some criteria). Via that they support
something similar to what I have explained here, i.e. updating values from anoth
er/same table without using a subquery.
Now, I know there will be people talking and commenting about lack of UPSERT in
PostgreSQL. Well, I don't quite agree to that. Do you know you can return the de
tails from an UPDATE query and PostgreSQL v9.1 (and onwards) allows you to capit
alize that returned in data to insert/delete records. So, it makes it a bit more
powerful than the MERGE query. Below is an example.
If you remember there were three rows which did not exist at update_target but w
ere present in update_source (if not then just scroll up)
-- Using RETURNING to achieve an UPSERT/MERGE
edb=# with updated as (update update_target t1 set columne_to_update=t2.source_f
or_update from update_source t2 where t1.column_id=t2.column_id
returning t1.column_id)
insert into update_target
select * from update_source t_src where t_src.column_id not in
(select column_id from updated);
INSERT 0 3

You might also like