        Table "enterprisedb.update_target"
      Column       |          Type          | Modifiers
-------------------+------------------------+-----------
 column_id         | integer                | not null
 columne_to_update | character varying(100) |
Indexes:
    "update_target_pkey" PRIMARY KEY, btree (column_id)
        Table "enterprisedb.update_source"
      Column       |          Type          | Modifiers
-------------------+------------------------+-----------
 column_id         | integer                | not null
 source_for_update | character varying(100) |
Indexes:
    "update_source_pkey" PRIMARY KEY, btree (column_id)
 count_update_target | count_update_source
---------------------+---------------------
             1310720 |               10240
(1 row)
edb=# select * from update_source t1 where not exists (select 1 from update_target t2 where t2.column_id=t1.column_id);
 column_id |        source_for_update
-----------+----------------------------------
   2621436 | this is an new value for 2621437
   2621438 | this is an new value for 2621439
   2621440 | this is an new value for 2621441
select count(*) from update_target where columne_to_update is null;
 count
-------
     0
(1 row)
Now suppose I want to update all the rows in update_target that have a matching
ID (column_id) in update_source. For this simple table (where I have to update
only one column), the statement can be as simple as the one below.
update update_target t1 set columne_to_update =(select source_for_update from update_source t2 where t1.column_id=t2.column_id);
This could be a bit more complex, with multiple select statements (one for each
column to be updated).
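A minimal sketch of that multi-column form, assuming update_target also has a
column named column_2_for_update and update_source has one named column_2_source
(the names are borrowed from the statements later in this post; they do not
appear in the table definitions shown above). The where exists clause is an
addition in this sketch, not part of the statement above: a scalar subquery that
finds no matching row yields NULL, so without it every row in update_target that
has no match in update_source would have its columns set to NULL.

```sql
-- Sketch only: column_2_for_update and column_2_source are assumed to exist.
update update_target t1
   set columne_to_update   = (select t2.source_for_update
                                from update_source t2
                               where t2.column_id = t1.column_id),
       column_2_for_update = (select t2.column_2_source
                                from update_source t2
                               where t2.column_id = t1.column_id)
 -- Restrict the update to rows that actually have a match in update_source.
 where exists (select 1
                 from update_source t2
                where t2.column_id = t1.column_id);
```

Each extra column adds one more correlated subquery against update_source.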
Now let's see what the plan looks like for this simple case:
edb=# explain analyze update update_target t1 set columne_to_update =(select source_for_update from update_source t2 where t1.column_id=t2.column_id);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=0.00..4392654.13 rows=1341439 width=14) (actual time=18187.897..18187.897 rows=0 loops=1)
column_2_for_update=t2.column_2_source
from update_source t2 where t1.column_id=t2.column_id;
or
update update_target t1
set (columne_to_update,column_2_for_update)=(t2.source_for_update,t2.column_2_source)
from update_source t2 where t1.column_id=t2.column_id;
The plan for the option above is shown below.
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=14688.35..14950.99 rows=10240 width=50) (actual time=1095.327..1095.327 rows=0 loops=1)
   ->  Hash Join  (cost=14688.35..14950.99 rows=10240 width=50) (actual time=945.722..973.844 rows=10237 loops=1)
         Hash Cond: (t2.column_id = t1.column_id)
         ->  Seq Scan on update_source t2  (cost=0.00..172.02 rows=10240 width=44) (actual time=0.012..7.936 rows=10240 loops=1)
         ->  Hash  (cost=11227.12..11227.12 rows=1331239 width=10) (actual time=944.189..944.189 rows=1310720 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 53760kB
               ->  Seq Scan on update_target t1  (cost=0.00..11227.12 rows=1331239 width=10) (actual time=0.009..490.419 rows=1310720 loops=1)
 Total runtime: 1111.271 ms
(8 rows)
Let's see how much overhead the multiple columns cause. We test it by generating
a plan for an update of only one column.
edb=# explain analyze update update_target t1 set columne_to_update=t2.source_for_update from update_source t2 where t1.column_id=t2.column_id;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=15719.74..15982.38 rows=10240 width=46) (actual time=1124.990..1124.990 rows=0 loops=1)
   ->  Hash Join  (cost=15719.74..15982.38 rows=10240 width=46) (actual time=977.541..1005.278 rows=10237 loops=1)
         Hash Cond: (t2.column_id = t1.column_id)
         ->  Seq Scan on update_source t2  (cost=0.00..172.02 rows=10240 width=40) (actual time=0.236..7.509 rows=10240 loops=1)
         ->  Hash  (cost=12015.47..12015.47 rows=1424717 width=10) (actual time=976.187..976.187 rows=1310720 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 53760kB
               ->  Seq Scan on update_target t1  (cost=0.00..12015.47 rows=1424717 width=10) (actual time=0.012..524.587 rows=1310720 loops=1)
 Total runtime: 1141.219 ms
(8 rows)
Surprisingly, there is hardly any difference in response time (about 1111 ms for
two columns versus 1141 ms for one), and the estimated cost has not suffered
much either.
Some RDBMS implementations have a dedicated statement (MERGE) for achieving
UPSERT (update or insert based on some criteria). Through it they support
something similar to what I have explained here, i.e. updating values from
another table, or the same one, without using a subquery.
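For comparison, here is a rough sketch of what such a MERGE statement looks like
in standard SQL, written against the two tables from this post. It was not run
as part of the tests above, and the exact syntax varies between products, so
treat it as illustrative only.

```sql
-- Illustrative only: standard-SQL style MERGE, not part of the timed tests.
merge into update_target t1
using update_source t2
   on (t1.column_id = t2.column_id)
 when matched then
      -- A row with this column_id already exists: overwrite its value.
      update set columne_to_update = t2.source_for_update
 when not matched then
      -- No row with this column_id yet: insert the source row.
      insert (column_id, columne_to_update)
      values (t2.column_id, t2.source_for_update);
```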
Now, I know people will talk and comment about the lack of UPSERT in PostgreSQL.
Well, I don't quite agree with that. Did you know that you can return details
from an UPDATE query, and that PostgreSQL v9.1 (and onwards) lets you capitalize
on that returned data to insert or delete records? That makes it a bit more
powerful than the MERGE query. Below is an example.
If you remember, there were three rows that did not exist in update_target but
were present in update_source (if not, just scroll up).
-- Using RETURNING to achieve an UPSERT/MERGE
edb=# with updated as (update update_target t1 set columne_to_update=t2.source_for_update
      from update_source t2 where t1.column_id=t2.column_id
      returning t1.column_id)
      insert into update_target
      select * from update_source t_src where t_src.column_id not in
      (select column_id from updated);
INSERT 0 3
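A variant of the same statement, as a sketch: it names the inserted columns
explicitly instead of relying on select * matching the column order of
update_target, and it uses not exists in place of not in. Both are style choices
assumed here rather than something the test above requires; column_id is a
primary key and therefore never NULL, so not in gives the same result on these
tables.

```sql
-- Sketch only: same update-then-insert idea, with explicit column lists.
with updated as (
    -- Step 1: update the rows that match, and hand back their ids.
    update update_target t1
       set columne_to_update = t2.source_for_update
      from update_source t2
     where t1.column_id = t2.column_id
 returning t1.column_id
)
-- Step 2: insert the source rows whose ids were not touched by the update.
insert into update_target (column_id, columne_to_update)
select t_src.column_id, t_src.source_for_update
  from update_source t_src
 where not exists (select 1
                     from updated u
                    where u.column_id = t_src.column_id);
```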