
Compiled: Rajorshi Sen

Date: Apr 8, 2023

12. Performance Troubleshooting:


a. Disk spills - (see the sketch below)
1. EXPLAIN ANALYZE — "HashAgg Batches: <no>" (batches > 1 means the hash spilled to disk)
2. pg_stat_statements - temp_blks_read, temp_blks_written
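A minimal sketch of both checks (table and column names are hypothetical; pg_stat_statements must already be installed):
-- Spill indicator in the plan: "Batches: <n>" > 1 on a HashAggregate node (PG 13+ output)
EXPLAIN (ANALYZE) SELECT col, count(*) FROM big_table GROUP BY col;
-- Top temp-block consumers recorded by pg_stat_statements:
SELECT query, temp_blks_read, temp_blks_written
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 5;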
b. Logging SQL plans that run longer than 500 milliseconds: (a) session_preload_libraries, (b)
LOAD 'auto_explain', (c) SET auto_explain.log_analyze TO on and (d) SET
auto_explain.log_min_duration TO 500; (runnable form below)
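The same settings as a runnable session-level sequence (auto_explain ships with PostgreSQL; the plans are emitted to the server log):
LOAD 'auto_explain';
SET auto_explain.log_analyze TO on;
SET auto_explain.log_min_duration TO 500;  -- milliseconds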
c. Add 'pg_stat_statements' to shared_preload_libraries, track_activity_query_size = 4094,
pg_stat_statements.track = 'all' and pg_stat_statements.max = 1000.
SELECT pg_stat_statements_reset();
postgresql.conf: track_io_timing = on

d. pg_stat_statements — captures queries that are not logged (those not breaching the
log_min_duration_statement threshold).
e. Locks - pg_locks, pg_stat_activity
-- Who blocks whom (abbreviated join; the full version on the PostgreSQL wiki matches every lock column):
SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.pid != blocked_locks.pid
WHERE NOT blocked_locks.granted;
13. Autovacuum Tuning:
Transaction ID wraparound (txid is 4 bytes): 2 to the power 32 -> about 4.3 billion transactions.
1. Age of the vacuum process.
2. Age of datfrozenxid of each database.
3. Age of relfrozenxid of each table in a database (queries for 2 and 3 sketched below).
4. Run VACUUM VERBOSE - after setting maintenance_work_mem or autovacuum_work_mem.
5. For large tables: (i) decrease autovacuum_vacuum_cost_delay (ii) adjust
autovacuum_vacuum_scale_factor (decimal) to an extremely low value for large tables.
6. If autovacuum falls behind for all tables: (i) increase autovacuum_max_workers
(default is 3) (ii) increase autovacuum_vacuum_cost_limit proportionately, e.g. to 1500 (default is 200).
7. Log autovacuum details with log_autovacuum_min_duration.
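A minimal sketch of the age checks in steps 2-3 and the per-table override in step 5 (table name hypothetical):
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r' ORDER BY 2 DESC LIMIT 10;
ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);  -- step 5(ii)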
Architecture of Autovacuum worker processes:
- Started by the postmaster upon a request that the autovacuum launcher process leaves for the
postmaster in shared memory (the usual means of communication between processes are (a) shared
memory and (b) messages/signals).
- Upon completing the vacuum of a table, the autovacuum worker process terminates.
However, the autovacuum launcher will launch another one if there are more tables to be vacuumed.
14. Storage Details: Nutshell [ Page Layout - 5 entities , Row Layout - 4 entities — Page Layout ->
Page Header (8 entities), Row Layout -> Tuple Header (8 entities)]
Storage Page Layout -> [PIFIS]
1. Page Header: 24 bytes
2. Item Data: 4 bytes each (offset, length) - POINTERS to the items (the actual data)
3. Free Space
4. Item Space - the actual row data
5. Special Space - index and other access-method information for the actual data.
Page Header: 24 bytes [ lcf - lus - pp ]
1. pd_lsn - Log Sequence Number - next byte after the last WAL change (to this page)
2. pd_checksum - page checksum
3. pd_flags - flag bits
4. pd_lower - offset to start of free space
5. pd_upper - offset to end of free space
6. pd_special - offset to special space
7. pd_pagesize_version - page size and layout version number
8. pd_prune_xid - oldest unpruned XMAX on the page
[ UNPRUNED XMAX: a deleting XMAX is "unpruned" while it is still newer than the oldest transaction
that might need to see the row (it lies between the XMIN and XMAX of some transaction snapshot),
so the dead row version cannot be removed yet. pd_prune_xid records the oldest such XMAX and hints
when pruning the page may be worthwhile. Hence a fully pruned page has no row whose XMAX is newer
than any currently running transaction XID. ]
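Hedged aside: these header fields can be read with the pageinspect extension ('tbl' is a hypothetical table):
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM page_header(get_raw_page('tbl', 0));
-- returns lsn, checksum, flags, lower, upper, special, pagesize, version, prune_xid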
Row layout:
1. Tuple Header
2. Optional null bitmap
- present when there are nulls in the row
- present when the TUPLE HEADER's t_infomask HEAP_HASNULL bit is on.
- contains as many bits as the number of columns - bit 1 - column is not null, bit 0 - column is null.
3. Optional object id - set only if the HEAP_HASOID bit is set in t_infomask.
4. User data
Tuple Header: 23 bytes [ mmc - xc - iih ] mohendra mohan choudhury - indian institute of hardware
- xc
1. t_xmin — inserting transaction id
2. t_xmax — deleting/updating transaction id
3. t_cid — command id
4. t_xvac — transaction id for a VACUUM operation moving a row version
5. t_ctid — current tuple id (block number, then tuple index within the block)
6. t_infomask2 — number of attributes, plus flag bits
7. t_infomask — attribute flag bits
8. t_hoff — offset to user data
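Hedged aside: the tuple header fields above are visible via pageinspect's heap_page_items ('tbl' hypothetical; t_field3 holds t_cid or t_xvac):
SELECT lp, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff
FROM heap_page_items(get_raw_page('tbl', 0));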
15. Data Retrieval [ SELECT ]
- each attribute is traversed
- determined whether it's null from the optional null bitmap
- for variable-length columns, the length of the column is determined from struct varlena
- flags in struct varlena indicate:
(a) whether the value is compressed
(b) whether the column is TOASTed (stored out of line)
3rd June 2021
——————
Parallel processes
16. max_parallel_workers_per_gather (default = 2): parallel processes under one Gather node
max_worker_processes (default = 8): total background worker processes in the system
max_parallel_workers (default = 8): total parallel worker processes in the system
Gather / Gather Merge
"loops" in the execution plan -> number of parallel processes that ran the node
Operations:
Parallel Scan - Sequential, Index and Bitmap Heap
Parallel Join - Nested loop (only the outer table's [or driving table's] blocks are divided among the
parallel worker backend processes); the inner table is accessed through index lookups and that
access is not parallel.
Merge join - the inner table is not accessed in parallel
Hash Join - can be (i) hash join (ii) parallel hash join, shown with the 'Parallel' prefix (from
PostgreSQL 11 and above)
With the 'Parallel' prefix - the driving table (outer table) scan is parallelized, and
the driven table (inner table) scan is parallelized as well.
A single hash table is built from the whole driving table and is
shared by all the cooperating processes, which probe the built (hash) table after picking up
tuples from the driven table.
Without the 'Parallel' prefix - the driving table scan is not parallelized; each worker
process builds its own hash table from the driving table. The driven table scan is parallelized.
Parallel Aggregate: (i) Partial Aggregate node: each worker backend process aggregates its share
(ii) Finalize Aggregate node: the leader re-aggregates the partial results.
DISTINCT or ORDER BY inside an aggregate call prevents the planner from using Parallel
Aggregate. (Plan sketch below.)
Parallel Append: for UNION ALL / Partitioned Table Scan
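A quick hedged way to see these nodes in a plan (table name hypothetical):
SET max_parallel_workers_per_gather = 2;
EXPLAIN (ANALYZE) SELECT col, count(*) FROM big_table GROUP BY col;
-- Expect: Finalize Aggregate -> Gather (Merge) -> Partial Aggregate -> Parallel Seq Scan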
5th June 2021
————————
17. Architecture:
- Startup of the PostgreSQL server: the postmaster starts first, allocates shared memory, and then starts the background processes.
PROCESSES
===========
- Postmaster: starts the other BACKGROUND processes: CAAWWS L -> Checkpointer,
Autovacuum Launcher, Archiver, WAL Writer, DB Writer, Statistics Collector and the Logical
Replication Launcher.
- Postmaster: receives connection requests on the default port 5432 and starts a BACKEND process for
every client request.
[ PostgreSQL does not have a BACKEND process that can serve multiple clients, like the SHARED
SERVER process in Oracle.
Hence, in PostgreSQL, all BACKEND processes are akin to DEDICATED SERVER processes in
Oracle. ]
- Postmaster: Starts other BACKEND processes like (i) Parallel Worker Processes (ii) Autovacuum
worker processes.
MEMORY
===============
SHARED:
- shared_buffers: database pages containing data
- WAL buffers: buffer containing WAL entries
- CLOG buffers: transaction details containing the state of each transaction: (i) in progress (ii)
committed (iii) aborted (iv) sub-committed.
LOCAL:
- temp_buffers: holds data related to temporary tables.
- work_mem: holds data for GODUC operations (GROUP BY / ORDER BY / DISTINCT / UNION)
- maintenance_work_mem: holds temporary data for maintenance operations like statistics collection,
vacuum operations, reindex
ADDITIONAL SHARED:
- A - Access control Locks like light-weight locks, semaphores & shared/exclusive locks.
- B - Background process - Autovacuum , Checkpoint.
- T - Transaction related data like save-point and two-phase commit.
5th June 2021
——————-
18. Buffer Manager
1. Buffer Table: consists of
- (i) hash buckets, (ii) a hash function and (iii) hash slots in each bucket
- a hash slot holds a data entry: (i) the buffer_id from the buffer pool (ii) the hash of the Buffer
Tag (from the buffer descriptor)
2. Buffer Pool: consists of
(i) buffer pages
(ii) an array of buffer_ids acting as indexes to the buffer pages
and
3. Buffer Descriptor Layer: consists of a collection of buffer descriptors.
Each buffer descriptor is a structure of type BufferDesc which consists of elements: BBC FF UR
(i) Buffer Tag: a combination of 5 details that identify a storage page for PostgreSQL - (i) database
(ii) tablespace (iii) relation (iv) block_number (v) fork - data/freespace/visibility - 0, 1, 2
(ii) buffer_id: the index of the buffer slot in the buffer pool
(iii) content lock bit & io_in_progress lock bit: bits that indicate changing of content in the buffer pool
for the particular buffer slot, as well as IO-progress details.
(iv) refcount: increased by 1 while the block is being worked on; once done, it is
decremented.
(v) usage_count: increased by 1 each time the buffer is accessed.
(vi) flag bits: (a) valid (b) io_in_progress (c) dirty: set when relevant
(vii) freeNext: relevant for a buffer descriptor that is on the freelist.
Three states of Buffer descriptor:
1. Empty (usagecount = 0 , refcount = 0)
2. Pinned (usagecount > 0 , refcount > 0)
3. Unpinned (usagecount > 0 , refcount = 0 )
Page replacement algorithm:
- Clock Sweep: move circularly through the buffers —> pick up a buffer
—> skip it if it is pinned (refcount > 0)
—> if its usage_count > 0, decrease it by 1 and move on
—> repeat the steps until a buffer with (refcount = 0
and usage_count = 0) is found; that buffer is evicted.
- Buffer Manager Locks
1. Buffer Table lock - BufMappingLock - Exclusive: for changes to hash slots
Shared: for lookup/reading of hash slots
2. Buffer Descriptor locks:
content_lock: Shared: when the data in the corresponding page (tuple) needs to be looked
up.
Exclusive: when the data in the corresponding page (tuple) needs to be changed.
io_in_progress_lock: when the corresponding buffer page needs to be retrieved from storage.
spinlock: when values in the BufferDesc struct variables need to change, like the
dirty/valid/io_in_progress bits.

- Buffer Manager working to read a page: [ KEY: 1. The Buffer Table is updated first —then—> the buffer
pool is updated.
2. Locks on the buffer pool are obtained indirectly by
referencing bits and details in the buffer descriptor. ]
1. The page to be read is already in a buffer slot: create the Buffer Tag —> get shared BufMappingLock —>
find the buffer_id in the Buffer Table (traverse the hash bucket and get the hash slot) —> release
BufMappingLock —> read the buffer slot.
2. The page to be read is not in a buffer slot, but there is an empty slot to read the page into: (steps
of 1) —> obtain a free buffer from the freelist —> exclusive BufMappingLock (Buffer Table) —> load up
the hash slot -> load up the buffer pool (io_in_progress bit set) -> release BufMappingLock -> read the
buffer slot.
3. The page to be read is not in a buffer slot and there is no empty slot to read the page into:
(steps of 1) —> 'clock sweep' page replacement algorithm -> if the victim's dirty bit = 1 (—> WAL flush,
io_in_progress bit = 1, write the old page out) -> BufMappingLock (old & new) -> hash table update (both
old and new) -> load up the buffer pool -> release BufMappingLock -> read the buffer slot.
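Hedged aside: usage_count and dirty flags of the buffer pool can be observed with the pg_buffercache extension:
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT c.relname, b.usagecount, b.isdirty, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)  -- current database only
GROUP BY 1, 2, 3
ORDER BY buffers DESC
LIMIT 10;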
15th August 2021
=============
Aspects of WAL Data:
1. Logical/Physical Structure
2. Internal Structure
3. Writing Data / Write process
4. Managing Segments
Aspects of Database Recovery:
1. Checkpoint
2. Commit
3. Recovery (controlfile)
4. Archive WAL segments
Purpose of Checkpoint: (a) recovery preparation (b) shared_buffers dirty-page cleanup
a. Redo point
b. Restore rules.
1. Full Page Write rule: a full page image is written the first time a block is
touched after a checkpoint.
2. Full restore of a block: Rule: if the record to restore carries a full page (a backup block), it is
restored irrespective of the LSN.
If a record to be restored (non-backup record) has its LSN > the LSN of the page in
shared_buffers, then it is replayed.
c. Full Page Write & Backup Block:
XLOG entry (Ins/Del/Upd & commit) with LSN (unique location in the XLOG) -> WAL memory -> WAL
segment -> REDO point (recent checkpoint start / recovery start) -> the checkpointer
writes an XLOG record (holding the REDO point)
1. First insert into TabA (since the last checkpoint) -> shared_buffers TabA page entry (INSERT) ->
page/block pd_lsn changed (say lsn_0 to lsn_1) -> entire page into the WAL buffer (full page write) ->
COMMIT -> BACKEND process writes the WAL buffer to the segment
-> XLOG record COMMIT.
2. Second insert into TabA (no new checkpoint since the last one) -> shared_buffers TabA page
entry (INSERT) -> page/block pd_lsn changed (say lsn_1 to lsn_2) -> XLOG entry with header into the
WAL buffer -> COMMIT -> BACKEND process writes the WAL buffer to the segment.
XLOG recovery (normal): loads the page of TabA into shared_buffers -> compares LSN (shared_buffers
page) / LSN (xlog record) -> if LSN (xlog record) is greater than LSN (shared_buffers page), replays the
xlog record from its data portion.
XLOG recovery (FPW): loads the page of TabA into shared_buffers -> xlog page (backup block) -> no LSN
comparison -> the shared_buffers page is overwritten.
d. Checkpoint sequence: XLOG entry -> CLOG buffer -> dirty buffer cleanup -> control file updated
e. Checkpoint frequency depends on:
1. max_wal_size
2. checkpoint_timeout
3. checkpoint_segments (pre-9.5; superseded by max_wal_size)
4. checkpoint_completion_target
f. Logical & physical structure of the XLOG.
1. Addressed by 8 bytes - 2^64 = 16 exabytes of address space.
2. A single transaction log file that big is impractical -> broken down into WAL segments of 16 MB.
3. WAL contains:
i. timeline ID - 4 bytes (indicates the number of times the DB has been restored)
DB created by initdb - timeline 1
Used for PITR
ii. LSN - location in WAL - 64-bit integer - pointer XLogRecPtr - displayed as 2
hexadecimal numbers of up to 8 digits each, separated by a slash - pd_lsn - difference - pg_wal_lsn_diff().
iii. Naming - 24 hex digits, e.g. 00000001 00000000 000000FF: timeline ID, log number, segment
number; the segment-number field cycles 00000000–000000FF (256 segments of 16 MB per log number).
g. Internal layout:
XLOG -> many WAL segments (16 MB each) -> many pages (8 KB) (a header page, then normal pages).
Header page -> header -> XLogLongPageHeaderData struct (which contains the XLogPageHeaderData
struct,
the Database System Identifier [from the control file],
and unsigned integers)
+ data -> XLOG record entries
Subsequent pages -> header -> XLogPageHeaderData -> integers (1) magic, (2) info, (3) timeline,
(4) this page's xlog address & (5) xlp_rem_len, the length of an xlog record continued from the earlier page
-> data -> XLOG records.
XLOG record: header section -> struct XLogRecord - LRCX, LIP
[ xl_tot_len, xl_rmid, xl_crc, xl_xid, xl_len, xl_info, xl_prev ]
xl_rmid - resource manager
- THIRST -> transaction, heap only, index, replication, sequence and tablespace
T - transaction - commits ->
resource manager (xl_rmid): RM_XACT & information (xl_info): XLOG_XACT_COMMIT
DB recovery -> RM_XACT invokes xact_redo_commit() (per xl_info) -> replays the record
H - heap only -> RM_HEAP
(resource manager) & XLOG_HEAP_INSERT / XLOG_HEAP_DELETE / XLOG_HEAP_UPDATE
(information)
RM_HEAP replay functions: heap_xlog_insert(), heap_xlog_delete() and heap_xlog_update()

Data section -> BACKUP block / non-BACKUP block
BkpBlock: identifies a block in the DB - file / fork / block number / byte offset before the block start /
byte length of the block
xl_heap_insert: identifies a tuple in the DB - TableID -> file -> (tablespace/DB/tablename) ->
block ID / offset
h. PostgreSQL 9.5 - XLOG record structure modification
Backup block - 4 structures: XLogRecord
XLogRecordBlockHeader (includes
XLogRecordBlockImageHeader / XLogRecordBlockCompressHeader)
XLogRecordDataHeaderShort
xl_heap_insert
Data object: block data
Non-backup block: XLogRecord
XLogRecordBlockHeader
XLogRecordDataHeaderShort
xl_heap_insert
Data object: tuple (xl_heap_header and the inserted main data)
XLogRecordBlockHeader: (i) locates the data block (ii) length of the data portion
XLogRecordDataHeaderShort: length of the new structure xl_heap_insert
xl_heap_insert: (i) offset number of the tuple (ii) visibility flags
i. XLOG records - WAL buffer to WAL segments
ExtendCLOG() -> heap_insert() -> XLogInsert() -> finish_xact_command() -> XLogInsert() ->
XLogWrite() -> TransactionIdCommitTree()
XLOG writing happens on:
i. commit
ii. WAL buffer filled up
iii. WAL writer interval
iv. running transaction commit/abort
j. Database recovery in PostgreSQL
7 states of a PostgreSQL DB: [ startup / production / shutdowned, shutdowned-in-recovery /
shutdowning, recovery - crash / archive ]
a. DB_STARTUP
b. DB_SHUTDOWNED
c. DB_SHUTDOWNED_IN_RECOVERY
d. DB_SHUTDOWNING
e. DB_IN_CRASH_RECOVERY
f. DB_IN_ARCHIVE_RECOVERY
g. DB_IN_PRODUCTION
PostgreSQL -> control file still says 'in_production' at startup -> DB enters recovery mode -> checkpoint
record located in the XLOG (from the control file) -> REDO POINT (from the checkpoint record) -> XLOG
record header (xl_rmid & xl_info) -> RM (resource manager) performs the relevant operation
(insert/delete/update, ddl…) [by checking the LSN of the replayed XLOG page vs the shared_buffers
page] -> next XLOG record: pointer in the struct.
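Hedged aside: the checkpoint/REDO location that recovery reads from the control file is visible from SQL (PostgreSQL 9.6+):
SELECT checkpoint_lsn, redo_lsn, redo_wal_file, timeline_id
FROM pg_control_checkpoint();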
k. WAL segment management
WAL switch events:
1. 16 MB filled
2. pg_switch_wal() command (pg_switch_xlog() before version 10)
3. archive_mode = on and archive_timeout exceeded
4. PostgreSQL 9.5 -> max_wal_size reached - checkpoint triggers - REDO POINT advances -
old WAL segments become redundant - they can be renamed and reused.
Archive config parameters:
1. archive_mode
2. archive_command
3. archive_timeout
Continuous archiving -> copy out WAL segments whose data blocks/pages are already checkpointed.
PostgreSQL archive cleanup: pg_archivecleanup / find command - find … -mtime +3 -exec rm -f {} \;
22nd November 2021
=================
High Availability
———————
1. libpq is the client library that the various PostgreSQL clients use to pass queries to the PostgreSQL
backend server.
2. It provides a feature by which the connection string can specify multiple endpoints and an option for
the connection.
The options are:
a. either connect only to the primary (read-write)
b. or connect to any host, which includes the primary as well as the replicas:
target_session_attrs = read-write | any
Drawback: libpq cannot isolate the primary, so read-only sessions might also land on the primary.
Enhancement: in PostgreSQL 14, one can isolate read-only hosts as well as the primary:
a. read-only: connects to hosts that do not allow data modification
b. primary
c. standby
d. prefer-standby
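Hedged example of a multi-host libpq connection string using these options (host and database names hypothetical; multi-host syntax is PostgreSQL 10+):
psql "host=pg1,pg2,pg3 port=5432 dbname=app target_session_attrs=read-write"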

3. HAProxy is a proxy for:
(a) Layer 4 - TCP
(b) Layer 7 - HTTP
4. HAProxy provides two features:
(a) Check if the endpoints are available (connection routing)
(b) Balance out the connection load. (Load balancing)
5. HAProxy provides a pgsql-check module that checks the availability of the provided PostgreSQL
endpoints.
pgsql-check sends the StartupMsg that is normally triggered whenever a session needs to connect to a
PostgreSQL server.
But again, it cannot differentiate between the primary and the replicas.
6. HAProxy can be used with the xinetd service to differentiate between a PRIMARY and a REPLICA.
[ three components: shell script (checker), xinetd (service daemonizer), HAProxy (router) ]
A config sketch follows the steps below.
Steps:
a. Each PostgreSQL server will run the xinetd service with a shell script. xinetd is a service that
listens on a port, and any incoming traffic can trigger a configured action.
The action here is to return an HTTP status of 200 if it's a primary, 206 if it is a replica,
and 503 if the endpoint is not available.
xinetd itself will run on port 23267 on each PostgreSQL server.
b. HAProxy will check port 23267, which triggers the xinetd service to run the shell script,
which returns HTTP status 200 or 206 based on whether the PostgreSQL server is a PRIMARY
or a REPLICA.
c. HAProxy will bind TCP requests on port 5000 to the servers whose HTTP check returns
status code 200, and
will bind TCP requests on port 5001 to the servers whose HTTP check returns
status code 206.
d. A PostgreSQL connection request on port 5000 of the HAProxy domain name or IP will
connect the session to the PRIMARY, and
a connection request on port 5001 of the HAProxy domain name or IP will connect the
session to a REPLICA.
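A minimal hedged sketch of the HAProxy side of this setup (IPs and server names hypothetical; the xinetd checker on port 23267 is assumed from the steps above):
listen pg_primary
    mode tcp
    bind *:5000
    option httpchk
    http-check expect status 200   # 200 = primary, per the shell script
    server pg1 10.0.0.1:5432 check port 23267
    server pg2 10.0.0.2:5432 check port 23267
listen pg_replica
    mode tcp
    bind *:5001
    option httpchk
    http-check expect status 206   # 206 = replica
    server pg1 10.0.0.1:5432 check port 23267
    server pg2 10.0.0.2:5432 check port 23267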

5th December 2021 CONCURRENCY CONTROL
===============
1. PostgreSQL traditionally supported prevention of:
a. dirty reads
b. phantom reads
c. and allows for repeatable reads.
2. From version 9.1, PostgreSQL can prevent:
a. Write skew — happens when two transactions concurrently try to update a table after querying a
table that is updated by the other one.
b. Read-only transaction skew - when a transaction sees different data for the same query
because another transaction has updated the data in the background.
3. There are three types of conflicts:
a. Write-write — lost updates — fixed by first-updater-wins row locking under repeatable read
b. Write-read — non-repeatable reads — fixed by repeatable read isolation
c. Read-write — write skew — fixed by serializable transaction isolation
4. Transaction ID:
a. 2 to the power 32 ≈ 4.2 billion
b. Reserved transaction IDs: 0 invalid, 1 used at bootstrap, 2 for marking a row as frozen
c. All transaction_ids less than a particular value can be seen by the transaction, and all
transaction_ids greater than that value cannot be seen by that transaction.
d. Transaction ID wraparound restarts from 3 (because 2 is reserved for the frozen transaction ID).
Hence, 2 will be less than any active transaction_id, and rows with t_xmin = 2 are visible to all.

https://coupang-my.sharepoint.com/personal/rajorshi_coupang_com/Documents/backup_macbook_
2/Notes Cooked/Postgresql/Notes/ACID-Read Consistency.docx
- Freezing: vacuum marks the 'xmin' with a value that every transaction finds old,
which renders the row visible to all transactions.
e. Tuple header — pageinspect - select * from
heap_page_items(get_raw_page('<schema_name.object_name>', <page_number>));
Page header — pageinspect - select * from
page_header(get_raw_page('<schema_name.object_name>', <page_number>));
Optional null bitmap - present only when the HEAP_HASNULL bit of t_infomask is set;
it contains one bit per column: 1 bit -> the column is not null, 0 bit -> the column is null.
The number of attributes (and hence the number of bitmap bits) is stored in the low bits of
t_infomask2.
Optional object id - present when the HEAP_HASOID bit is set in t_infomask,
located just before the user data, which starts at t_hoff.
User data
ctid - current tuple id - a pair of block number and row (line pointer) number.
cid (t_field3 in the tuple header) - command id - the DML sequence number, within the transaction,
of the command that created this row.

f. DML statements (update/delete):
set t_xmax to a non-zero value.
VACUUM / VACUUM FULL remove dead row versions, i.e. rows whose non-zero t_xmax has committed.
5. Total space available:
a. pgstattuple (extension) — example: select * from pgstattuple('tbl');
b. pg_freespacemap (extension) — example: select * from pg_freespace('tbl');
SELECT *, round(100 * avail/8192, 2) AS "freespace ratio" FROM pg_freespace('tbl')
LIMIT 1;
c. The Free Space Map (FSM) holds the free-space information for a particular relation.
PostgreSQL checks the FSM to find the page and block where it should insert a row.
6. Commit log:
- CLOG buffer - four statuses - (i) committed (ii) aborted (iii) in_progress (iv) sub_committed.
- an array indexed by transaction_id (txid);
value = in_progress/committed/aborted
- max size of a CLOG segment file is 256 KB, each page = 8 KB
- the total size of the CLOG continuously increases with the number of transactions.
The CLOG of completed transactions is cleaned up by the VACUUM process along with the relations.
7. Transaction snapshot:
select txid_current_snapshot();
Format: xmin:xmax:xip_list
xmin - minimum txid: rows from txids older than this can be seen by the current transaction
xmax - maximum txid for visibility; txids >= xmax were started after the start of the current
transaction and are invisible.
xip_list: list of txids that are active (in progress) between xmin and xmax.
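A hedged illustrative reading (values invented):
SELECT txid_current_snapshot();   -- e.g. '100:104:100,102'
-- xmin = 100, xmax = 104; txids 100 and 102 are still in progress;
-- 101 and 103 are finished (check the CLOG); 104 and above are invisible.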
8. Possible isolation levels:
Read committed - each statement sees every change committed by other transactions before it started.
Repeatable read - no change committed by another transaction after the start of the current
transaction can be seen; it becomes visible only to transactions that start after that commit.
Serializable - attains repeatable reads and also prevents WRITE SKEW, which happens when two
transactions do DML after reading objects changed by the other transaction.
9. Visibility check:
Obtained with the aid of:
a. the current transaction_id
b. the min txid and max txid of the transaction snapshot
c. the status of the transaction obtained from the CLOG buffer.
There are three rule categories - aborted, committed and in_progress -
which together comprise 10 rules:
Aborted - rule 1
In_progress - rules 2 to 4
Committed - rules 5 to 10.
— basic way to check:
(a) get the current txid
(b) get the current txid snapshot
(c) find the status of the transaction in the CLOG.
Note: t_xmax = 0 (the row has not been deleted or updated) — the row is a candidate for visibility.
If clog_status(t_xmin) = 'in_progress', visibility is restricted: the row is visible only when
t_xmin = the current transaction's txid.
- Hint bits - e.g. the t_infomask bit HEAP_XMIN_COMMITTED being set
- avoid making calls to TransactionIdDidCommit and TransactionIdDidAbort,
which retrieve the status of transactions from the CLOG array
- whenever PostgreSQL reads/writes a tuple, the HINT bits are set to avoid making calls
to functions that need to access the CLOG (to avoid a bottleneck).
- HINT bits:
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */
- Status-check functions:
In progress: TransactionIdIsInProgress
Committed: TransactionIdDidCommit
Aborted: TransactionIdDidAbort
- Phantom reads — possible in 'read committed' transactions.
- Lost updates - prevented in Repeatable Read - FIRST-UPDATER WINS.
- Serializable Snapshot
Functions used for serialization locks:
1. CheckTargetForConflictsIn - creates the SIREAD lock (SerializableRead).
2. CheckTargetForConflictsOut - checks for an RW conflict when DML (ins/del/upd) runs in a
serializable transaction.
3. PreCommit_CheckForSerializationFailure - implements FIRST-COMMITTER WINS;
invoked during the COMMIT of a tx in serializable isolation mode;
anomalies result in only the first commit getting through, the rest are aborted.
Assumptions:
a. SERIALIZABLEXACT is omitted
b. the explanation of the mentioned functions becomes simpler without SERIALIZABLEXACT
c. transaction_id is used to explain SIREAD locks rather than virtual txids.
- SIREAD locks are issued inside a 'serializable' isolation transaction.
- An index can have a SIREAD lock - for 'index only scans'.
- SIREAD lock - rw-conflict:
Read - SIREAD lock: { <transaction_id>, {tuple_id, page_id, relation_id} }
Write conflict: { <read_transaction_id>, <write_transaction_id>, {tuple_id, page_id, relation_id} }
- FIRST COMMITTER WINS.
25th December 2021 REPLICATION
================
1. What is the sequence of events when a new Standby is configured and started?
Answer:
1. The Startup process starts in the standby — this process is responsible for reading the WAL
segments and applying the WAL entries to the STANDBY database.
2. Startup starts the WALReceiver.
3. The WALReceiver contacts the primary server over the network, which starts a WALSender in
the primary.
[completes a replication handshake, after the TCP handshake]
4. The WALReceiver posts heartbeat/ACK messages that contain the LSN replayed so far.
5. The WALSender sends the additional WAL entries covering (Primary_LSN - Standby_LSN).
6. Startup applies the delta WAL entries corresponding to the delta LSN and brings the PostgreSQL
cluster into SYNC.
2. Which version of PostgreSQL allows for replication slots?
Answer:
https://www.interdb.jp/pg/pgsql11.html
PostgreSQL 9.3: standby lagging -> WAL segment files recycled -> standby rebuild required. Pre-emptive
fix: a high value for wal_keep_segments.
PostgreSQL 9.4: replication slots:
Definition: a persistent record of the state of the replica, kept on the master server even while the
replica is offline and disconnected.
- It ensures that a WAL segment file is evaluated and sent to the standby for application of the
change vectors before it is marked eligible for recycling. (Sketch below.)
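A hedged sketch of creating and inspecting a physical replication slot (slot name hypothetical; PostgreSQL 9.4+):
SELECT pg_create_physical_replication_slot('standby1_slot');
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;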
3. What are the states that pg_stat_replication ‘state’ column will show about the standby in the
Primary?
Answer:
https://www.interdb.jp/pg/pgsql11.html
1. backup — when pg_basebackup is executing on the standby to get the backup set from the
primary.
2. catchup — when the replica is lagging behind the primary by some LSN (log sequence number);
the primary streams the missing WAL entries over the network connection to be applied on the replica.
3. startup — when the connection/handshake with the standby's WALReceiver is being established.
4. streaming — when the primary is steadily sending WAL binary entries to the standby over the
network connection.
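Hedged example of checking these states from the primary (column names per PostgreSQL 10+):
SELECT application_name, state, sync_state, sent_lsn, replay_lsn
FROM pg_stat_replication;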
4. What are the OS processes involved in Streaming replication?
Answer:
https://www.interdb.jp/pg/pgsql11.html
Primary:
1. Walsender
Standby:
2. Walreceiver
3. Startup
5. What are the different states at which a WALSENDER in the primary can be in?
Answer:
https://www.interdb.jp/pg/pgsql11.html
1. Streaming —> sending binary WAL change vectors to WALReceiver in Standby.
2. Start UP —> establishing the HANDSHAKE with the WALReceiver in Standby.
3. Back UP —-> when PG_BASEBACKUP is getting executed in the standby host to create the
dataset to build the Standby.
4. Catch UP —-> when the primary is sending the WAL entries that are needed for the standby to
have LSN current with Primary.
6. What are the details that an ACK message from Standby to Primary contains?
Answer:
https://www.interdb.jp/pg/pgsql11.html
It contains 4 pieces of information — three LSNs (locations in the standby's WAL) plus a timestamp:
1. the latest LSN written out of the WAL buffer on the standby (OS cache)
2. the latest LSN flushed to the WAL segment files on disk
3. the latest LSN replayed on the standby
4. the timestamp of the response from the replica to the primary.
7. How do you make multiple Standby synchronous with Primary?
Answer:
https://www.interdb.jp/pg/pgsql11.html
Step 1: provide a name for the cluster in the parameter cluster_name of each standby.
Step 2: provide those standby names in the following parameter of the primary:
synchronous_standby_names = '<standby1>,<standby2>';
8. How would you prevent stalling of the primary if a synchronous standby is halted?
Answer:
- Halting of the synchronous standby will result in stalling of the primary, because commits wait until
the ACKNOWLEDGEMENT message from the synchronous standby arrives.
- To unfreeze the primary, the parameter should be cleared with synchronous_standby_names = '' and
the configuration should be reloaded:
pg_ctl -D $PGDATA reload
9. How does PostgreSQL ensure ‘NO DATA LOSS’ in case the primary crashes and one of standby
needs to switchover?
Answer:
- By setting the parameter 'synchronous_standby_names' to the standby names in a comma-
separated string.
- The standbys' names are set by configuring the parameter 'cluster_name' in each standby's
postgresql.conf.
- The first name in 'synchronous_standby_names' is the SYNC standby, and the
primary must get the ACK message from the SYNC standby, which contains the latest LSN for:
WAL written to the segment file (OS cache),
WAL flushed to the segment file on disk,
WAL replayed on the standby by the startup process,
plus the time the ACK was sent.
The parameter synchronous_commit = 'on' should be configured for NO DATA LOSS.
10. What are the steps that the primary PostgreSQL Cluster executes when a transaction commits
when no standby host is involved?
Answer:
6 steps:
1. ExtendCLOG(): allocates a CLOG (commit log) entry in the CLOG buffer (memory), after creating a
page if necessary.
2. heap_insert(): makes heap entries (table) / B-tree entries (index).
3. XLogInsert(): writes the WAL entries into the WAL buffer.
4. finish_xact_command(),
XLogInsert(): writes the commit information into the WAL buffer.
5. XLogWrite(): flushes the WAL entries to the WAL segment (disk).
6. TransactionIdCommitTree(): the CLOG entry is updated to committed.

11: What are the steps that Primary Server executes when there is replication involved?
Answer:
1. Primary: backend process - LSN -> LSN_1 XLogInsert()
LSN_1 -> LSN_2 finish_xact_command(), XLogInsert()
WAL buffer flush.
2. Primary: WALSender sends the entries (disk) -> WALReceiver (standby)
3. Primary: SyncRepWaitForLSN() - the primary waits for the ACK response from the standby
4. Standby: write() - WAL segment (OS buffer) —> ACK1 sent to the primary
5. Standby: fsync() - WAL segment (physical storage/file) —> ACK2 sent to the primary
6. Standby: the Startup process replays the WAL entries stored in the standby's WAL segments.
7. Primary: upon receipt of ACK1 (when synchronous_commit = 'remote_write') / ACK2 (when
synchronous_commit = 'on'), the backend completes SyncRepWaitForLSN(), releasing the latch, and the
session can process transactions again.
12. What are the ‘SYNC_STATE’ states that a Standby can have in the primary?
Answer:
1. sync - the first standby in the CSV list of the parameter synchronous_standby_names
2. potential - the rest of the standbys in the CSV list of the parameter mentioned above.
3. async - a standby that is not mentioned in the synchronous_standby_names list.
13. What happens if the STANDBY which is SYNC sends ACK later than the STANDBY that is
potential?
Answer:
Primary will only consider ACK from the standby that is SYNC.
14. How is a standby failure detected?
Answer:
The parameter wal_sender_timeout (default 60 seconds) determines that.
If the standby does not post a heartbeat within this time, the primary terminates the WALSENDER
process that is meant to send the STREAM to the corresponding WALRECEIVER of the standby that is
not responding.
15. What happens when the SYNC standby fails?
Answer:
The potential STANDBY (with the priority next to the failing SYNC standby) will be promoted to
SYNC.
28th December 2021 Backup and Point In Time Recovery
================
1. Base Backup
- pg_start_backup() - (i) initiates FPW mode
(ii) checkpoints the database
(iii) switches the WAL segment log
(iv) creates a backup label file under the data directory.
- pg_stop_backup() - (i) resets to non-FPW mode
(ii) writes a backup-end XLOG entry
(iii) switches the WAL file
(iv) creates the backup history file (in pg_wal/pg_xlog)
(v) deletes the backup label file (after it is included in the backup set - pg_wal/pg_xlog)
- The backup label file contains:
1. backup checkpoint information -
the start of the backup,
the location of the checkpoint in the XLOG
2. the backup label name
- The backup history file contains:
1. the contents of the backup label file
2. the timestamp of the pg_stop_backup() execution.
- The backup label file itself is named backup_label (in the data directory).
- Sample backup history file name: {WAL segment}.{offset at the time of backup start}.backup
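A hedged sketch of the low-level backup API named above (signatures per PostgreSQL 9.6-14; the functions were renamed pg_backup_start/pg_backup_stop in 15; label hypothetical):
SELECT pg_start_backup('nightly', false, false);  -- fast = false, exclusive = false
-- ... copy $PGDATA with an OS tool ...
SELECT pg_stop_backup(false);                     -- ends the non-exclusive backup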
2. Point In Time Recovery [ B - backup label file, 5R - redo point, restore_command,
recovery_target_time, recovery.conf, recovery.signal, T - timeline history ]
a. Backup label file - contains [the recovery start checkpoint & the backup start time]
b. Redo point - obtained from the backup label file; the location in the WAL log where recovery
starts.
c. restore_command - command to copy archived files to the restore location: restore_command =
'cp /mnt/server/archivedir/%f %p'
d. recovery_target_time - end time of the recovery [empty will result in recovery until the logs are
exhausted]. Example: recovery_target_time = '2021-12-18 12:05 GMT'.
e. recovery.conf - contains (c) & (d) [restore_command & recovery_target_time]
f. recovery.signal - PG 12: recovery.conf is gone; an empty recovery.signal file in the $PGDATA
directory triggers restore and recovery, and its contents move to postgresql.conf.
g. timeline history file - the contents of the backup label file are copied over to the pg_wal directory,
and then, if the archive_command parameter is set, the backup history file is copied over to the
archive destination.
Additional parameters:
recovery_end_command — shell command that will be executed once at the end of recovery.
recovery_min_apply_delay (default 0) — minimum delay for applying changes during recovery.
recovery_target — set to 'immediate' to end recovery as soon as a consistent state is reached.
recovery_target_action (default pause) — action to perform upon reaching the recovery target.
recovery_target_inclusive (default on) — whether to include or exclude transactions with the
recovery target.
recovery_target_lsn — LSN of the write-ahead log location up to which recovery will proceed.
recovery_target_name — named restore point up to which recovery will proceed.
recovery_target_timeline (default latest) — the timeline to recover into.
recovery_target_xid — transaction ID up to which recovery will proceed.
STEPS:
- recovery.conf / (postgresql.conf + recovery.signal) — will contain two parameters:
1. restore_command
2. recovery_target_time
- PostgreSQL, upon start, will find either RECOVERY.CONF or RECOVERY.SIGNAL and go into
recovery mode.
- Restore all the DATAFILES and WAL log segments to their respective locations.
- Backup checkpoint - from the backup label file, through the function read_backup_label().
- REDO POINT from the backup checkpoint. [start WAL location - backup label file]
- For each WAL entry from the WAL logs retrieved from the ARCHIVE location and kept in a temporary
location under $PGDATA, the timestamp is compared with RECOVERY_TARGET_TIME, if it is set.
- Once a WAL log has been processed and applied to the PostgreSQL cluster, it is deleted to
preserve space.
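A hedged consolidated sketch for PG 12+ (paths and timestamp reuse the examples above):
# postgresql.conf
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target_time = '2021-12-18 12:05:00 GMT'
# then create the trigger file and start the server
touch $PGDATA/recovery.signal
pg_ctl -D $PGDATA start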
3. TIMELINE and TIMELINE HISTORY
TimeLineID: 4 byte unsigned int - starting from 1.
- For each recovery , the timeline is incremented by 1.
TimeLine History: “8 digit new TimeLineId”.history
Details that it holds:
1. timeLineID
2. LSN - of WAL switch
3. Reason - why timeline changed.
Example:
TimeLineID - 1
TimeLine - ‘00000001’
Sequence: Database incident that requires recovery at time T1 —>
Restore the backup taken at T1 - x minutes —>
Recover data till time T1 [ recovery.conf/postgresql.conf will have the parameters
RECOVERY_TARGET_TIME & RESTORE_COMMAND ] —>
Completion of recovery ->
TimeLineID increases from 1 —> 2
——
If the cluster is recovered using two archive files:
a. '000000010000000000000009'
b. '00000001000000000000000A'
- the freshly recovered database will get timeLineID 2, and a new WAL file
corresponding to the last WAL log file applied, with the increased timeLineID:
'00000002000000000000000A'.
Second PITR on the same database - sequence:
Database incident that requires recovery at time T1 + y —>
Restore the backup taken at (T1 + y) - x —>
[ Note: beyond the first recovery, the mandatory extra parameter that needs to be mentioned is
RECOVERY_TARGET_TIMELINE, so now the parameters would be:
RECOVERY_TARGET_TIME
RESTORE_COMMAND
RECOVERY_TARGET_TIMELINE
Example:
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target_time = '2021-12-18 12:15:00 GMT'
recovery_target_timeline = 2
]
Redo point information - from - the backup label file
Recovery target time - from - recovery.conf/postgresql.conf
WAL logs from the archive location with the timeline marked as 2 —>
timeline increased from 2 to 3 —>
WAL log created with timeline id 3.
If the cluster is recovered using two archive files:
a. '000000010000000000000009'
b. '00000002000000000000000A'
- the freshly recovered database will get timeLineID 3, and a new WAL file
corresponding to the last WAL log file applied, with the increased timeLineID:
'00000003000000000000000A'.
