Professional Documents
Culture Documents
https://docs.yugabyte.com/latest/architecture/docdb/persistence/
Persistence
Once data is replicated using Raft across a majority of the YugabyteDB tablet-peers, it is applied to each
tablet peer’s local DocDB document storage layer.
Storage Model
This storage layer is a persistent key to object/document store. The storage model is shown in the figure
below:
The keys and the corresponding document values are described below.
DocDB key
The keys in DocDB document model are compound keys consisting of one or more hash organized
components, followed by zero or more ordered (range) components. These components are stored in
their data type specific sort order; both ascending and descending sort order is supported for each
ordered component of the key.
DocDB value
The values in DocDB document data model can be:
primitive types: such as int32, int64, double, text, timestamp, etc.
non-primitive types (sorted maps): These objects map scalar keys to values, which could be
either scalar or sorted maps as well.
This model allows multiple levels of nesting, and corresponds to a JSON-like format. Other data structures
like lists, sorted sets etc. are implemented using DocDB’s object type with special key encodings. In
DocDB, hybrid timestamps of each update are recorded carefully, so that it is possible to recover the state
of any document at some point in the past. Overwritten or deleted versions of data are garbage-collected
as soon as there are no transactions reading at a snapshot at which the old value would be visible.
Encoding documents
The documents are stored using a key-value store based on RocksDB, which is typeless. The documents
are converted to multiple key-value pairs along with timestamps. Because documents are spread across
many different key-values, it’s possible to partially modify them cheaply.
SubKey1 = {
SubKey2 = Value1
SubKey3 = Value2
},
SubKey4 = Value3
Keys stored in RocksDB consist of a number of components, where the first component is a "document
key", followed by a few scalar components, and finally followed by a MVCC timestamp (sorted in reverse
order). Each component in the DocumentKey, SubKey, and Value, are PrimitiveValues, which are just
(type, value) pairs, which can be encoded to and decoded from strings. When encoding primitive values in
keys, a binary-comparable encoding is used for the value, so that sort order of the encoding is the same
as the sort order of the value.
Assume that the example document above was written at time T10 entirely. Internally the above
example’s document is stored using 5 RocksDB key value pairs:
DocumentKey1, T10 -> {} // This is an init marker
DocumentKey1, SubKey1, T10 -> {}
Deletions of Documents and SubDocuments are performed by writing a single Tombstone marker at the
corresponding value. During compaction, overwritten or deleted values are cleaned up to reclaim space.
Each data type supported in YSQL (or YCQL) is represented by a unique byte. The type prefix is also
present in the primary key’s hash or range components
We use a binary-comparable encoding to translate the value for each YCQL type to strings that go to the
KV-Store.
Furthermore, YCQL has a distinction between rows created using Insert vs Update. We keep track of this
difference (and row level TTLs) using a "liveness column", a special system column invisible to the user. It
is added for inserts, but not updates: making sure the row is present even if all non-primary key columns
are deleted only in the case of inserts.
YCQL - Collection type example
Consider the following YCQL table schema:
CREATE TABLE msgs (user_id text ,
msg_id int ,
msg text ,
COPY
Insert a row
T1: INSERT INTO msgs (user_id, msg_id, msg, msg_props)
'hello'});
The entries in DocDB at this point will look like the following:
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
COPY
The entries in DocDB at this point will look like the following:
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
COPY
The entries in DocDB at this point will look like the following:
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
Delete a row
Delete a single column from a row.
T4: DELETE msg_props
FROM msgs
WHERE user_id = 'user1'
AND msg_id = 10 ;
COPY
Even though, in the example above, the column being deleted is a non-primitive column (a map), this
operation only involves adding a delete marker at the correct level, and does not incur any read overhead.
The logical layout in DocDB at this point is shown below.
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
Note: The KVs that are displayed in “strike-through” font are logically deleted.
Note: The above is not the physical layout per se, as the writes happen in a log-structured manner. When
compactions happen, the space for the KVs corresponding to the deleted columns is reclaimed, as shown
below.
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
views int,
category text,
Although they are added out of order, we get a sorted view of the items in the key value store when
reading, as shown below:
(h1, key1), T1 -> 15, value1
Using an iterator, it is easy to reconstruct the hash and set contents efficiently.