MVCC and Transactions
Multi-Version Concurrency Control (MVCC) is AxiomDB’s mechanism for deciding which
row versions are visible to a given statement or transaction. This page documents
the current implementation: the RowHeader format, the actual
TransactionSnapshot type, the single-active-transaction TxnManager, and the
server’s Arc<RwLock<Database>> concurrency model.
Implementation status: current code implements snapshot visibility, READ COMMITTED and REPEATABLE READ semantics, rollback/savepoints, deferred page reclamation, and concurrent read-only queries. It does not yet implement row-level writer concurrency, deadlock detection,
SELECT ... FOR UPDATE, or full SSI. Those are planned in Phases 13.7, 13.8, and 13.8b.
Core Concepts
Transaction ID (TxnId)
Every explicit transaction receives a unique, monotonically increasing u64
identifier. The value 0 means “no active write transaction” and is used by
autocommit reads.
Transaction Snapshot
A snapshot is the compact visibility token used by the current runtime.
#![allow(unused)]
fn main() {
pub struct TransactionSnapshot {
pub snapshot_id: u64,
pub current_txn_id: u64,
}
}
Meaning:
snapshot_id = max_committed + 1at the moment the snapshot is takencurrent_txn_id = txn_idof the active transaction, or0for read-only / autocommit reads
A row version is visible when:
txn_id_created == current_txn_idortxn_id_created < snapshot_id- and
txn_id_deleted == 0, ortxn_id_deleted >= snapshot_idand the delete was not performed bycurrent_txn_id
RowHeader — Per-Row Versioning
Every heap tuple begins with a RowHeader:
Offset Size Field Description
──────── ────── ─────────────── ───────────────────────────────────────────────
0 8 txn_id_created transaction that inserted this row version
8 8 txn_id_deleted transaction that deleted this row (0 = live)
16 4 row_version incremented on UPDATE
20 4 _flags reserved for future use
Total: 24 bytes
The full lifecycle of a row version:
INSERT in txn T1:
RowHeader { txn_id_created: T1, txn_id_deleted: 0, row_version: 0 }
DELETE in txn T2:
RowHeader { txn_id_created: T1, txn_id_deleted: T2, row_version: 0 }
UPDATE in txn T2 (implemented as DELETE + INSERT):
Old version: RowHeader { txn_id_created: T1, txn_id_deleted: T2, row_version: N }
New version: RowHeader { txn_id_created: T2, txn_id_deleted: 0, row_version: N+1 }
Batch DELETE and Full-Table DELETE
When a DELETE has a WHERE clause, TableEngine::delete_rows_batch() collects all
matching (page_id, slot_id) pairs and calls HeapChain::delete_batch() with them.
Each affected slot receives xmax = txn_id and deleted = 1 in a single pass per
page. The WAL receives one WalEntry::Delete per matched row (for correct per-row
redo/undo).
When a DELETE has no WHERE clause or is a TRUNCATE TABLE, the executor takes a
different path:
HeapChain::scan_rids_visible()collects live(page_id, slot_id)pairs without decoding row data.HeapChain::delete_batch()marks all slots dead in O(P) page I/O.- A single
WalEntry::Truncateis appended to the WAL instead of N per-row Delete entries.
The MVCC visibility result is identical to the per-row path: every slot has
xmax = txn_id and deleted = 1, so any snapshot with xmax ≤ txn_id will see
the row as deleted after the transaction commits. Concurrent readers that took their
snapshot before this transaction began continue to see all rows as live throughout
the delete — standard snapshot isolation.
Visibility Function
#![allow(unused)]
fn main() {
fn is_visible(row: &RowHeader, snap: &TransactionSnapshot, self_txn_id: u64) -> bool {
let created_visible =
row.txn_id_created == self_txn_id || row.txn_id_created < snap.snapshot_id;
let not_deleted =
row.txn_id_deleted == 0
|| (row.txn_id_deleted >= snap.snapshot_id
&& row.txn_id_deleted != self_txn_id);
created_visible && not_deleted
}
}
TxnManager
The current TxnManager is a single-active-transaction coordinator. Read-only
operations access it via shared refs for snapshot creation; mutating operations
access it via &mut TxnManager for begin/commit/rollback.
#![allow(unused)]
fn main() {
pub struct TxnManager {
wal: WalWriter,
next_txn_id: u64,
max_committed: u64,
active: Option<ActiveTxn>,
}
}
This is the main reason the current server runtime is still single-writer for
mutating statements: there is only one ActiveTxn slot for the whole opened
database, not one write transaction owner per connection.
BEGIN
1. Verify `active.is_none()`
2. Assign `txn_id = next_txn_id`
3. Append `Begin` to the WAL
4. Set `active = Some(ActiveTxn { txn_id, snapshot_id_at_begin, ... })`
5. Increment `next_txn_id`
COMMIT
1. Append `Commit` to the WAL
2. Flush/fsync via the current durability policy or fsync pipeline
3. Advance `max_committed`
4. Clear `active`
ROLLBACK
1. Replay undo ops in reverse order
2. Append `Rollback` to the WAL
3. Clear `active`
Copy-on-Write B+ Tree and MVCC
The B+ Tree’s CoW semantics interact naturally with MVCC. When a writer creates a new page for an insert, concurrent readers continue accessing the old tree structure through the old root pointer they loaded at query start. The old pages are freed only when the writer’s root swap is complete AND all readers that loaded the old root have finished.
Since Phase 7.4, old pages enter the deferred free queue instead of being returned to the freelist immediately. This allows concurrent readers to continue accessing old tree structures through their snapshot while the writer has already swapped the root. Pages are released for reuse only when no active reader snapshot predates the free operation.
Current Server Lock Model (Phase 7.4 / 7.5)
The server wraps Database in Arc<RwLock<Database>>:
- SELECT, SHOW, system variable queries acquire a read lock (
db.read()). Multiple readers execute concurrently with zero coordination. - INSERT, UPDATE, DELETE, DDL, BEGIN/COMMIT/ROLLBACK acquire a write lock
(
db.write()). Only one writer at a time. - A read that already started keeps its snapshot while a writer commits.
- New mutating statements queue behind the write lock at whole-database granularity.
- Row-level locking is not implemented yet. That work starts in Phase 13.7.
The read-only executor path (execute_read_only_with_ctx) takes &dyn StorageEngine
(shared ref) and &TxnManager (shared ref), ensuring it cannot mutate any state.
Isolation Levels — Implementation
READ COMMITTED
On every statement start within a transaction, a new snapshot is taken. The
TransactionSnapshot passed to the analyzer and executor is refreshed per statement.
REPEATABLE READ
The snapshot is taken once at BEGIN and held for the entire transaction’s lifetime.
All statements use the same snapshot.
The default isolation level is REPEATABLE READ (matching MySQL’s default). Autocommit single-statement queries always use READ COMMITTED semantics since there is only one statement to see.
INSERT … SELECT — Snapshot Isolation
INSERT INTO target SELECT ... FROM source executes the SELECT under the same
snapshot that was fixed at BEGIN. This is critical for correctness:
The Halloween problem is a classic database bug where an INSERT ... SELECT
on the same table re-reads rows it just inserted, causing an infinite loop (the
database inserts rows, those rows qualify the SELECT condition, they get inserted
again, ad infinitum).
AxiomDB prevents this automatically through MVCC snapshot semantics:
- The snapshot is fixed at
BEGIN:snapshot_id = max_committed + 1 - Rows inserted by this statement get
txn_id_created = current_txn_id - The MVCC visibility rule: a row is visible only if
txn_id_created < snapshot_id - Since
current_txn_id ≥ snapshot_id, newly inserted rows are never visible to the SELECT scan within the same transaction
Before BEGIN: source = {row_A (xmin=1), row_B (xmin=2)}
Snapshot taken: snapshot_id = 3
INSERT INTO source SELECT * FROM source:
SELECT sees: row_A (1 < 3 ✅), row_B (2 < 3 ✅) → 2 rows
Inserts: row_C (xmin=3), row_D (xmin=3) → 3 ≮ 3 ❌ not re-read
SELECT stops: only 2 original rows were seen
After COMMIT: source = {row_A, row_B, row_C, row_D} ← exactly 4 rows
This also means rows inserted by a concurrent transaction that commits after
this transaction’s BEGIN are not seen by the SELECT — consistent snapshot
throughout the entire INSERT operation.
MVCC on Secondary Indexes (Phase 7.3b)
Secondary indexes store (key, RecordId) pairs — they do not contain transaction
IDs or version information. Visibility is always determined at the heap via the row’s
txn_id_created / txn_id_deleted fields.
Lazy Index Deletion
When a row is DELETEd, non-unique secondary index entries are not removed. The
heap row is marked deleted (txn_id_deleted = T), and the index entry becomes a
“dead” entry. Readers filter dead entries via is_slot_visible() during index scans.
Unique, primary key, and FK auto-indexes still have their entries deleted immediately because the B-Tree enforces key uniqueness internally.
UPDATE and Dead Entries
When an UPDATE changes an indexed column:
- Unique/PK/FK indexes: old entry deleted, new entry inserted (immediate)
- Non-unique indexes: old entry left in place (lazy), new entry inserted
Both old and new entries coexist in the B-Tree. The old entry points to a heap row
whose values no longer match the index key; is_slot_visible() filters it out.
Heap-Aware Uniqueness
When inserting into a unique index, if the key already exists, AxiomDB checks heap
visibility before raising a UniqueViolation. If the existing entry points to a dead
row (deleted or uncommitted), the insert proceeds — dead entries don’t block re-use
of the same key value.
HOT Optimization
If an UPDATE does not change any column that participates in any secondary index, all index maintenance is skipped for that row — no B-Tree reads or writes. This is inspired by PostgreSQL’s Heap-Only Tuple (HOT) optimization.
ROLLBACK Support
Every new index entry (from INSERT or UPDATE) is recorded as
UndoOp::UndoIndexInsert in the transaction’s undo log. On ROLLBACK, these entries
are physically removed from the B-Tree. Old entries (from lazy delete) were never
removed, so they’re naturally restored.
Vacuum
Dead index entries accumulate until vacuum removes them. A dead entry is one where
is_slot_visible(entry.rid, oldest_active_snapshot) returns false — the pointed-to
heap row is deleted and no active snapshot can see it.
VACUUM — Dead Row and Index Cleanup (Phase 7.11)
The VACUUM command physically removes dead rows and dead index entries:
VACUUM orders; -- vacuum a specific table
VACUUM; -- vacuum all tables
Heap Vacuum
For each page in the heap chain, VACUUM finds slots where txn_id_deleted != 0
and txn_id_deleted < oldest_safe_txn (the deletion is committed and no active
snapshot can see it). These slots are zeroed via mark_slot_dead(), making them
invisible to read_tuple() without even reading the RowHeader.
Under the current Arc<RwLock<Database>> architecture, oldest_safe_txn = max_committed + 1 — all committed deletions are safe because no reader holds an
older snapshot.
Index Vacuum
For each catalog-visible B-Tree index, VACUUM performs a full B-Tree scan and checks heap visibility for each entry. Dead entries (pointing to vacuumed or deleted heap slots) are batch-deleted from the B-Tree. If bulk delete rotates the root, the updated root page ID is persisted back to the catalog in the same transaction.
Clustered Vacuum
Clustered tables now use a different purge path:
- descend to the leftmost clustered leaf, then walk
next_leaf - purge leaf cells whose
txn_id_deleted != 0 && txn_id_deleted < oldest_safe_txn - free any overflow chain owned by the purged row
- conditionally defragment clustered leaves with high freeblock waste
- scan clustered secondary indexes and delete only entries whose PK bookmark no longer resolves to a physically present clustered row
This last rule is important: clustered secondary cleanup uses physical existence after purge, not snapshot visibility. An uncommitted clustered delete is invisible to the writer snapshot, but it is not safe to purge.
What VACUUM Does Not Do (Yet)
- VACUUM FULL / table rewrite: heap pages still do slot-level cleanup only, while clustered pages only do local defragmentation; there is no full-table rewrite pass yet.
- Automatic triggering: VACUUM must be invoked manually via SQL. Autovacuum with threshold-based triggering is planned.
LP_UNUSED and
updates a free space map (FSM) so the space can be reused. AxiomDB keeps the
heap path simpler: dead slots are zeroed but heap pages are not compacted or
tracked through an FSM yet. Clustered VACUUM now does local leaf defragmentation
instead, while full-table rewrite remains a separate enhancement.
Epoch-Based Page Reclamation (Phase 7.8)
When a writer performs Copy-on-Write on a B-Tree node, old pages are deferred
(not immediately freed) because a concurrent reader might still reference them.
The SnapshotRegistry tracks which snapshots are active across all connections:
#![allow(unused)]
fn main() {
pub struct SnapshotRegistry {
slots: Vec<AtomicU64>, // slot[conn_id] = snapshot_id or 0
}
}
- Register: connection sets its slot before executing a read query
- Unregister: connection clears its slot after the query completes
- oldest_active(): returns the minimum non-zero slot, or
u64::MAXif idle
On flush(), the storage layer calls release_deferred_frees(oldest_active())
to return only pages freed before the oldest active snapshot to the freelist.
Pages freed after the oldest snapshot remain queued until all readers advance.
lowest_active_start via an active transaction list.
InnoDB uses clone_oldest_view() to merge all active ReadViews.
AxiomDB uses a fixed-size atomic slot array (1024 slots) — O(N) scan without
locking. Under the current RwLock model all slots are 0 during flush (writer
has exclusive access), so the behavior is identical to the previous
u64::MAX sentinel. The infrastructure is forward-compatible with
future concurrent reader+writer models.
Clustered UPDATE In-Place Undo (Phase 39.22)
UndoClusteredFieldPatch
When fused_clustered_scan_patch applies zero-allocation in-place UPDATE
(writing only the changed field bytes directly into the page buffer), the WAL
undo log records a UndoClusteredFieldPatch entry instead of the full-row-image
UndoClusteredRestore:
#![allow(unused)]
fn main() {
UndoClusteredFieldPatch {
table_id: u32,
key: Vec<u8>, // PK bytes for leaf descent
old_header: RowHeader, // txn_id_created, row_version to restore
field_deltas: Vec<FieldDelta>, // each carries [u8;8] old_bytes
}
}
On ROLLBACK, the handler:
- Looks up the clustered root via
clustered_roots - Descends to the owning leaf via
clustered_tree::descend_to_leaf_pub - Searches for the cell by PK key via
clustered_leaf::search - For each
FieldDelta, computesfield_abs_off = row_data_abs_off + delta.offsetand callspatch_field_in_placewithdelta.old_bytes[..delta.size] - Restores the
RowHeaderviaupdate_row_header_in_place
This is O(fields_changed × 1) per row, vs O(row_size) for a full UndoClusteredRestore.
FieldDelta Inline Arrays
FieldDelta stores field bytes as fixed-size [u8;8] arrays (InnoDB field
values for fixed-size types are at most 8 bytes for BIGINT/REAL):
#![allow(unused)]
fn main() {
// Before Phase 39.22
pub struct FieldDelta { pub offset: u16, pub size: u8, pub old_bytes: Vec<u8>, pub new_bytes: Vec<u8> }
// After Phase 39.22
pub struct FieldDelta { pub offset: u16, pub size: u8, pub old_bytes: [u8;8], pub new_bytes: [u8;8] }
}
WAL serialization writes only size bytes, so the on-disk format is identical.
Recovery code reads back (u16, u8, &[u8]) tuples and copies them into [u8;8]
arrays — no heap allocation during recovery either.
UndoClusteredFieldPatch stores them as inline [u8;8] arrays in the undo log entry — no heap allocation per field per row. For UPDATE t SET score = score + 1 across 25K rows, this eliminates ~50K allocations vs the old Vec<u8>-per-delta approach.
⚠️ Planned: Serializable Snapshot Isolation (Phase 7)
SSI detects read-write dependencies between concurrent transactions and aborts transactions that form a dangerous cycle. The implementation follows the algorithm from Cahill et al. (2008):
- Each transaction tracks its
rw-antidependencies(read sets and write sets). - At commit time, if the dependency graph contains a dangerous cycle (two transactions
where each reads something the other wrote), one transaction is aborted with
40001 serialization_failure.
SSI provides true serializability (the strongest isolation level) with overhead proportional to the number of concurrent transactions and conflicts, not to the total number of rows.