Storage Engine
The storage engine is the lowest user-accessible layer in AxiomDB. It manages raw 16-kilobyte pages on disk or in memory, provides a freelist for page allocation, and exposes a simple trait that all higher layers depend on.
The StorageEngine Trait
#![allow(unused)]
fn main() {
pub trait StorageEngine: Send + Sync {
    fn read_page(&self, page_id: u64) -> Result<PageRef, DbError>;
    fn write_page(&self, page_id: u64, page: &Page) -> Result<(), DbError>;
    fn alloc_page(&self, page_type: PageType) -> Result<u64, DbError>;
    fn free_page(&self, page_id: u64) -> Result<(), DbError>;
    fn flush(&self) -> Result<(), DbError>;
    fn page_count(&self) -> u64;
    fn prefetch_hint(&self, start_page_id: u64, count: u64) { ... }
    fn set_current_snapshot(&self, snapshot_id: u64) { ... }
    fn deferred_free_count(&self) -> usize { ... }
}
}
All methods take &self — there is no &mut self anywhere in the trait. Mutable state
is managed entirely through interior mutability:
- `write_page`: acquires a per-page exclusive `RwLock` (from `PageLockTable`) for the duration of the `pwrite(2)` call. Two transactions writing different pages proceed fully in parallel with zero contention.
- `alloc_page`: acquires `Mutex<FreeList>` only during the bitmap scan (microseconds), then acquires the page lock to initialise the new page.
- `free_page`: acquires `Mutex<FreeList>` briefly to add the page to the free bitmap.
- `flush`: acquires `Mutex<FreeList>` to persist the freelist, then calls `fdatasync`.
This design mirrors InnoDB (`buf_page_get_gen` with a per-page `block_lock`, no `&mut` on the buffer pool) and PostgreSQL (a per-buffer atomic state field; `MarkBufferDirty` is `&self`-equivalent): in both engines, a shared buffer pool guarded by per-page locks.
AxiomDB follows the same pattern: a sharded PageLockTable (64 shards, one
RwLock<HashMap<u64, Arc<RwLock<()>>>> per shard) eliminates the global
&mut self bottleneck and is the architectural unlock for concurrent writer support
in phases 40.4–40.12.
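The sharded lock table described above can be sketched as follows. The 64-shard layout matches the text; the method names (`new`, `lock_for`) and the lazy-creation strategy are illustrative assumptions, not AxiomDB's exact API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Sketch of a 64-shard page-lock table. A page's lock lives in shard
/// `page_id % 64`, so lookups for different pages rarely touch the same map.
pub struct PageLockTable {
    shards: Vec<RwLock<HashMap<u64, Arc<RwLock<()>>>>>,
}

impl PageLockTable {
    pub fn new() -> Self {
        PageLockTable {
            shards: (0..64).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    /// Returns the lock guarding `page_id`, creating it on first use.
    pub fn lock_for(&self, page_id: u64) -> Arc<RwLock<()>> {
        let shard = &self.shards[(page_id % 64) as usize];
        // Fast path: shared read of the shard map.
        if let Some(l) = shard.read().unwrap().get(&page_id) {
            return l.clone();
        }
        // Slow path: exclusive insert, only on first access to this page.
        shard
            .write()
            .unwrap()
            .entry(page_id)
            .or_insert_with(|| Arc::new(RwLock::new(())))
            .clone()
    }
}

fn main() {
    let table = PageLockTable::new();
    let a = table.lock_for(7);
    // The same page id always yields the same underlying lock.
    assert!(Arc::ptr_eq(&a, &table.lock_for(7)));
    // Different pages get independent locks.
    assert!(!Arc::ptr_eq(&a, &table.lock_for(8)));
}
```

Callers hold the returned `Arc` and take a read or write guard on it for the duration of the page access; the shard maps themselves are locked only for the lookup.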
read_page returns an owned PageRef — a heap-allocated copy of the 16 KB page data.
This is a deliberate change from the original &Page borrow: owned pages survive mmap
remaps (during grow()) and page reuse (after free_page), which is essential for
concurrent read/write access. The copy cost is ~0.5 us from L2/L3 cache — the same cost
PostgreSQL pays when copying a page from the buffer pool into backend-local memory.
Page Format
Every page is exactly PAGE_SIZE = 16,384 bytes (16 KB). The first HEADER_SIZE = 64
bytes are the page header; the remaining PAGE_BODY_SIZE = 16,320 bytes are the body.
Page Header — 64 bytes
Offset Size Field Description
──────── ────── ──────────────── ──────────────────────────────────────
0 8 magic `PAGE_MAGIC` — identifies valid pages
8 1 page_type PageType enum (see below)
9 1 flags page flags (`PAGE_FLAG_ALL_VISIBLE`, future bits)
10 2 item_count item/slot count for the page-local format
12 4 checksum CRC32c of body bytes `[HEADER_SIZE..PAGE_SIZE]`
16 8 page_id This page's own ID (self-identifying)
24 8 lsn Log Sequence Number of last write
32 2 free_start First free byte offset in the body (format-specific)
34 2 free_end Last free byte offset in the body (format-specific)
36 28 _reserved Future use
Total: 64 bytes
The CRC32c checksum covers only the page body [HEADER_SIZE..PAGE_SIZE], not the
header itself. On every read_page, AxiomDB verifies the checksum and returns
DbError::ChecksumMismatch if it fails.
Page Types
#![allow(unused)]
fn main() {
pub enum PageType {
Meta = 0, // page 0: database header + catalog roots
Data = 1, // heap pages holding table rows
Index = 2, // current fixed-slot B+ Tree internal and leaf nodes
Overflow = 3, // continuation pages for large values
Free = 4, // freelist / unused pages
ClusteredLeaf = 5, // slotted clustered leaf: full PK row inline
ClusteredInternal = 6, // slotted clustered internal: varlen separators
}
}
Clustered Page Primitives (Phase 39.1 / 39.2 / 39.3)
The clustered index rewrite is landing in the storage layer first. Two new page types now exist even though the SQL executor still uses the classic heap + secondary-index path:
- `ClusteredLeaf` — slotted page with variable-size cells storing: `key_len`, `row_len`, inline `RowHeader`, primary-key bytes, row payload bytes
- `ClusteredInternal` — slotted page with variable-size separator cells storing: `right_child`, `key_len`, separator key bytes
ClusteredInternal keeps one extra child pointer in the header as
leftmost_child, so logical child access still follows the classical B-tree
rule n keys -> n + 1 children.
ClusteredInternal body:
[16B header: is_leaf | num_cells | cell_content_start | freeblock_offset | leftmost_child]
[cell pointer array]
[free gap]
[cells: right_child | key_len | key_bytes]
That design keeps the storage primitive compatible with the current traversal contract:
- `find_child_idx(search_key)` returns the index of the first separator strictly greater than the key
- `child_at(0)` reads `leftmost_child`
- `child_at(i > 0)` reads the `right_child` of separator cell `i - 1`
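The traversal contract above can be sketched with a binary search over in-memory separators. The function name follows the text; representing separators as a slice of byte strings is a simplifying assumption:

```rust
/// Index of the first separator strictly greater than `key`. That index is
/// also the logical child slot: 0 means leftmost_child, i > 0 means the
/// right_child of separator cell i - 1.
fn find_child_idx(separators: &[&[u8]], key: &[u8]) -> usize {
    // Count of separators <= key == position of the first separator > key.
    separators.partition_point(|sep| *sep <= key)
}

fn main() {
    let seps: Vec<&[u8]> = vec![&b"b"[..], &b"d"[..]];
    assert_eq!(find_child_idx(&seps, b"a"), 0); // leftmost_child
    assert_eq!(find_child_idx(&seps, b"b"), 1); // right_child of separator 0
    assert_eq!(find_child_idx(&seps, b"z"), 2); // right_child of separator 1
}
```

This preserves the classical invariant that `n` separators address `n + 1` children.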
SQL-Visible Clustered DDL + INSERT Boundary (Phases 39.13 / 39.14)
The storage rewrite is no longer purely internal. CREATE TABLE now uses the
clustered root when the SQL definition contains an explicit PRIMARY KEY:
- `TableDef.root_page_id` is the generic primary row-store root
- `TableDef.storage_layout` tells higher layers whether that root is heap or clustered
- heap tables still allocate `PageType::Data`
- clustered tables now allocate `PageType::ClusteredLeaf`
- logical PRIMARY KEY metadata on clustered tables points at that same clustered root
The first SQL-visible clustered write paths now exist too:
- `INSERT` on explicit-`PRIMARY KEY` tables routes directly into `clustered_tree::insert(...)` or `restore_exact_row_image(...)`
- clustered `AUTO_INCREMENT` bootstraps from clustered rows instead of heap scans
- non-primary clustered indexes are maintained as PK bookmarks through `axiomdb-sql::clustered_secondary`
- `SELECT` on clustered tables now routes through `clustered_tree::lookup(...)` / `range(...)` and decodes clustered secondary bookmarks back into PK probes
- `UPDATE` on clustered tables now routes through clustered candidate discovery plus `update_in_place(...)` / `update_with_relocation(...)`
- `DELETE` on clustered tables now routes through clustered candidate discovery plus `delete_mark(...)` and exact-row-image WAL
- pending heap batches flush before the clustered statement boundary so the new clustered branch does not inherit heap staging semantics accidentally
SQL-visible clustered maintenance is now partially live:
- clustered `VACUUM` now physically purges safe dead rows and overflow chains
- `ALTER TABLE ... REBUILD` now migrates legacy heap+PRIMARY KEY tables into a fresh clustered root and rebuilt clustered-secondary bookmark roots
- clustered standalone `CREATE INDEX` / `ANALYZE` remain later Phase 39 work
Clustered maintenance now includes the first purge path:
- `VACUUM` walks the clustered leaf chain from the leftmost leaf
- safe delete-marked cells are physically removed from clustered leaves
- overflow chains are freed during that purge
- secondary bookmark cleanup uses clustered physical existence after leaf purge, not caller-snapshot visibility
- any secondary root rotation caused by `delete_many_in(...)` is persisted back to the catalog in the same transaction
- clustered rebuild flushes the newly built clustered / secondary roots before the catalog swap and defers old heap/index page reclamation until commit
This mirrors the established engines: SQLite `WITHOUT ROWID` inserts target the PK B-tree directly, and InnoDB treats the clustered key as the row identity. AxiomDB now does the same for SQL-visible clustered INSERT instead of manufacturing a heap row plus compatibility index entry.
Clustered Tree Insert Controller (Phase 39.3)
axiomdb-storage::clustered_tree now builds the first tree-level write path on
top of these page primitives. The public entry point is:
#![allow(unused)]
fn main() {
pub fn insert(
storage: &mut dyn StorageEngine,
root_pid: Option<u64>,
key: &[u8],
row_header: &RowHeader,
row_data: &[u8],
) -> Result<u64, DbError>
}
The controller is still storage-first:
- Bootstrap an empty tree into a `ClusteredLeaf` root when `root_pid` is `None`.
- Descend through `ClusteredInternal` pages with `find_child_idx()`.
- Materialize a clustered leaf descriptor:
  - small rows stay fully inline
  - large rows keep a local prefix inline and spill the tail bytes to overflow pages
- Insert that descriptor into the target leaf in sorted key order.
- If the descriptor does not fit, defragment once and retry before splitting.
- Split leaves by cumulative cell byte volume, not by cell count.
- Propagate `(separator_key, right_child_pid)` upward.
- Split internal pages by cumulative separator byte volume and create a new root if the old root overflows.
Split behavior deliberately keeps the old page ID as the left half and allocates only the new right sibling. That matches the current no-concurrent-clustered-writer reality and keeps parent maintenance minimal until the later MVCC/WAL phases wire clustered pages into the full engine.
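The byte-volume split rule can be sketched as choosing the first index where the left half's cumulative encoded size reaches half the page's total cell volume. The function name and signature here are illustrative:

```rust
/// Returns the number of cells that stay in the left (old) page after a
/// split, balancing by encoded byte volume rather than cell count.
fn split_point(cell_sizes: &[usize]) -> usize {
    let total: usize = cell_sizes.iter().sum();
    let mut acc = 0;
    for (i, &sz) in cell_sizes.iter().enumerate() {
        acc += sz;
        if acc * 2 >= total {
            return i + 1; // cells [0..=i] remain in the left half
        }
    }
    cell_sizes.len()
}

fn main() {
    // One huge cell followed by small ones splits right after the huge
    // cell, instead of at the midpoint of the cell count.
    assert_eq!(split_point(&[9000, 100, 100, 100]), 1);
    // Uniform cells degrade to the classic halfway split.
    assert_eq!(split_point(&[100; 8]), 4);
}
```

Balancing by bytes matters for variable-size clustered cells: a count-based split could leave one sibling nearly full and the other nearly empty.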
Since 39.10, rows above the local inline budget are no longer rejected. The
leaf keeps the primary key and RowHeader inline, stores only a bounded local
row prefix on-page, and spills the remaining tail bytes to a dedicated
PageType::Overflow chain.
Clustered Point Lookup (Phase 39.4)
axiomdb-storage::clustered_tree::lookup(...) is now the first read path over
the clustered tree:
#![allow(unused)]
fn main() {
pub fn lookup(
storage: &dyn StorageEngine,
root_pid: Option<u64>,
key: &[u8],
snapshot: &TransactionSnapshot,
) -> Result<Option<ClusteredRow>, DbError>
}
Lookup flow:
- Return `None` immediately when the tree has no root.
- Descend clustered internal pages with `find_child_idx()` and `child_at()`.
- Run exact-key binary search on the target clustered leaf.
- Read the leaf descriptor `(key, RowHeader, total_row_len, local_prefix, overflow_ptr?)`.
- Apply `RowHeader::is_visible(snapshot)`.
- If the row is overflow-backed, reconstruct the logical row bytes by reading the overflow-page chain.
- Return an owned `ClusteredRow` on a visible hit.
In 39.4, lookup is intentionally conservative about invisible rows: when the
current inline version fails MVCC visibility, it returns None instead of
trying to synthesize an older version. Clustered undo/version-chain traversal
for arbitrary snapshots still does not exist; 39.11 adds rollback/savepoint
restore for clustered writes, but not older-version reconstruction on reads.
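A minimal model of the visibility rule applied in step 5 can be sketched as follows. The row field names follow the clustered `RowHeader` (`txn_id_created` / `txn_id_deleted`); the snapshot model (own transaction ID plus a committed high-water mark) is a simplifying assumption, not AxiomDB's exact `TransactionSnapshot`:

```rust
struct RowHeader {
    txn_id_created: u64,
    txn_id_deleted: u64, // 0 = never deleted
}

struct Snapshot {
    txn_id: u64,        // the reading transaction's own id
    max_committed: u64, // highest txn id committed before the snapshot
}

impl RowHeader {
    fn is_visible(&self, s: &Snapshot) -> bool {
        // Created by us, or committed before our snapshot.
        let created_visible =
            self.txn_id_created == s.txn_id || self.txn_id_created <= s.max_committed;
        // Not deleted, or deleted by a transaction our snapshot cannot see.
        let not_deleted = self.txn_id_deleted == 0
            || (self.txn_id_deleted != s.txn_id && self.txn_id_deleted > s.max_committed);
        created_visible && not_deleted
    }
}

fn main() {
    let s = Snapshot { txn_id: 10, max_committed: 8 };
    // Committed before the snapshot, never deleted: visible.
    assert!(RowHeader { txn_id_created: 5, txn_id_deleted: 0 }.is_visible(&s));
    // Deleted by a later transaction: still visible to this snapshot.
    assert!(RowHeader { txn_id_created: 5, txn_id_deleted: 12 }.is_visible(&s));
    // Created after the snapshot by another transaction: invisible.
    assert!(!RowHeader { txn_id_created: 11, txn_id_deleted: 0 }.is_visible(&s));
}
```

When this check fails for the current inline version, lookup returns `None` rather than chasing older versions, as described above.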
Clustered Range Scan (Phase 39.5)
axiomdb-storage::clustered_tree::range(...) is now the first ordered multi-row
read path over clustered pages:
#![allow(unused)]
fn main() {
pub fn range<'a>(
storage: &'a dyn StorageEngine,
root_pid: Option<u64>,
from: Bound<Vec<u8>>,
to: Bound<Vec<u8>>,
snapshot: &TransactionSnapshot,
) -> Result<ClusteredRangeIter<'a>, DbError>
}
Range flow:
- Return an empty iterator when the tree is empty or the bound interval is empty.
- For bounded scans, descend to the first relevant leaf with the same clustered internal-page search path used by point lookup.
- For unbounded scans, descend to the leftmost leaf.
- Start at the first in-range slot within that leaf.
- Yield owned `ClusteredRow` values in primary-key order.
- Skip current inline versions that are invisible to the supplied snapshot.
- Follow `next_leaf` to continue the scan across leaves.
- Stop immediately when the first key above the upper bound is seen.
The iterator stays lazy: it keeps only the current leaf page id, slot index, bound copies, and snapshot. It does not materialize the whole range into a temporary vector.
When the iterator advances to another leaf, it calls
StorageEngine::prefetch_hint(next_leaf_pid, 4). The 4-page window is
intentionally conservative: large enough to overlap sequential leaf reads, but
small enough not to flood the page cache while clustered scans are still an
internal storage primitive.
Like 39.4, this subphase is still honest about missing older-version
reconstruction. If a row’s current inline version is invisible, 39.5 skips
it; the new 39.11 rollback support does not change read semantics yet.
Zero-Allocation Full Scan (scan_all_callback, Phase 39.21)
ClusteredRangeIter::next() allocates two heap buffers per row:
cell.key.to_vec() (primary key copy) and reconstruct_row_data (row bytes
copy). For a full-table scan that only needs to decode the row bytes into
Vec<Value>, both allocations are unnecessary.
scan_all_callback bypasses the iterator entirely:
#![allow(unused)]
fn main() {
pub fn scan_all_callback<F>(
storage: &dyn StorageEngine,
root_pid: Option<u64>,
snapshot: &TransactionSnapshot,
mut f: F,
) -> Result<(), DbError>
where
F: FnMut(&[u8], Option<(u64, usize)>) -> Result<(), DbError>,
}
The callback receives (inline_data: &[u8], overflow):
- `inline_data`: a borrow of `cell.row_data` directly from the leaf page memory — no copy.
- `overflow`: `Some((first_overflow_page, tail_len))` for rows that spill to overflow pages; `None` for rows that fit inline (the common case for most tables).
For inline rows the callback can decode inline_data in place. The caller
allocates only one Vec<Value> per visible row — the decoded output — compared
to three allocations with the iterator path.
A `GROUP BY age` / `AVG(score)` query on 50K rows of a clustered table dropped from 57 ms to 4.0 ms (a 14.25× improvement) after switching from `ClusteredRangeIter` to `scan_all_callback`. The bottleneck was ~150K heap allocations per scan (key copy + row copy + `Vec<Value>`). The callback path eliminates the first two, leaving only the `Vec<Value>` per row. AxiomDB now runs this query 1.6× faster than MariaDB (6.5 ms) and 2.2× faster than MySQL (8.9 ms) on the same hardware.
Clustered Overflow Pages (Phase 39.10)
Phase 39.10 adds the first overflow-page primitive dedicated to clustered
rows:
Leaf cell:
[key_len: u16]
[total_row_len: u32]
[RowHeader: 24B]
[key bytes]
[local row prefix]
[overflow_first_page?: u64]
Overflow page body:
[next_overflow_page: u64]
[payload bytes...]
The contract is intentionally physical:
- Keep the primary key and `RowHeader` inline in the clustered leaf.
- Keep only a bounded local row prefix inline.
- Spill the remaining logical row tail to `PageType::Overflow` pages.
- Reconstruct the full logical row only on read paths (`lookup`, `range`) or update paths that need the logical bytes.
- Let split / merge / rebalance move the physical descriptor without rewriting the overflow payload.
Phase 39.10 itself intentionally did not introduce generic TOAST
references, compression, or crash recovery for overflow chains. 39.11 now
adds in-process clustered WAL/rollback over those row images, but clustered
crash recovery still stays in later phases.
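Reconstructing a spilled row from the layout above amounts to concatenating the inline local prefix with the payloads of the chained overflow pages. In this sketch a `HashMap` stands in for `StorageEngine` page reads, and page ID 0 terminates the chain — both simplifying assumptions:

```rust
use std::collections::HashMap;

/// Rebuild the logical row bytes: inline prefix ++ overflow payloads,
/// following next_overflow_page links until the chain ends (pid 0).
fn reconstruct_row(
    local_prefix: &[u8],
    first_overflow: u64,
    pages: &HashMap<u64, (u64, Vec<u8>)>, // page_id -> (next_overflow_page, payload)
) -> Vec<u8> {
    let mut row = local_prefix.to_vec();
    let mut pid = first_overflow;
    while pid != 0 {
        let (next, payload) = &pages[&pid];
        row.extend_from_slice(payload);
        pid = *next;
    }
    row
}

fn main() {
    let mut pages = HashMap::new();
    pages.insert(7, (9, b"DEF".to_vec())); // page 7 chains to page 9
    pages.insert(9, (0, b"GH".to_vec()));  // page 9 ends the chain
    assert_eq!(reconstruct_row(b"ABC", 7, &pages), b"ABCDEFGH");
}
```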
Clustered WAL and Rollback (Phase 39.11)
Phase 39.11 adds the first WAL contract that understands clustered rows:
key = primary-key bytes
old_value = ClusteredRowImage? // exact old row image
new_value = ClusteredRowImage? // exact new row image
Where ClusteredRowImage carries:
- the latest clustered `root_pid`
- the exact inline `RowHeader`
- the exact logical row bytes, regardless of whether the row is inline or overflow-backed on page
TxnManager now tracks the latest clustered root per table_id during the
active transaction. Rollback and savepoint undo use that root plus two storage
helpers:
- `delete_physical_by_key(...)` to undo a clustered insert
- `restore_exact_row_image(...)` to undo clustered delete-mark or update
The restore invariant is logical row state, not exact page topology. Split,
merge, or relocate-update may still leave a different physical tree shape after
rollback as long as the old primary key, RowHeader, and row bytes are back.
Phase 39.12 now extends that same contract into clustered crash recovery:
open_with_recovery() undoes in-progress clustered writes by PK + exact row
image, and open() rebuilds committed clustered roots from surviving WAL
history on a clean reopen.
Clustered Update In Place (Phase 39.6)
axiomdb-storage::clustered_tree::update_in_place(...) is now the first
clustered-row write path after insert:
#![allow(unused)]
fn main() {
pub fn update_in_place(
storage: &mut dyn StorageEngine,
root_pid: Option<u64>,
key: &[u8],
new_row_data: &[u8],
txn_id: u64,
snapshot: &TransactionSnapshot,
) -> Result<bool, DbError>
}
Update flow:
- Return `false` when the tree is empty, the key is absent, or the current inline version is not visible to the supplied snapshot.
- Descend to the owning clustered leaf by primary key.
- Build a new inline `RowHeader` with: `txn_id_created = txn_id`, `txn_id_deleted = 0`, `row_version = old.row_version + 1`.
- Materialize a replacement descriptor: inline row, or local-prefix + overflow chain.
- Ask the leaf primitive to rewrite that exact cell while preserving key order.
- Persist the leaf if the rewrite stays inside the same page.
- Free the obsolete overflow chain only after a successful physical rewrite.
- Return `HeapPageFull` when the replacement row would require leaving the current leaf.
The leaf primitive has two rewrite modes:
- overwrite fast path when the replacement encoded cell fits the existing cell budget
- same-leaf rebuild fallback when the row grows, but the leaf can still be rebuilt compactly with the replacement row in place
Neither path changes the primary key, pointer-array order, parent separators, or
next_leaf.
This keeps the subphase honest about what now exists:
- clustered insert
- clustered point lookup
- clustered range scan
- clustered same-leaf update
- clustered delete-mark
And what still does not:
- clustered older-version reconstruction/version chains
- clustered root persistence beyond WAL checkpoint/rotation
- clustered physical purge
- clustered SQL executor integration
Clustered Delete Mark (Phase 39.7)
axiomdb-storage::clustered_tree::delete_mark(...) now adds the first logical
delete path over clustered pages:
#![allow(unused)]
fn main() {
pub fn delete_mark(
storage: &mut dyn StorageEngine,
root_pid: Option<u64>,
key: &[u8],
txn_id: u64,
snapshot: &TransactionSnapshot,
) -> Result<bool, DbError>
}
Delete flow:
- Return `false` when the tree is empty, the key is absent, or the current inline version is not visible to the supplied snapshot.
- Descend to the owning clustered leaf by primary key.
- Build a replacement `RowHeader` that preserves `txn_id_created`, `row_version`, `_flags`, and stamps `txn_id_deleted = txn_id`.
- Rewrite the exact clustered cell in place while preserving key bytes and row payload bytes.
- Persist the leaf page without changing `next_leaf` or parent separators.
The important semantic boundary is that clustered delete is currently a
header-state transition, not space reclamation. The physical cell stays on
the leaf page so snapshots older than the delete can still observe it through
the existing RowHeader::is_visible(...) rule.
Clustered Structural Rebalance (Phase 39.8)
axiomdb-storage::clustered_tree::update_with_relocation(...) adds the first
clustered structural-maintenance path:
#![allow(unused)]
fn main() {
pub fn update_with_relocation(
storage: &mut dyn StorageEngine,
root_pid: Option<u64>,
key: &[u8],
new_row_data: &[u8],
txn_id: u64,
snapshot: &TransactionSnapshot,
) -> Result<Option<u64>, DbError>
}
Control flow:
- Validate that the replacement row still fits inline on a clustered leaf.
- Try `update_in_place(...)` first.
- If the same-leaf rewrite returns `HeapPageFull`, reload the visible current row and enter the structural path.
- Physically remove the exact clustered cell from the tree.
- Bubble `underfull` and `min_changed` upward:
  - repair the parent separator when a non-leftmost child changes its minimum key
  - redistribute or merge clustered leaf siblings by encoded byte volume
  - redistribute or merge clustered internal siblings while preserving `n keys -> n + 1 children`
- Collapse an empty internal root to its only child.
- Reinsert the replacement row with bumped `row_version`.
The key design boundary is that 39.8 introduces private structural delete
only for relocate-update. Public clustered delete is still delete_mark(...),
so snapshot-safe purge remains a later concern.
Current limitations:
- `delete_mark(...)` still keeps dead clustered cells inline; 39.8 does not expose purge to SQL or storage callers yet.
- relocate-update still rewrites only the current inline version.
- parent separator repair currently assumes the repaired separator still fits in the existing internal page budget; split-on-separator-repair is deferred.
Clustered Secondary Bookmarks (Phase 39.9)
Phase 39.9 adds the first clustered-first secondary-index layout in
axiomdb-sql/src/clustered_secondary.rs.
The physical key is:
secondary_logical_key ++ missing_primary_key_columns
Where:
- `secondary_logical_key` is the ordered value vector of the secondary index columns.
- `missing_primary_key_columns` are only the PK columns that are not already present in the secondary key.
That means the physical secondary entry now carries enough information to
recover the owning clustered row by primary key without depending on a heap
RecordId.
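The key composition above can be sketched as a simple concatenation that skips PK columns already carried by the secondary key. Column values are treated here as pre-encoded byte strings, a simplifying assumption — the real encoding is order-preserving and type-aware:

```rust
/// Build a bookmark-bearing secondary key: secondary columns, then only the
/// PK columns not already present among the secondary columns.
fn bookmark_key(
    secondary_cols: &[(&str, &[u8])],
    pk_cols: &[(&str, &[u8])],
) -> Vec<u8> {
    let mut key = Vec::new();
    for (_, v) in secondary_cols {
        key.extend_from_slice(v);
    }
    for (name, v) in pk_cols {
        // Skip PK columns the secondary key already carries.
        if !secondary_cols.iter().any(|(n, _)| n == name) {
            key.extend_from_slice(v);
        }
    }
    key
}

fn main() {
    // Secondary index on (email), PK (id): bookmark = email ++ id.
    let k = bookmark_key(
        &[("email", b"a@x".as_slice())],
        &[("id", b"\x01".as_slice())],
    );
    assert_eq!(k, b"a@x\x01");
    // A secondary key that already contains the PK column appends nothing.
    let k2 = bookmark_key(&[("id", b"\x01".as_slice())], &[("id", b"\x01".as_slice())]);
    assert_eq!(k2, b"\x01");
}
```

Because the bookmark is the primary key itself, a secondary probe resolves to the clustered row with one tree descent and no heap `RecordId` indirection.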
The dedicated helpers now provide:
- layout derivation from `(secondary_idx, primary_idx)`
- encode/decode of bookmark-bearing secondary keys
- logical-prefix bounds without a fixed 10-byte RID suffix
- insert/delete/update maintenance where relocate-only updates become no-ops if the logical secondary key and primary key stay stable
Current boundary:
- this path is not wired into the heap-backed SQL executor yet
- FK enforcement and index-integrity rebuilds still use the old `RecordId`-based secondary path
- the legacy `RecordId` payload in `axiomdb-index::BTree` remains only a compatibility artifact for this path
MmapStorage — Memory-Mapped File
MmapStorage uses a hybrid I/O model inspired by SQLite: read-only mmap for reads,
pwrite() for writes. The mmap is opened with memmap2::Mmap (not MmapMut),
making it structurally impossible to write through the mapped region.
Physical file (axiomdb.db):
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │
│ (Meta) │ (Data) │ (Index) │ (Data) │ │
└──────────┴──────────┴──────────┴──────────┴──────────┘
↑ ↑ ↓
│ └── read_page(1): copy 16KB from mmap → owned PageRef
└── mmap (read-only, MAP_SHARED)
write_page(3): pwrite() to file descriptor
Read path: mmap + PageRef copy
read_page(page_id) computes mmap_ptr + page_id * 16384, copies 16 KB into a
heap-allocated PageRef, verifies the CRC32c checksum, and returns the owned copy.
The copy cost (~0.5 us from L2/L3 cache) is the same price PostgreSQL pays when
copying a buffer pool page into backend-local memory.
Write path: pwrite() to file descriptor
write_page(page_id, page) calls pwrite() on the underlying file descriptor at
offset page_id * 16384. The mmap (MAP_SHARED) automatically reflects the change
on subsequent reads. Note that a 16 KB pwrite() is not crash-atomic on 4 KB-block
filesystems — the Doublewrite Buffer protects against torn pages.
Flush: doublewrite + fsync
flush() follows a two-phase write protocol:
- Doublewrite phase: all dirty pages (plus pages 0 and 1) are serialized to a `.dw` file and fsynced. This creates a durable copy of the committed state.
- Main fsync: the freelist is pwritten (if modified) and the main `.db` file is fsynced. If this fsync is interrupted by a crash, the `.dw` file provides repair data on the next startup.
- Cleanup: the `.dw` file is deleted. If deletion fails, the next `open()` finds all pages valid and removes it.
Trade-offs:
- We cannot control which pages stay hot in memory (the OS uses LRU).
- On 32-bit systems, the address space limits the maximum database size. On 64-bit, the address space is effectively unlimited.
- `PageRef` copies add ~0.5 us per page read vs. direct pointer access, but this eliminates use-after-free risks from mmap remap and page reuse.
Deferred Page Free Queue
When free_page(page_id) is called, the page does not return to the freelist
immediately. Instead it enters an epoch-tagged queue: deferred_frees: Vec<(page_id, freed_at_snapshot)>. Each entry records the snapshot epoch at which the page became
unreachable. release_deferred_frees(oldest_active_snapshot) only releases pages
whose freed_at_snapshot <= oldest_active_snapshot — pages freed more recently remain
queued because a concurrent reader might still hold a snapshot that references them.
Under the current Arc<RwLock<Database>> architecture, flush() passes u64::MAX
(release all) because the writer holds exclusive access and no readers are active.
When snapshot slot tracking is added (Phase 7.8), the actual oldest active snapshot
will be used instead. The queue is capped at 4096 entries with a tracing warning
to detect snapshot leaks.
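The epoch-tagged release rule can be sketched directly from the description above. The struct and method names are illustrative; the release predicate (`freed_at_snapshot <= oldest_active_snapshot`) follows the text:

```rust
/// Sketch of the deferred-free queue: pages are released to the freelist
/// only once no active snapshot can still reference them.
struct DeferredFrees {
    queue: Vec<(u64, u64)>, // (page_id, freed_at_snapshot)
}

impl DeferredFrees {
    /// Drain entries safe to release; keep the rest queued.
    fn release(&mut self, oldest_active_snapshot: u64) -> Vec<u64> {
        let (ready, keep): (Vec<_>, Vec<_>) = self
            .queue
            .drain(..)
            .partition(|&(_, freed_at)| freed_at <= oldest_active_snapshot);
        self.queue = keep;
        ready.into_iter().map(|(pid, _)| pid).collect()
    }
}

fn main() {
    let mut df = DeferredFrees { queue: vec![(3, 10), (4, 20), (5, 15)] };
    // A reader still holds snapshot 15: only pages freed at epochs <= 15 go.
    assert_eq!(df.release(15), vec![3, 5]);
    assert_eq!(df.queue, vec![(4, 20)]);
    // u64::MAX releases everything — the current flush() behaviour.
    assert_eq!(df.release(u64::MAX), vec![4]);
}
```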
Doublewrite Buffer
A 16 KB pwrite() is not crash-atomic on any modern filesystem with 4 KB
internal blocks (APFS, ext4, XFS, ZFS). A power failure mid-write leaves a torn
page: the first N×4 KB contain new data, the remainder holds the previous state.
CRC32c detects this corruption on startup, but without a repair source the database
cannot open.
The doublewrite (DW) buffer solves this. Before every flush(), all dirty pages
are serialized to a .dw file alongside the main .db file:
database.db ← main data file
database.db.dw ← doublewrite buffer (transient, exists only during flush)
DW File Format
[Header: 16 bytes]
magic: "AXMDBLWR" (8 bytes)
version: u32 LE = 1
slot_count: u32 LE
[Slots: slot_count × 16,392 bytes each]
page_id: u64 LE
page_data: [u8; 16384]
[Footer: 8 bytes]
file_crc: CRC32c(header || all slots)
sentinel: 0xDEAD_BEEF
Flush Protocol
1. Collect dirty pages + pages 0 and 1 from the mmap
2. Write all to .dw file → single sequential write
3. fsync .dw file ← committed copy durable
4. pwrite freelist to main file
5. fsync main file ← main data durable
6. Delete .dw file ← cleanup (non-fatal on failure)
Startup Recovery
On MmapStorage::open(), if a .dw file exists:
- Validate the DW file (magic, version, size, CRC, sentinel)
- For each slot: read the corresponding page from the main file
- If CRC is invalid (torn page) → restore from DW copy
- fsync the main file → repairs durable
- Delete the `.dw` file
Recovery is idempotent: if interrupted, the DW file is still valid and the next startup reruns recovery. Pages already repaired have valid CRCs and are skipped.
InnoDB originally stored its doublewrite buffer inside the system tablespace; MySQL 8.0.20 moved it into dedicated `#ib_*.dblwr` files for better sequential I/O and zero impact on the tablespace format. AxiomDB follows this newer approach: the DW file is sequential-write-only, does not change the main file format, and requires no migration for existing databases.
Dirty Page Tracking and Targeted Flush
MmapStorage tracks every page written since the last flush() in a
PageDirtyTracker (an in-memory HashSet<u64>). On flush(), instead of
calling mmap.flush() (which issues msync over the entire file), AxiomDB
coalesces the dirty page IDs into contiguous runs and issues one flush_range
call per run.
Coalescing algorithm
PageDirtyTracker::contiguous_runs() sorts the dirty IDs and merges adjacent
IDs into (start_page, run_length) pairs:
#![allow(unused)]
fn main() {
// Dirty pages: {2, 3, 5, 6, 7} → runs: [(2, 2), (5, 3)]
// Byte ranges: [(2*16384, 32768), (5*16384, 49152)]
}
The merge is O(n log n) on the number of dirty pages and produces the minimum
number of msync syscalls for any given dirty set.
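The coalescing step can be sketched directly from the example in the code comment above (the standalone function name is illustrative; in AxiomDB it is a method on `PageDirtyTracker`):

```rust
/// Sort dirty page IDs and merge adjacent IDs into (start_page, run_length)
/// pairs — the minimum number of contiguous flush ranges for the set.
fn contiguous_runs(dirty: &[u64]) -> Vec<(u64, u64)> {
    let mut ids: Vec<u64> = dirty.to_vec();
    ids.sort_unstable();
    ids.dedup();
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for id in ids {
        match runs.last_mut() {
            // Extends the current run when the next ID is adjacent.
            Some((start, len)) if *start + *len == id => *len += 1,
            _ => runs.push((id, 1)),
        }
    }
    runs
}

fn main() {
    // The example from the text: {2, 3, 5, 6, 7} -> [(2, 2), (5, 3)].
    assert_eq!(contiguous_runs(&[5, 2, 7, 3, 6]), vec![(2, 2), (5, 3)]);
    assert_eq!(contiguous_runs(&[]), vec![]);
}
```

Each `(start, len)` pair then maps to one byte range `(start * 16384, len * 16384)` for a single `flush_range` call.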
Freelist integration
When the freelist changes (alloc_page, free_page), freelist_dirty is set.
On flush(), the freelist bitmap is serialized into page 1 first, and page 1 is
added to the effective flush set even if it was not already in the dirty tracker.
Only after all targeted flushes succeed are freelist_dirty and the dirty
tracker cleared. A partial failure leaves both intact so the next flush() can
retry safely.
Disk-full error classification
Every durable I/O call in flush() (and in create()/grow()) passes its
std::io::Error through classify_io() before returning:
#![allow(unused)]
fn main() {
// axiomdb-core/src/error.rs
// ENOSPC (28) and EDQUOT (122 on Linux, 69 on macOS/BSD) become
// DbError::DiskFull; everything else stays a generic DbError::Io.
pub fn classify_io(err: std::io::Error, operation: &'static str) -> DbError {
    match err.raw_os_error() {
        Some(28) | Some(69) | Some(122) => DbError::DiskFull { operation },
        _ => DbError::Io(err),
    }
}
}
When a DiskFull error propagates out of MmapStorage, the server runtime
transitions to read-only degraded mode — all subsequent mutating statements
are rejected immediately without re-entering the storage layer.
Invariants
- `flush()` returns `Ok(())` only after all dirty pages are durable.
- Dirty tracking is cleared only on success — never on failure.
- The freelist page (page 1) is always included when `freelist_dirty` is set, regardless of whether it appears in the tracker.
- `dirty_page_count()` always reflects the count since the last successful flush.
- `ENOSPC` / `EDQUOT` errors are always surfaced as `DbError::DiskFull`, never silently wrapped in `DbError::Io`.
Verified Open — Corruption Detection at Startup
MmapStorage::open() validates every allocated page before making the
database available. The startup sequence is:
- Map the file and verify page 0 (meta) — magic, version, page count.
- Load the freelist from page 1 and verify its checksum.
- Scan pages `2..page_count`, skipping any page the freelist marks as free. For each allocated page, call `read_page_from_mmap()`, which re-computes the CRC32c of the body and compares it to the stored `header.checksum`.
#![allow(unused)]
fn main() {
for page_id in 2..page_count {
if !freelist.is_free(page_id) {
Self::read_page_from_mmap(&mmap, page_id)?;
}
}
}
If any page fails, open() returns DbError::ChecksumMismatch { page_id, expected, got }
immediately. No connection is accepted and no Db handle is returned.
Free pages are skipped because they are never written by the storage engine and therefore have no valid page header or checksum. Scanning them would produce false positives on a freshly created or partially filled database.
Recovery wiring
Both the network server (Database::open) and the embedded handle (Db::open)
route through TxnManager::open_with_recovery() on every reopen:
#![allow(unused)]
fn main() {
let (txn, _recovery) = TxnManager::open_with_recovery(&mut storage, &wal_path)?;
}
This ensures WAL replay runs before the first query is executed, even if the
only change in this subphase is the corruption scan. Bypassing
open_with_recovery() with the older TxnManager::open() was an oversight
that this subphase closes.
MemoryStorage — In-Memory for Tests
MemoryStorage stores pages in a Vec<Box<Page>>. It implements the same
StorageEngine trait as MmapStorage. All unit tests for the B+ Tree, WAL,
and catalog use MemoryStorage, so they run without touching the filesystem.
#![allow(unused)]
fn main() {
let mut storage = MemoryStorage::new();
let id = storage.alloc_page(PageType::Data)?;
let mut page = Page::new(PageType::Data, id);
page.body_mut()[0] = 0xAB;
page.update_checksum();
storage.write_page(id, &page)?;
let read_back = storage.read_page(id)?;
assert_eq!(read_back.body()[0], 0xAB);
}
FreeList — Page Allocation
The FreeList tracks which pages are free using a bitmap. The bitmap is stored in a
dedicated page (or pages, for large databases). Each bit corresponds to one page:
1 = free, 0 = in use.
Allocation
Scans left-to-right for the first 1 bit, clears it, and returns the page ID.
Bitmap: 1110 1101 ...
↑
First free: page 0 (bit 0 = 1)
After allocation: 0110 1101 ...
Deallocation
Sets the bit corresponding to page_id back to 1. Returns
DbError::DoubleFree if the bit was already 1 (guard against bugs in the
caller).
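The allocation and deallocation rules above can be sketched over a raw byte bitmap. The struct shape and error representation are illustrative, not AxiomDB's exact `FreeList`:

```rust
/// Bitmap freelist sketch: bit i of byte i/8 covers page i; 1 = free.
struct FreeList {
    bits: Vec<u8>,
}

impl FreeList {
    /// First-fit scan: find the first set (free) bit, clear it, return it.
    fn alloc(&mut self) -> Option<u64> {
        for (byte_idx, byte) in self.bits.iter_mut().enumerate() {
            if *byte != 0 {
                let bit = byte.trailing_zeros() as u64;
                *byte &= !(1 << bit);
                return Some(byte_idx as u64 * 8 + bit);
            }
        }
        None
    }

    /// Set the page's bit back to 1; error if it was already free.
    fn free(&mut self, page_id: u64) -> Result<(), &'static str> {
        let (byte, bit) = ((page_id / 8) as usize, page_id % 8);
        if self.bits[byte] & (1 << bit) != 0 {
            return Err("DoubleFree"); // bit already 1: guard against caller bugs
        }
        self.bits[byte] |= 1 << bit;
        Ok(())
    }
}

fn main() {
    let mut fl = FreeList { bits: vec![0b0000_0110] }; // pages 1 and 2 free
    assert_eq!(fl.alloc(), Some(1));
    assert_eq!(fl.alloc(), Some(2));
    assert_eq!(fl.alloc(), None); // bitmap exhausted
    assert!(fl.free(1).is_ok());
    assert_eq!(fl.free(1), Err("DoubleFree"));
}
```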
Invariants
- No page appears twice in the freelist.
- No page can be both allocated and in the freelist simultaneously.
- The freelist bitmap is itself stored in allocated pages (and tracked recursively during bootstrap).
Heap Pages — Slotted Format
Table rows (heap tuples) are stored in PageType::Data pages using a slotted page
layout. The slot array grows from the start of the body; tuples grow from the end
toward the center.
Body (16,320 bytes):
┌─────────────────────────────────────────────────────────────┐
│ Slot[0] │ Slot[1] │ ... │ free space │ ... │ Tuple[1] │ Tuple[0] │
└──────────────────────────────────────────────────────────────┘
↑ ↑ ↑
free_start free area free_end (decreases)
free_start points to the first unused byte after the last slot entry.
free_end points to the first byte of the last tuple written (counting from the
end of the body).
SlotEntry — 4 bytes
Offset Size Field
0 2 offset — byte offset of the tuple within the body (0 = empty slot)
2 2 length — total length of the tuple in bytes
A slot with offset = 0 and length = 0 is an empty (deleted) slot. Deleted slots
are reused when the page is compacted (VACUUM, planned Phase 9).
RowHeader — 24 bytes
Every heap tuple begins with a RowHeader that stores MVCC visibility metadata:
Offset Size Field
0 8 xmin — txn_id of the transaction that inserted this row
8 8 xmax — txn_id of the transaction that deleted/updated this row (0 = live)
16 1 deleted — 1 if this row has been logically deleted
17 7 _pad — alignment
Total: 24 bytes
After the RowHeader comes the null bitmap and the encoded column data (see Row Codec).
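The 24-byte layout in the table above maps directly onto a #[repr(C)] struct. This is an illustrative sketch; the real RowHeader lives in AxiomDB's storage crate and may differ in naming or derive attributes.

```rust
// Illustrative #[repr(C)] layout matching the RowHeader table above.
#[repr(C)]
#[allow(dead_code)]
struct RowHeader {
    xmin: u64,     // txn_id of the transaction that inserted this row
    xmax: u64,     // txn_id of the deleting/updating transaction (0 = live)
    deleted: u8,   // 1 if this row has been logically deleted
    _pad: [u8; 7], // alignment padding to reach 24 bytes
}
```

With #[repr(C)] the fields occupy exactly the offsets listed (0, 8, 16, 17), and the explicit padding keeps the total at 24 bytes.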
Null Bitmap in Heap Rows
The null bitmap is stored immediately after the RowHeader. It occupies
ceil(n_cols / 8) bytes. Bit i (zero-indexed) being 1 means column i is NULL.
5 columns → ceil(5/8) = 1 byte = 8 bits (bits 5-7 unused, always 0)
11 columns → ceil(11/8) = 2 bytes
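The size and bit-test rules above amount to two one-liners. These helper names are hypothetical, not AxiomDB's public API; the bit-within-byte ordering (bit i of byte i/8) is the natural reading of the text.

```rust
// Null bitmap occupies ceil(n_cols / 8) bytes.
fn null_bitmap_len(n_cols: usize) -> usize {
    (n_cols + 7) / 8
}

// Bit i (zero-indexed) set means column i is NULL.
fn is_null(bitmap: &[u8], col: usize) -> bool {
    bitmap[col / 8] & (1 << (col % 8)) != 0
}
```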
Page 0 — The Meta Page
Page 0 is the PageType::Meta page. It is written during database creation
(bootstrap) and read during open(). Its body contains:
Offset Size Field
0 8 format_version — AxiomDB file format version
8 8 catalog_root_page — Page ID of the catalog root (axiom_tables B+ Tree root)
16 8 freelist_root_page — Page ID of the freelist bitmap root
24 8 next_txn_id — Next transaction ID to assign
32 8 checkpoint_lsn — LSN of the last successful checkpoint
40 rest _reserved — Future extensions
On crash recovery, the checkpoint_lsn tells the WAL reader where to start replaying.
All WAL entries with LSN > checkpoint_lsn and belonging to committed transactions
are replayed.
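Reading the meta page reduces to five fixed-offset u64 loads from the body. A minimal sketch, assuming little-endian encoding (not stated in the table) and illustrative type names:

```rust
// Sketch of parsing the Meta page body at the offsets listed above.
#[allow(dead_code)]
struct MetaPage {
    format_version: u64,
    catalog_root_page: u64,
    freelist_root_page: u64,
    next_txn_id: u64,
    checkpoint_lsn: u64,
}

fn parse_meta(body: &[u8]) -> MetaPage {
    // Little-endian is an assumption here; adjust if the format differs.
    let u64_at = |off: usize| u64::from_le_bytes(body[off..off + 8].try_into().unwrap());
    MetaPage {
        format_version: u64_at(0),
        catalog_root_page: u64_at(8),
        freelist_root_page: u64_at(16),
        next_txn_id: u64_at(24),
        checkpoint_lsn: u64_at(32),
    }
}
```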
Batch Delete Operations
AxiomDB implements three optimizations for DELETE workloads that dramatically reduce page I/O and CRC32c computation overhead.
HeapChain::delete_batch()
delete_batch() accepts a slice of (page_id, slot_id) pairs and groups them by
page_id before touching any page. For each unique page it reads the page once,
marks all targeted slots dead in a single pass, then writes the page back once.
Naive per-row delete path (before delete_batch):
for each of N rows:
read_page(page_id) ← 1 read
mark slot dead ← 1 mutation
update_checksum(page) ← 1 CRC32c over 16 KB
write_page(page_id, page) ← 1 write
Total: 3N page operations
Batch path (delete_batch):
group rows by page_id → P unique pages
for each page:
read_page(page_id) ← 1 read
mark all M slots dead ← M mutations (M rows on this page)
update_checksum(page) ← 1 CRC32c (once per page, not per row)
write_page(page_id, page) ← 1 write
Total: 3P page operations (read + one checksum + write, each once per page)
At 200 rows/page, deleting 10,000 rows hits 50 pages. The naive path requires 30,000
page operations; delete_batch() requires 150.
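The grouping step that makes this possible is straightforward: bucket the (page_id, slot_id) pairs by page before touching storage. A sketch of that step alone (the function name is illustrative, not the real delete_batch internals):

```rust
use std::collections::BTreeMap;

// Bucket (page_id, slot_id) pairs by page so each page is read, patched,
// checksummed, and written exactly once.
fn group_by_page(rids: &[(u64, u16)]) -> BTreeMap<u64, Vec<u16>> {
    let mut groups: BTreeMap<u64, Vec<u16>> = BTreeMap::new();
    for &(page_id, slot_id) in rids {
        groups.entry(page_id).or_default().push(slot_id);
    }
    groups
}
```

A BTreeMap keeps the pages in ascending order, so the subsequent read/write loop visits them sequentially, which also plays well with the prefetch hint described later.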
mark_deleted() vs delete_tuple() — Splitting Checksum Work
heap::mark_deleted() is an internal function that stamps the slot as dead without
recomputing the page checksum. delete_tuple() (the single-row public API) calls
mark_deleted() followed immediately by update_checksum() — behavior is unchanged
for callers.
The batch path calls mark_deleted() N times (once per slot on a given page), then
calls update_checksum() exactly once when all slots on that page are done.
#![allow(unused)]
fn main() {
// Single-row path (public, unchanged):
pub fn delete_tuple(page: &mut Page, slot_id: u16) -> Result<(), DbError> {
mark_deleted(page, slot_id)?; // stamp dead
page.update_checksum(); // 1 CRC32c
Ok(())
}
// Batch path (called by delete_batch for each page):
for &slot_id in slots_on_this_page {
mark_deleted(page, slot_id)?; // stamp dead, no checksum
}
page.update_checksum(); // 1 CRC32c for all N slots on this page
}
Splitting mark_deleted from update_checksum makes the checksum cost O(P) in the number of pages, not O(N) in the number of rows. The same split was applied to insert_batch in Phase 3.17.
scan_rids_visible()
HeapChain::scan_rids_visible() is a variant of scan_visible() that returns only
(page_id, slot_id) pairs — no row data is decoded or copied.
#![allow(unused)]
fn main() {
pub fn scan_rids_visible(
&self,
storage: &dyn StorageEngine,
snapshot: &TransactionSnapshot,
self_txn_id: u64,
) -> Result<Vec<(u64, u16)>, DbError>
}
This is used by DELETE without a WHERE clause and TRUNCATE TABLE: both operations
need to locate every live slot but neither needs to decode the row’s column values.
Avoiding Vec<u8> allocation for each row’s payload cuts memory allocation to near
zero for full-table deletes.
HeapChain::clear_deletions_by_txn()
clear_deletions_by_txn(txn_id) is the undo helper for WalEntry::Truncate. It
scans the entire heap chain and, for every slot where txn_id_deleted == txn_id,
clears the deletion stamp (sets txn_id_deleted = 0, deleted = 0).
This is used during ROLLBACK and crash recovery when a WalEntry::Truncate must be
undone. The cost is O(P) page reads and writes for P pages in the chain — identical
to a full-table scan. Because recovery and rollback are infrequent relative to inserts
and deletes, this trade-off is acceptable (see WAL internals for the corresponding
WalEntry::Truncate design decision).
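The per-row rule is simple; here it is sketched over plain in-memory row metadata rather than real heap pages (struct and field names are illustrative):

```rust
// Illustrative undo pass: clear deletion stamps left by one transaction.
struct RowMeta {
    txn_id_deleted: u64,
    deleted: u8,
}

// Returns how many rows were un-deleted.
fn clear_deletions_by_txn(rows: &mut [RowMeta], txn_id: u64) -> usize {
    let mut cleared = 0;
    for row in rows.iter_mut() {
        if row.txn_id_deleted == txn_id {
            row.txn_id_deleted = 0; // clear the deletion stamp
            row.deleted = 0;
            cleared += 1;
        }
    }
    cleared
}
```

The real helper applies the same rule page by page, paying one checksum and one write per touched page, just like delete_batch.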
All-Visible Page Flag (Optimization A)
What it is
Bit 0 of PageHeader.flags (PAGE_FLAG_ALL_VISIBLE = 0x01). When set, it
asserts that every alive slot on the page was inserted by a committed transaction
and none have been deleted. Sequential scans can skip per-slot MVCC
txn_id_deleted tracking for those pages entirely.
Inspired by PostgreSQL’s visibility map (src/backend/access/heap/visibilitymap.c),
but implemented as an in-page bit rather than a separate VM file — a single
cache-line read suffices.
API
#![allow(unused)]
fn main() {
pub const PAGE_FLAG_ALL_VISIBLE: u8 = 0x01;
impl Page {
pub fn is_all_visible(&self) -> bool { ... } // reads bit 0 of flags
pub fn set_all_visible(&mut self) { ... } // sets bit 0; caller updates checksum
pub fn clear_all_visible(&mut self) { ... } // clears bit 0; caller updates checksum
}
}
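The stubbed methods above reduce to three single-bit operations on the flags byte. A sketch on a bare u8 rather than a real Page (free functions here stand in for the Page methods):

```rust
pub const PAGE_FLAG_ALL_VISIBLE: u8 = 0x01;

// Bit operations behind is_all_visible / set_all_visible / clear_all_visible.
fn is_all_visible(flags: u8) -> bool {
    flags & PAGE_FLAG_ALL_VISIBLE != 0
}

fn set_all_visible(flags: &mut u8) {
    *flags |= PAGE_FLAG_ALL_VISIBLE;
}

fn clear_all_visible(flags: &mut u8) {
    *flags &= !PAGE_FLAG_ALL_VISIBLE;
}
```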
Lazy-set during scan
HeapChain::scan_visible() sets the flag after verifying that all alive slots
on a page satisfy:
- txn_id_created <= max_committed (committed transaction)
- txn_id_deleted == 0 (not deleted)
This is a one-time write per page per table lifetime. After the first slow-path scan, every subsequent scan takes the fast path and skips per-slot checks.
Clearing on delete
heap::mark_deleted() clears the flag unconditionally as its very first
mutation — before stamping txn_id_deleted. Both changes land in the same
update_checksum() + write_page() call. There is no window where the flag is
set while a slot is deleted.
Read-only variant for catalog scans
HeapChain::scan_visible_ro() takes &dyn StorageEngine (immutable) and never
sets the flag. Used by CatalogReader and other callers that hold only a shared
reference. Catalog tables are small (a few pages) and not hot enough to warrant
the lazy-set write.
Sequential Scan Prefetch Hint (Optimization C)
What it is
StorageEngine::prefetch_hint(start_page_id, count) — a hint method telling
the backend that pages starting at start_page_id will be read sequentially.
Implementations that do not support prefetch provide a default no-op.
Inspired by PostgreSQL’s read_stream.c adaptive lookahead.
API
#![allow(unused)]
fn main() {
// Default no-op in the trait — all existing backends compile unchanged
fn prefetch_hint(&self, start_page_id: u64, count: u64) {}
}
MmapStorage overrides this with madvise(MADV_SEQUENTIAL) on macOS and Linux:
#![allow(unused)]
fn main() {
#[cfg(any(target_os = "linux", target_os = "macos"))]
fn prefetch_hint(&self, start_page_id: u64, count: u64) {
// SAFETY: ptr derived from live MmapMut, offset < mmap_len verified,
// clamped_len <= mmap_len - offset. madvise is a pure hint.
let _ = unsafe { libc::madvise(ptr, clamped_len, libc::MADV_SEQUENTIAL) };
}
}
count = 0 uses the backend default (PREFETCH_DEFAULT_PAGES = 64, 1 MB).
Call sites
HeapChain::scan_visible(), scan_rids_visible(), and delete_batch() each
call storage.prefetch_hint(root_page_id, 0) once before their scan loop. This
tells the OS kernel to begin async read-ahead for the pages that follow,
overlapping disk I/O with CPU processing of the current page.
When it helps
The hint has measurable impact on cold-cache workloads (data not in OS page
cache). On warm cache (mmap pages already faulted in), madvise is accepted
but the kernel takes no additional action — no performance regression.
Lazy Column Decode (Optimization B)
What it is
decode_row_masked(bytes, schema, mask) — a variant of decode_row that accepts
a boolean mask. When mask[i] == false, the column’s wire bytes are skipped
(cursor advanced, no allocation) and Value::Null is placed in the output slot.
Inspired by PostgreSQL’s selective column access in the executor.
API
#![allow(unused)]
fn main() {
pub fn decode_row_masked(
bytes: &[u8],
schema: &[DataType],
mask: &[bool], // mask.len() must equal schema.len()
) -> Result<Vec<Value>, DbError>
}
For skipped columns:
- Fixed-length types (Bool=1B, Int/Date=4B, BigInt/Real/Timestamp=8B, Decimal=17B, Uuid=16B): ensure_bytes is called, then pos advances — no allocation.
- Variable-length types (Text, Bytes): the 3-byte length prefix is read to advance pos by 3 + len — the payload is never copied or parsed.
- NULL columns (bitmap bit set): no wire bytes; the cursor is unchanged regardless of mask.
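The skip logic for a masked column is pure cursor arithmetic. A sketch under the wire format described above, assuming a little-endian 3-byte length prefix (the endianness is not stated in the text); the function name is illustrative:

```rust
// Advance the decode cursor past one column without materializing it.
// `fixed_width` is Some(w) for fixed-length types, None for Text/Bytes.
fn skip_width(fixed_width: Option<usize>, bytes: &[u8], pos: usize) -> usize {
    match fixed_width {
        // Fixed-length column: advance by its known width.
        Some(w) => pos + w,
        // Variable-length column: read the 3-byte length prefix
        // (little-endian assumed), then skip the payload.
        None => {
            let len = bytes[pos] as usize
                | (bytes[pos + 1] as usize) << 8
                | (bytes[pos + 2] as usize) << 16;
            pos + 3 + len
        }
    }
}
```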
Column mask computation
The executor computes the mask via collect_column_refs(expr, mask), which walks
the AST and marks every Expr::Column { col_idx } reference. It does not recurse
into subquery bodies (different row scope).
SELECT * (Wildcard/QualifiedWildcard) always produces None — decode_row()
is used directly with no overhead.
When all mask bits are true, scan_table also uses decode_row() directly.
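The AST walk behind collect_column_refs can be sketched with a toy expression type; Expr here is a hypothetical three-variant enum, not AxiomDB's real parser AST:

```rust
// Toy AST: just enough shape to show the mask-marking walk.
enum Expr {
    Column { col_idx: usize },
    BinaryOp { left: Box<Expr>, right: Box<Expr> },
    Literal,
}

// Mark every column referenced by `expr`; does not descend into
// subqueries in the real implementation (different row scope).
fn collect_column_refs(expr: &Expr, mask: &mut [bool]) {
    match expr {
        Expr::Column { col_idx } => mask[*col_idx] = true,
        Expr::BinaryOp { left, right } => {
            collect_column_refs(left, mask);
            collect_column_refs(right, mask);
        }
        Expr::Literal => {}
    }
}
```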
Where it applies
- execute_select_ctx (single-table SELECT): mask covers the SELECT list + WHERE + ORDER BY + GROUP BY + HAVING
- execute_delete_ctx (DELETE with WHERE): mask covers the WHERE clause only (the no-WHERE path uses scan_rids_visible — no decode at all)
Clustered Leaf Page-Buffer Mutation Primitives (Phase 39.22)
Three public primitives in crates/axiomdb-storage/src/clustered_leaf.rs enable
zero-allocation in-place UPDATE for fixed-size columns.
cell_row_data_abs_off
#![allow(unused)]
fn main() {
pub fn cell_row_data_abs_off(page: &Page, cell_idx: usize) -> Result<(usize, usize), DbError>
}
Computes the absolute byte offset of row_data within the page buffer for a
given cell index without decoding the cell. Returns (row_data_abs_off, key_len).
Formula:
row_data_abs_off = HEADER_SIZE + body_off + CELL_META_SIZE + ROW_HEADER_SIZE + key_len
Used by the UPDATE fast path to locate field bytes directly in the page buffer —
no cell.row_data.to_vec() required.
patch_field_in_place
#![allow(unused)]
fn main() {
pub fn patch_field_in_place(page: &mut Page, field_abs_off: usize, new_bytes: &[u8]) -> Result<(), DbError>
}
Overwrites new_bytes.len() bytes at field_abs_off within the page buffer.
Validates that field_abs_off + new_bytes.len() <= PAGE_SIZE. This is the
AxiomDB equivalent of InnoDB’s btr_cur_upd_rec_in_place().
InnoDB’s btr_cur_upd_rec_in_place() writes only the changed bytes within the B-tree page buffer; AxiomDB implements the same technique as a pure-Rust, zero-unsafe byte-write primitive. For UPDATE t SET score = score + 1 on a 25K-row clustered table, this reduces per-row work from ~469 bytes (full decode + encode + heap alloc) to ~28 bytes (read field + write field), cutting allocations from five per row to zero.
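The bounds check and overwrite reduce to a few lines. A sketch over a raw page-sized buffer rather than a real Page (the function name and error type are illustrative):

```rust
const PAGE_SIZE: usize = 16 * 1024;

// Bounds-checked in-place byte overwrite, mirroring patch_field_in_place:
// validate field_abs_off + new_bytes.len() <= PAGE_SIZE, then copy.
fn patch_in_place(
    buf: &mut [u8; PAGE_SIZE],
    field_abs_off: usize,
    new_bytes: &[u8],
) -> Result<(), &'static str> {
    if field_abs_off + new_bytes.len() > PAGE_SIZE {
        return Err("patch out of page bounds");
    }
    buf[field_abs_off..field_abs_off + new_bytes.len()].copy_from_slice(new_bytes);
    Ok(())
}
```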
update_row_header_in_place
#![allow(unused)]
fn main() {
pub fn update_row_header_in_place(page: &mut Page, cell_idx: usize, new_header: &RowHeader) -> Result<(), DbError>
}
Overwrites the 24-byte RowHeader at the exact page offset for a given cell.
Used after patch_field_in_place to stamp the new txn_id_created and
incremented row_version without re-encoding the full cell.
Split-Phase Pattern (Rust Borrow Checker Compatibility)
The UPDATE fast path uses a split-phase read/write pattern to satisfy the Rust borrow checker — the immutable page borrow (read phase) must be fully dropped before the mutable borrow (write phase) begins:
#![allow(unused)]
fn main() {
// Excerpt from the per-row loop of the UPDATE fast path; `continue`
// advances to the next row.
// Read phase: immutable borrow — compute field locations, capture old bytes
let (row_data_abs_off, _) = cell_row_data_abs_off(&page, idx)?;
let (field_writes, any_change) = {
let b = page.as_bytes();
// ... compute loc, capture old_buf: [u8;8], encode new_buf: [u8;8]
// MAYBE_NOP: if old_buf[..loc.size] == new_buf[..loc.size] { skip }
(field_writes_vec, changed)
}; // immutable borrow dropped here
if !any_change { continue; }
// Write phase: mutable borrow — patch page buffer directly
for (field_abs, size, _, new_buf) in &field_writes {
patch_field_in_place(&mut page, *field_abs, &new_buf[..*size])?;
}
update_row_header_in_place(&mut page, idx, &new_header)?;
}
The split-phase pattern avoids row_data.to_vec() (one heap allocation per row) by keeping field locations in a small Vec<(usize, usize, [u8;8], [u8;8])> of fixed-size entries computed during the immutable phase and consumed during the mutable phase. This is the same invariant InnoDB enforces manually with pointer arithmetic.