B+ Tree — Hybrid Write Model

AxiomDB’s indexing layer is a persistent B+ Tree implemented over the StorageEngine trait. Every index — including primary key and unique constraint indexes — is one such tree.

Write model (Phase 5)

The tree uses a hybrid write model that minimizes page I/O on the hot path while keeping structural operations (splits, merges, rotations) on the safe allocate-new path:

Operation                                            Write path                                               Alloc/free
───────────────────────────────────────────────────  ───────────────────────────────────────────────────────  ─────────────────────────────────
Insert, no leaf split                                In-place: same leaf page ID                              0 alloc / 0 free
Insert, child split absorbed by non-full parent      In-place: same parent page ID                            0 alloc / 0 free for the parent
Insert, leaf or internal split                       Structural: alloc 2 new pages, free 1                    2 alloc / 1 free
Delete, leaf stays ≥ MIN_KEYS_LEAF                   In-place: same leaf page ID                              0 alloc / 0 free
Delete, parent pointer unchanged after child delete  Skip parent rewrite entirely                             0 alloc / 0 free for the parent
Delete, leaf underflows → rebalance                  Structural: alloc new leaf                               1 alloc / 1 free
Batch delete, sorted exact keys                      Page-local merge delete + one parent normalization pass  0 alloc / 0 free on non-underfull
                                                                                                              pages; structural only where
                                                                                                              underflow happens

This is the Phase 5 model for a serialized single writer (&mut self). Phase 7 will reintroduce the full Copy-on-Write path to reconcile with lock-free readers and epoch reclamation.

⚙️
Hybrid, Not Pure CoW The original Phase 2 design was fully Copy-on-Write — every write allocated a new page and freed the old one. Phase 5.17 introduces in-place writes for non-structural operations because the Phase 5 runtime is single-writer (&mut self on all mutations). Lock-free readers and epoch-based reclamation (Phase 7) will determine how much of the in-place model can be retained under concurrent read traffic.

Batch delete (delete_many_in) — sorted single-pass

Phase 5.19 adds a second delete mode to the tree:

BTree::delete_many_in(storage, &root_pid, &sorted_keys)

The contract is deliberately narrow:

  • the caller already knows the exact encoded keys to delete
  • keys are already sorted ascending
  • the tree does no predicate evaluation and no SQL-layer reasoning

The algorithm is page-local and ordered:

  1. Leaf pages: merge the leaf’s sorted key array with the sorted delete slice and write one compacted survivor image.
  2. Internal pages: partition the delete slice by child range, recurse once per affected child, then normalize the parent once.
  3. Root collapse: run once at the very end of the batch.

This avoids the old N × delete_in(...) pattern where every key started from the root and independently decided whether to rewrite or rebalance the same pages.
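
The internal-node partition step (step 2 above) can be sketched as follows. `partition_by_separators` is an illustrative name, not the actual AxiomDB helper; the child-ownership rule mirrors the tree's traversal contract, where a key equal to a separator belongs to the child right of that separator:

```rust
/// Split a sorted delete slice into per-child sub-slices using the
/// internal node's sorted separator keys. Child i receives the keys
/// strictly less than separator i; the rightmost child gets the tail.
fn partition_by_separators<'a>(
    deletes: &'a [Vec<u8>],
    separators: &[Vec<u8>],
) -> Vec<&'a [Vec<u8>]> {
    let mut ranges = Vec::with_capacity(separators.len() + 1);
    let mut start = 0;
    for sep in separators {
        // first index in the remaining slice whose key is >= sep
        let end = start + deletes[start..].partition_point(|k| k < sep);
        ranges.push(&deletes[start..end]);
        start = end;
    }
    ranges.push(&deletes[start..]); // rightmost child gets the tail
    ranges
}
```

Each child is then visited at most once per batch, which is exactly what makes the single left-to-right pass cheaper than N independent root-to-leaf deletes.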

⚙️
Design Decision — Ordered Bulk Path The exact-key batch path borrows from PostgreSQL's nbtree bulk-deletion mindset and InnoDB's bulk helpers: once the caller has the full sorted delete set, one ordered walk through the touched pages is safer and cheaper than reusing the point-delete API in a loop.

Page Capacity — Deriving ORDER_INTERNAL and ORDER_LEAF

Both node types must fit within PAGE_BODY_SIZE = 16,320 bytes (16 KB minus the 64-byte header). Each key occupies at most MAX_KEY_LEN = 64 bytes (zero-padded on disk).

Internal Node Capacity

An internal node with n separator keys has n + 1 child pointers.

Header:    1 (is_leaf) + 1 (_pad) + 2 (num_keys) + 4 (_pad)   =   8 bytes
key_lens:  n × 1                                                =   n bytes
children:  (n + 1) × 8                                         = 8n + 8 bytes
keys:      n × 64                                               = 64n bytes

Total = 8 + n + (8n + 8) + 64n = 16 + 73n

Solving 16 + 73n ≤ 16,320:

73n ≤ 16,304
  n ≤ 223.3

ORDER_INTERNAL = 223 (largest integer satisfying the constraint).

Total size: 16 + 73 × 223 = 16 + 16,279 = 16,295 bytes ≤ 16,320 ✓

⚙️
Design Decision — Why 16 KB Pages PAGE_SIZE = 16 KB was chosen to maximize B+ Tree fanout. With 4 KB pages (SQLite's default), ORDER_INTERNAL would be ~54 — a billion-row point lookup would then need roughly 6 tree levels instead of 4, meaning more page reads per lookup. At ORDER_INTERNAL = 223, a billion-row table fits in a 4-level tree requiring only 4 page reads for a point lookup.

Leaf Node Capacity

A leaf node with n entries stores n keys and n record IDs. A RecordId is 10 bytes: page_id (u64, 8 bytes) + slot_id (u16, 2 bytes).

Header:    1 (is_leaf) + 1 (_pad) + 2 (num_keys) + 4 (_pad) + 8 (next_leaf) = 16 bytes
key_lens:  n × 1                                                              =  n bytes
rids:      n × 10                                                             = 10n bytes
keys:      n × 64                                                             = 64n bytes

Total = 16 + n + 10n + 64n = 16 + 75n

Solving 16 + 75n ≤ 16,320:

75n ≤ 16,304
  n ≤ 217.4

ORDER_LEAF = 217 (largest integer satisfying the constraint).

Total size: 16 + 75 × 217 = 16 + 16,275 = 16,291 bytes ≤ 16,320 ✓
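
Both derivations reduce to the same shape: subtract the fixed header cost from PAGE_BODY_SIZE, then divide by the per-entry cost. A sketch recomputing the two orders (constant names mirror the document, not the actual source):

```rust
const PAGE_BODY_SIZE: usize = 16_320; // 16 KB page minus 64-byte header
const MAX_KEY_LEN: usize = 64;

// Internal: 8-byte header + 8 bytes for the (n+1)-th child pointer are
// fixed; each key then costs 1 (key_len) + 8 (child) + 64 (key) = 73 bytes.
const ORDER_INTERNAL: usize = (PAGE_BODY_SIZE - 16) / (1 + 8 + MAX_KEY_LEN);

// Leaf: 16-byte header is fixed; each entry costs
// 1 (key_len) + 10 (RecordId) + 64 (key) = 75 bytes.
const ORDER_LEAF: usize = (PAGE_BODY_SIZE - 16) / (1 + 10 + MAX_KEY_LEN);
```

Integer division here is the "largest n satisfying the constraint" from the derivations above, yielding 223 and 217 respectively.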


On-Disk Page Layout

Both node types use #[repr(C)] structs with all-u8-array fields so that bytemuck::Pod (zero-copy cast) is safe without any implicit padding. All multi-byte fields are stored little-endian.

Internal Node (InternalNodePage)

Offset   Size   Field       Description
──────── ────── ─────────── ─────────────────────────────────────────────
       0      1  is_leaf     always 0
       1      1  _pad0       alignment
       2      2  num_keys    number of separator keys (u16 LE)
       4      4  _pad1       alignment
       8    223  key_lens    actual byte length of each key (0 = empty slot)
     231  1,792  children    224 × [u8;8] — child page IDs (u64 LE each)
   2,023 14,272  keys        223 × [u8;64] — separator keys, zero-padded
──────── ────── ─────────── ──────────────────────────────
Total:  16,295 bytes ≤ PAGE_BODY_SIZE ✓

This fixed-layout page is still the format used by the current production axiomdb-index::BTree. Phase 39 does not mutate this structure in place. Instead, the clustered rewrite is introducing separate storage-layer page primitives for clustered leaves and clustered internal nodes.

Clustered Internal Primitive (Phase 39.2)

The new clustered internal page lives in axiomdb-storage, not in the current axiomdb-index tree code. It uses a slotted variable-size layout:

[ClusteredInternalHeader: 16B]
  is_leaf = 0
  num_cells
  cell_content_start
  freeblock_offset
  leftmost_child
[CellPtr array]
[Free gap]
[Cells: right_child | key_len | key_bytes]

The important compatibility rule is semantic, not structural:

  • separator keys stay sorted
  • find_child_idx(search_key) still returns the first separator strictly greater than the search key
  • logical child 0 comes from leftmost_child
  • logical child i > 0 comes from separator cell i - 1

That lets the clustered storage rewrite preserve B-tree navigation behavior without reusing the old fixed-size MAX_KEY_LEN = 64 layout.
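
The traversal rule above can be sketched in a few lines. This is an illustrative stand-in for the real `find_child_idx`, operating on an already-decoded sorted separator list:

```rust
/// Return the logical child index for a search key: the index of the
/// first separator strictly greater than the key. Index 0 maps to
/// `leftmost_child`; index i > 0 maps to separator cell i - 1's
/// right_child.
fn find_child_idx(separators: &[&[u8]], search_key: &[u8]) -> usize {
    // partition_point returns the count of separators <= search_key,
    // i.e. the position of the first separator strictly greater.
    separators.partition_point(|sep| *sep <= search_key)
}
```

Note that a key exactly equal to a separator descends to the right of that separator, which is what keeps point lookups and the old fixed-layout tree's navigation in agreement.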

⚙️
Design Decision — Preserve Traversal Contract The clustered rewrite rejects the simpler "just keep a second variable-size child array" approach. Encoding `leftmost_child` in the header and `right_child` in each separator cell keeps page-local mutation simple while preserving the exact traversal semantics the current tree already depends on.

Clustered Insert Controller (Phase 39.3)

Phase 39.3 does not retrofit the current axiomdb-index::BTree into a generic tree over fixed and clustered pages. Instead, axiomdb-storage contains a dedicated controller in clustered_tree.rs that proves the first full write path for clustered pages while the SQL executor still uses the classic heap + index engine.

Algorithm shape:

  1. insert(storage, root_opt, ...) bootstraps a clustered leaf root if needed.
  2. Recursive descent chooses child pointers from ClusteredInternal.
  3. Leaf inserts stay in-place when the physical clustered-row descriptor fits.
  4. Large logical rows use local-prefix + overflow-page descriptors instead of an inline-only reject path.
  5. Fragmented leaves/internal pages call defragment() once before split.
  6. Leaf splits rebuild left/right pages by cumulative cell footprint.
  7. Internal splits rebuild left/right separator sets and promote one separator.
  8. Root overflow creates a fresh ClusteredInternal root.

Unlike the old structural Copy-on-Write tree, clustered 39.3 keeps the old page ID as the left half on split and allocates only the new right sibling. That is a conscious storage-first choice for the current single-writer runtime, not the final concurrency model.

⚙️
Design Decision — Stable Left Page ID The simpler “allocate two fresh pages on every clustered split” option would mimic the old CoW tree, but it would also force needless parent pointer churn. AxiomDB keeps the left page stable and allocates only the new right sibling until Phase 39 reaches WAL, recovery, and executor integration.

Clustered Point Lookup Controller (Phase 39.4)

Phase 39.4 extends that dedicated clustered controller with exact point lookup:

  1. descend internal pages by separator key
  2. search the target leaf by exact key
  3. reconstruct the full logical row from the local leaf payload plus any overflow-page tail
  4. filter the hit through RowHeader::is_visible(snapshot)

The important scope cut is semantic rather than structural: the controller can read the current inline row version, but it cannot yet chase older versions because clustered older-version reconstruction is still future work. 39.11 adds rollback/savepoint restore for clustered writes, but not undo-aware read traversal for arbitrary snapshots.

That means the current lookup(...) contract is:

  • visible hit → Some(ClusteredRow)
  • key absent → None
  • current inline version invisible → None

This is a deliberate intermediate contract for the storage rewrite, not the final SQL-visible clustered read semantics.
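
The three-way contract can be sketched over an in-memory stand-in. `ClusteredRow` and the boolean `visible` flag here are simplifications of the real leaf payload and `RowHeader::is_visible(snapshot)`:

```rust
struct ClusteredRow {
    key: Vec<u8>,
    visible: bool, // stand-in for RowHeader::is_visible(snapshot)
}

/// The 39.4 lookup contract: Some only for a visible exact hit.
fn lookup<'a>(leaf: &'a [ClusteredRow], key: &[u8]) -> Option<&'a ClusteredRow> {
    match leaf.iter().find(|r| r.key.as_slice() == key) {
        Some(row) if row.visible => Some(row), // visible hit -> Some(ClusteredRow)
        Some(_) => None,                       // current inline version invisible -> None
        None => None,                          // key absent -> None
    }
}
```

The two None outcomes are indistinguishable to the caller by design: until undo-aware version chains exist, an invisible current version and a missing key must look the same.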

⚙️
Design Decision — Exact Tree Search, No Leaf Walk SQLite-style leaf chains are useful for range scans, but point lookup still belongs on the tree path. AxiomDB keeps clustered point reads as true root-to-leaf binary search instead of falling back to a leaf-chain probe that would blur the boundary with Phase 39.5.

Clustered Range Scan Controller (Phase 39.5)

Phase 39.5 adds the first ordered scan controller on top of the clustered pages:

  1. determine the first relevant leaf for the lower bound
  2. determine the first relevant slot inside that leaf
  3. reconstruct and yield visible logical rows in ascending primary-key order
  4. follow next_leaf across the leaf chain
  5. stop as soon as the upper bound is exceeded

The controller is intentionally separate from the old fixed-layout axiomdb-index::RangeIter. The two trees now have different physical layouts and different row payload semantics:

  • classic B+ Tree leaf: (key, RecordId)
  • clustered leaf: (key, RowHeader, total_row_len, local_prefix, overflow_ptr?)

So the right reuse point is the iterator shape, not the implementation.
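
The "seek once, then iterate" shape of steps 1–5 can be sketched over an in-memory leaf chain. `Leaf` and the index-based `next_leaf` are stand-ins; the real controller reads clustered pages through the storage engine:

```rust
struct Leaf {
    keys: Vec<Vec<u8>>,
    next_leaf: Option<usize>, // stand-in for the next_leaf page ID
}

/// Yield all keys in [lo, hi] in ascending order, starting from the
/// first relevant leaf and following the leaf chain.
fn range_scan<'a>(leaves: &'a [Leaf], first_leaf: usize, lo: &[u8], hi: &[u8]) -> Vec<&'a [u8]> {
    let mut out = Vec::new();
    let mut cur = Some(first_leaf);
    while let Some(li) = cur {
        let leaf = &leaves[li];
        // seek the first relevant slot for the lower bound
        let start = leaf.keys.partition_point(|k| k.as_slice() < lo);
        for k in &leaf.keys[start..] {
            if k.as_slice() > hi {
                return out; // stop as soon as the upper bound is exceeded
            }
            out.push(k.as_slice()); // yield in ascending key order
        }
        cur = leaf.next_leaf; // follow the leaf chain
    }
    out
}
```

In the real clustered iterator each yielded entry is of course a reconstructed visible logical row, not a bare key.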

⚙️
Design Decision — Borrow Cursor Shape, Not Code SQLite's `sqlite3BtreeFirst()` / `sqlite3BtreeNext()` and MariaDB's `read_range_first()` / `read_range_next()` both use a “seek once, then iterate” cursor model. AxiomDB borrows that control flow, but keeps a dedicated clustered iterator instead of forcing the fixed-layout B+ Tree range code to pretend clustered leaves still carry `RecordId`s.

The semantic boundary remains the same as in 39.4: the range iterator can return or skip only the current inline version. Undo-aware older-version reconstruction is still future work.

Clustered Update Controller (Phase 39.6)

Phase 39.6 adds the first mutation path that rewrites an existing clustered row in place:

  1. descend to the owning leaf by exact primary key
  2. check visibility of the current inline version
  3. build the replacement inline header with a bumped row_version
  4. materialize either an inline or overflow-backed replacement descriptor
  5. rewrite the exact leaf cell without changing the key
  6. keep the row in the same leaf or fail explicitly

This controller is intentionally narrower than a full B-tree update:

  • it does not move the row to another leaf
  • it does not split or merge the tree
  • it does not touch parent separators
  • it does not maintain secondary indexes yet

That makes 39.6 a true clustered-storage step, not a disguised merge of 39.6, 39.7, 39.8, and 39.9.

⚙️
Design Decision — Update Is Not Delete Plus Insert The easiest generic implementation would be delete+insert through the whole tree, but that would blur clustered update, delete, split/merge, and secondary-index bookmark work into one phase. AxiomDB keeps 39.6 strict: same-leaf rewrite only, with explicit failure when structural relocation would be needed.

The page-local rewrite itself has two modes:

  • overwrite the existing cell directly when the replacement encoded payload fits the current cell budget
  • rebuild the same leaf compactly when the row grows but still fits on that page

If neither is possible, the controller returns HeapPageFull.

Clustered Delete Controller (Phase 39.7)

Phase 39.7 adds the first logical delete path over clustered rows:

  1. descend to the owning leaf by exact primary key
  2. check visibility of the current inline version
  3. preserve the existing key, row payload, txn_id_created, and row_version
  4. stamp txn_id_deleted = delete_txn_id
  5. persist the same leaf page without structural tree change

This controller is intentionally narrower than a full B-tree delete:

  • it does not remove the physical cell
  • it does not merge or rebalance leaves
  • it does not change parent separators
  • it does not maintain secondary indexes yet

That makes 39.7 the logical-delete companion to 39.6, not a disguised merge of 39.7, 39.8, 39.11, and 39.18.

⚙️
Design Decision — Delete Is Header State First InnoDB's clustered records are delete-marked before purge, and PostgreSQL tuples stay on page until vacuum cleanup. AxiomDB adopts the same sequencing in 39.7: logical delete mutates row state now, while structural cleanup waits for later phases that can prove no snapshot still needs the row.

Clustered Structural Controller (Phase 39.8)

Phase 39.8 adds the first controller that can structurally shrink and rebalance the clustered tree:

  1. call update_in_place(...) as the fast path
  2. on HeapPageFull, load the visible current row
  3. physically delete the exact clustered cell through a private tree path
  4. propagate two signals upward:
    • underfull — the child now needs sibling redistribute/merge
    • min_changed — the child’s minimum key changed and the parent separator must be repaired
  5. rebalance clustered leaf and internal siblings by encoded byte volume
  6. collapse an empty internal root
  7. reinsert the replacement row through the clustered insert controller

That makes 39.8 the structural companion to 39.6 and 39.7, not a shortcut around later purge / undo / secondary-index phases.

⚙️
Design Decision — No Fixed MIN_KEYS The old `axiomdb-index::BTree` can rebalance with fixed `MIN_KEYS_*` thresholds because every slot costs the same. Clustered pages cannot: 39.8 rebalances variable-size siblings by encoded byte volume, following SQLite-style occupancy logic instead of fixed-key-count heuristics.
⚙️
Design Decision — Keep Delete Mark Separate MariaDB InnoDB separates delete-mark from later purge/compression work. AxiomDB keeps that same split in 39.8: physical delete exists only as a private helper for relocate-update, while public clustered delete remains logical until snapshot-safe purge exists.

Current 39.8 limits remain explicit:

  • relocate-update still rewrites only the current inline version
  • public delete still does not purge dead clustered cells
  • parent separator repair does not yet split the parent if the repaired key itself overflows the page budget

Clustered Secondary Bookmarks (Phase 39.9)

Phase 39.9 adds a dedicated bookmark-bearing secondary-key layout in axiomdb-sql::clustered_secondary.

Instead of treating the BTree payload RecordId as the row locator, the clustered path now encodes the physical secondary key as:

secondary_logical_key ++ missing_primary_key_columns

That gives the future clustered executor the exact secondary -> primary key bridge it needs:

  1. scan the secondary B-tree by logical key prefix
  2. decode the appended PK bookmark from the physical secondary key
  3. probe the clustered tree by that primary key
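
A minimal sketch of the key encoding and of step 2 of the bridge. The function names are illustrative, and the split-by-known-length decode is a simplification: the real physical key format presumably carries enough metadata to locate the PK suffix without the caller supplying the logical key length:

```rust
/// Physical secondary key = secondary_logical_key ++ missing PK columns.
fn encode_secondary_physical_key(secondary_logical_key: &[u8], pk_suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(secondary_logical_key.len() + pk_suffix.len());
    key.extend_from_slice(secondary_logical_key);
    key.extend_from_slice(pk_suffix); // PK bookmark for the clustered probe
    key
}

/// Step 2 of the bridge: recover the appended PK bookmark, here under
/// the simplifying assumption that the logical key length is known.
fn decode_pk_bookmark(physical_key: &[u8], logical_len: usize) -> &[u8] {
    &physical_key[logical_len..]
}
```
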
⚙️
Design Decision — Do Not Trust RID Payload MySQL InnoDB and SQLite `WITHOUT ROWID` both treat the table key, not a heap slot, as the durable secondary bookmark. AxiomDB now follows that rule in the clustered path: the old fixed-size `RecordId` payload is kept only because the existing B-tree page format still requires it, not because clustered identity depends on it.

This subphase is intentionally narrower than full executor integration:

  • heap-visible SQL still uses RecordId-based secondaries
  • clustered bookmark scans are a dedicated path, not a replacement for the old planner/executor yet
  • unique clustered secondaries check logical-key conflicts before insert, even though the physical key contains a PK suffix for stable row identity

Clustered Overflow Pages (Phase 39.10)

Phase 39.10 adds the first large-row storage layer for clustered leaves.

The physical clustered leaf cell is now:

[key_len: u16]
[total_row_len: u32]
[RowHeader: 24B]
[key bytes]
[local row prefix]
[overflow_first_page?: u64]

And the overflow-page chain is:

[next_overflow_page: u64]
[payload bytes]

Important invariant:

  • split / merge / rebalance reason about the physical leaf footprint
  • lookup / range reconstruct the logical row bytes only when returning rows
  • the primary key and RowHeader never leave the clustered leaf page
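
Logical-row reconstruction over that chain can be sketched as a simple concatenation walk. `OverflowPage` and the index-based chain are in-memory stand-ins for pages read through the storage engine:

```rust
struct OverflowPage {
    next_overflow_page: Option<usize>, // stand-in for the u64 page ID
    payload: Vec<u8>,
}

/// Rebuild the logical row bytes: local prefix from the leaf cell,
/// then every chained overflow payload in order.
fn reconstruct_row(local_prefix: &[u8], first: Option<usize>, pages: &[OverflowPage]) -> Vec<u8> {
    let mut row = local_prefix.to_vec();
    let mut cur = first;
    while let Some(idx) = cur {
        row.extend_from_slice(&pages[idx].payload);
        cur = pages[idx].next_overflow_page;
    }
    row
}
```

Only lookup and range scan pay this reconstruction cost; split, merge, and rebalance never need the logical bytes, which is exactly the invariant stated above.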
⚙️
Design Decision — Keep MVCC Inline PostgreSQL TOAST and InnoDB off-page columns both move large values away from the main record, but AxiomDB cannot move the clustered `RowHeader` off-page in 39.10 because Phase 39 still performs MVCC visibility checks directly at the leaf. The adaptation is narrower on purpose: spill only the row tail, never the key or header.

This is still narrower than full large-value support:

  • no generic TOAST/BLOB reference layer yet
  • no compression yet
  • 39.11 adds internal WAL / rollback for clustered row images
  • 39.12 adds clustered crash recovery for those row images
  • delete-mark keeps the overflow chain reachable until later purge

Clustered WAL and Recovery (Phases 39.11 / 39.12)

Phases 39.11 and 39.12 add the first clustered durability contract on top of the new page formats:

  1. clustered inserts append EntryType::ClusteredInsert
  2. clustered delete-marks append EntryType::ClusteredDeleteMark
  3. clustered updates append EntryType::ClusteredUpdate
  4. each WAL key is the primary key, not a physical slot identifier
  5. each payload stores an exact ClusteredRowImage
  6. TxnManager tracks the latest clustered root per table_id
  7. rollback and crash recovery restore logical row state by primary key and exact row image

This controller is still intentionally narrower than a full topology-physical recovery story. 39.12 closes clustered crash recovery by reusing the same PK + row-image semantics, while exact root persistence beyond WAL checkpoint/rotation remains future work.

⚙️
Design Decision — B-Tree WAL Is Logical First PostgreSQL's B-tree WAL is page-topology-oriented because page identity is the recovery primitive. AxiomDB rejects that as the first clustered cut: once slotted clustered pages can defragment and relocate rows, the stable identity is the primary key and exact row image, not the old leaf slot.

Rollback therefore promises logical row restoration, not exact physical topology restoration. A relocate-update may leave a different split/merge shape after rollback as long as the old primary-key row is back.

SQL-Visible Clustered INSERT (Phase 39.14)

Phase 39.14 is the first point where the SQL executor writes into the clustered tree instead of only the storage tests doing so.

The executor branch now does this:

  1. resolve the clustered table plus its logical primary index metadata
  2. derive PK bytes from that primary-index column order
  3. check for a visible existing PK through clustered lookup
  4. insert the new row through clustered_tree::insert(...), or reuse a snapshot-invisible delete-marked physical key through restore_exact_row_image(...)
  5. maintain non-primary indexes as secondary_key ++ pk_suffix bookmarks
  6. persist the final clustered table root and any changed secondary roots

Fresh clustered keys emit EntryType::ClusteredInsert. Reused delete-marked physical keys emit EntryType::ClusteredUpdate so rollback can restore the old tombstone image instead of merely deleting the newly-inserted row.

⚙️
Design Decision — Reuse Tombstone Via Update Undo InnoDB can reuse a delete-marked clustered record because undo restores the older image. AxiomDB adopts that clustered-key reuse rule in `39.14`, but logs the reuse as a clustered update rather than a clustered insert so rollback preserves the superseded tombstone image exactly.

This is still narrower than final clustered SQL behavior:

  • clustered UPDATE is now SQL-visible in 39.16
  • clustered DELETE is now SQL-visible as logical delete-mark in 39.17
  • clustered secondary covering reads still normalize back to clustered row fetches until a true clustered index-only optimization exists
  • older-snapshot reconstruction after reusing a tombstoned PK still depends on later clustered version-chain work

Leaf Node (LeafNodePage)

Offset   Size   Field       Description
──────── ────── ─────────── ─────────────────────────────────────────────
       0      1  is_leaf     always 1
       1      1  _pad0       alignment
       2      2  num_keys    number of (key, rid) pairs (u16 LE)
       4      4  _pad1       alignment
       8      8  next_leaf   page_id of the next leaf (u64 LE); u64::MAX = no next
      16    217  key_lens    actual byte length of each key
     233  2,170  rids        217 × [u8;10] — RecordId (page_id:8 + slot_id:2)
   2,403 13,888  keys        217 × [u8;64] — keys, zero-padded
──────── ────── ─────────── ──────────────────────────────
Total:  16,291 bytes ≤ PAGE_BODY_SIZE ✓

Copy-on-Write Root Swap

The root page ID is stored in an AtomicU64. Writers and readers interact with it as follows.

Reader Path

// Acquire load: guaranteed to see all writes that happened before
// the Release store that set this root.
let root_id = self.root.load(Ordering::Acquire);
let root_page = storage.read_page(root_id)?;
// traverse down — no locks acquired

Writer Path

// 1. Load the current root
let old_root_id = self.root.load(Ordering::Acquire);

// 2. Walk from old_root down to the target leaf, collecting the path
let path = find_path(&storage, old_root_id, key)?;

// 3. For each node on the path (leaf first, then up to root):
//    a. alloc_page → new_page_id
//    b. copy content from old page
//    c. apply the mutation (insert key/split/rebalance)
//    d. update the parent's child pointer to new_page_id

// 4. The new root was written as a new page
let new_root_id = path[0].new_page_id;

// 5. Atomic swap — Release store: all prior writes visible to Acquire loads
self.root.store(new_root_id, Ordering::Release);

// 6. Free the old path pages (only safe after all readers have moved on)
for old_id in old_page_ids { storage.free_page(old_id)?; }

A reader that loaded old_root_id before the swap continues accessing old pages safely — they are freed only after all reads complete (tracked in Phase 7 with epoch-based reclamation).

🚀
Lock-Free Reads Readers load the root pointer with Acquire semantics and traverse the tree without acquiring any lock. A write in progress is invisible to readers until the Release store completes — at which point the entire new subtree is already consistent. This is what allows read throughput to scale linearly with core count.

Why next_leaf Is Not Used in Range Scans

The leaf node format includes a next_leaf pointer for a traditional linked-list traversal across leaf nodes. However, this pointer is not used by RangeIter.

Reason: Under CoW, when a leaf is split or modified, a new page is created. The previous leaf in the linked list still points to the old page (L_old), which has already been freed. Keeping the linked list consistent under CoW would require copying the previous leaf on every split — but finding the previous leaf during an insert requires traversing from the root (the tree has no backward pointers).

Adopted solution: RangeIter re-traverses the tree from the root to find the next leaf when crossing a leaf boundary. The cost is O(log n) per boundary crossing, not O(1) as with a linked list. For a tree of 1 billion rows with ORDER_LEAF = 217, the depth is log₂₁₇(10⁹) ≈ 4, so each boundary crossing is 4 page reads. Measured cost for a range scan of 10,000 rows: 0.61 ms — well within the 45 ms budget.

⚙️
Design Decision — Re-traversal vs. Linked-List Leaf Scan The next_leaf pointer exists on-disk but RangeIter does not use it. Under CoW, keeping a consistent linked list would require copying the previous leaf on every split — which itself requires finding that leaf from the root. Re-traversal costs O(log n) per leaf boundary (4 reads at 1B rows) and is simpler to reason about correctly.

Insert — CoW Split Protocol

1. Descend from root to the target leaf, recording the path.

2. If the leaf has room (num_keys < fill_threshold):
   → Copy the leaf to a new page.
   → Insert the new (key, rid) in sorted position.
   → Update the parent's child pointer (CoW the parent too).
   → Propagate CoW up to the root.

3. If the leaf is at or above the fill threshold:
   → Allocate two new leaf pages.
   → Distribute: left gets floor((ORDER_LEAF+1)/2) entries,
                 right gets the remaining entries.
   → The smallest key of the right leaf becomes the separator key
     pushed up to the parent.
   → CoW the parent, insert the new separator and child pointer.
   → If the parent is also full, recursively split upward.
   → If the root splits, allocate a new root with two children.

The split point fill_threshold depends on the index fill factor (see below). Internal pages always split at ORDER_INTERNAL regardless of fill factor.


Fill Factor — Adaptive Leaf Splits

The fill factor controls how full leaf pages are allowed to get before splitting. It is set per-index via WITH (fillfactor=N) on CREATE INDEX and stored in IndexDef.fillfactor: u8.

Formula

fill_threshold(order, ff) = ⌈order × ff / 100⌉   (integer ceiling division)
fillfactor        fill_threshold (ORDER_LEAF = 217)  Effect
────────────────  ─────────────────────────────────  ─────────────────────────────────────────────
100 (compact)     217                                Splits only when completely full — max density, slowest inserts on busy pages
90 (default)      196                                Leaves ~10% free — balances density and insert speed
70 (write-heavy)  152                                Leaves ~30% free — fewer splits for append-heavy workloads
10 (minimum)      22                                 Very sparse pages — extreme fragmentation, rarely useful

A compile-time assert verifies that fill_threshold(ORDER_LEAF, 100) == ORDER_LEAF, ensuring fillfactor=100 always preserves the original behavior exactly.
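
The formula is plain integer ceiling division. A minimal sketch (the helper name follows the document's formula; not the actual source):

```rust
/// fill_threshold(order, ff) = ceil(order * ff / 100),
/// written as integer arithmetic: (order * ff + 99) / 100.
fn fill_threshold(order: usize, fillfactor: u8) -> usize {
    (order * fillfactor as usize + 99) / 100
}
```

The `+ 99` before dividing is what makes `fillfactor = 100` round-trip exactly, which is the property the compile-time assert checks.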

⚙️
Design Decision Internal pages are not affected by fill factor — they always split at ORDER_INTERNAL. Only leaf splits benefit from the extra free space, because inserts always land on leaf pages. Applying fill factor to internal pages would reduce tree fan-out without any benefit for typical insert patterns, matching PostgreSQL's implementation of the same concept.

Catalog field

IndexDef.fillfactor is serialized as a single byte appended after the predicate section in the catalog heap entry. Pre-6.8 index rows are read with a default of 90 (backward-compatible). Valid range: 10–100; values outside this range are rejected at CREATE INDEX parse time with a ParseError.

When to use a lower fill factor

  • Append-heavy tables — rows inserted in bulk after the index is created. A fill factor of 70–80 prevents cascading splits during the bulk load.
  • Write-heavy OLTP — high-frequency single-row inserts that land on the same hot pages. More free space means fewer page splits per second.
  • Read-heavy / archival — use fillfactor=100. Maximum density reduces I/O for full scans at the cost of slower writes.

Minimum Occupancy Invariant

All nodes except the root must remain at least half full after any operation:

  • Internal nodes: num_keys ≥ ORDER_INTERNAL / 2 = 111
  • Leaf nodes: num_keys ≥ ORDER_LEAF / 2 = 108

Violations of this invariant during delete trigger rebalancing (redistribution from a sibling if possible, merge otherwise).

rotate_right key-shift invariant

When rotate_right borrows the last key of the left sibling and inserts it at position 0 of the underfull child (internal node case), all existing keys in the child must be shifted right by one position before inserting the new key.

The shift must move positions 0..cn up to 1..cn+1, implemented with a reverse loop (same pattern as insert_at). Using slice::rotate_right(1) on [..cn] is incorrect: it moves key[cn-1] to position 0 where it is immediately overwritten, leaving position cn with stale data from a previous operation. The stale byte can exceed MAX_KEY_LEN = 64, causing a bounds panic on the next traversal of that node.

// Correct: explicit reverse loop
for i in (0..cn).rev() {
    child.key_lens[i + 1] = child.key_lens[i];
    child.keys[i + 1]     = child.keys[i];
}
child.key_lens[0] = sep_len;
child.keys[0]     = sep_key;

Prefix Compression — In-Memory Only

Internal node keys are often highly redundant. For a tree indexing sequential IDs, consecutive separator keys share long common prefixes. AxiomDB implements CompressedNode as an in-memory representation:

struct CompressedNode {
    prefix: Box<[u8]>,          // longest common prefix of all keys in this node
    suffixes: Vec<Box<[u8]>>,   // remainder of each key after stripping the prefix
}

When an internal node page is read from disk, it is optionally decompressed into a CompressedNode for faster binary search (searching on suffix bytes only). When the node is written back, the full keys are reconstructed. This is a read optimization only — the on-disk format always stores full keys.

The compression ratio depends on key structure. For an 8-byte integer key, there is no prefix to compress. For a 64-byte composite key (category_id || product_name), the category_id prefix is shared across many consecutive keys and is compressed away.
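
The prefix extraction step can be sketched as follows. This is an illustrative helper, not the actual `CompressedNode` constructor:

```rust
/// Longest common prefix of a set of keys: start from the first key
/// and shrink the candidate prefix against each remaining key.
fn longest_common_prefix(keys: &[&[u8]]) -> Vec<u8> {
    let mut prefix = match keys.first() {
        Some(k) => k.to_vec(),
        None => return Vec::new(),
    };
    for k in &keys[1..] {
        let common = prefix
            .iter()
            .zip(k.iter())
            .take_while(|(a, b)| a == b)
            .count();
        prefix.truncate(common);
    }
    prefix
}
```

For the composite-key example above, `category_id` would survive as the shared prefix and each suffix would carry only the `product_name` remainder.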


Tree Height and Fan-Out

Rows           Tree depth       Notes
─────────────  ───────────────  ──────────────────────────────────────────
1–217          1 (root = leaf)  Entire tree is one leaf page
218–47,089     2                One root internal + up to 218 leaves
47K–10.2M      3                Two levels of internals
10.2M–2.22B    4                Covers billion-row tables comfortably
>2.22B         5                Rare; still fast at O(log n) traversal

A tree of 1 billion rows has depth 4 — a point lookup requires reading 4 pages (1 per level). At 16 KB pages, a warm cache point lookup is ~4 memory accesses with no disk I/O.
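
The depth boundaries above follow from repeated multiplication by the per-level multiplier: each range top is the previous one times 217, the same multiplier the table's boundaries use (47,089 = 217², 10.2M ≈ 217³, 2.22B ≈ 217⁴). A sketch reproducing that arithmetic:

```rust
/// Smallest depth whose capacity covers `rows`, assuming `order_leaf`
/// entries at depth 1 and multiplying capacity by `fanout` per added level.
fn min_depth(rows: u64, order_leaf: u64, fanout: u64) -> u32 {
    let mut capacity = order_leaf; // depth 1: root is a leaf
    let mut depth = 1u32;
    while capacity < rows {
        capacity *= fanout;
        depth += 1;
    }
    depth
}
```

This is a conservative model: internal nodes actually fan out to as many as 224 children, so real trees reach a given depth slightly later than the table's boundaries suggest.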


Static API — Shared-Storage Operations (Phase 6.2)

BTree normally owns its Box<dyn StorageEngine>. This is convenient for tests but prevents sharing one MmapStorage between the table heap and multiple indexes. Phase 6.2 adds static functions that accept an external &mut dyn StorageEngine:

// Point lookup — read-only, no ownership needed
BTree::lookup_in(storage: &dyn StorageEngine, root_pid: u64, key: &[u8])
    -> Result<Option<RecordId>, DbError>

// Insert — mutates storage, updates root_pid atomically on root split
BTree::insert_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, key: &[u8], rid: RecordId)
    -> Result<(), DbError>

// Delete — mutates storage, updates root_pid atomically on root collapse
BTree::delete_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, key: &[u8])
    -> Result<bool, DbError>

// Batch delete — removes many pre-sorted keys in one left-to-right pass (5.19)
BTree::delete_many_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, keys: &[Vec<u8>])
    -> Result<(), DbError>

// Range scan — collects all (RecordId, key_bytes) in [lo, hi] into a Vec
BTree::range_in(storage: &dyn StorageEngine, root_pid: u64, lo: Option<&[u8]>, hi: Option<&[u8]>)
    -> Result<Vec<(RecordId, Vec<u8>)>, DbError>

These delegate to the same private helpers as the owned API. The insert_in and delete_in variants use AtomicU64::store(Release) instead of compare_exchange (safe in Phase 6 — single writer).

Batch delete primitive (delete_many_in) — subphase 5.19

delete_many_in accepts a slice of pre-sorted encoded keys and removes all of them from one index in a single left-to-right tree traversal. The caller is responsible for sorting keys ascending before the call; the primitive enforces this as a precondition.

Algorithm:

  1. batch_delete_subtree(root) — dispatches on node type.
  2. Leaf node: binary-search the sorted keys against the leaf’s key array. Remove all matching slots in one pass, compact in-place, write the page once. If the leaf becomes underfull, signal the parent for merge/redistribute.
  3. Internal node: binary-partition the key slice by separator keys so each child subtree receives only the keys that fall within its range. Recurse into each child that has at least one key to remove. After all children return, rewrite the internal node once if any child pid or separator changed; skip the rewrite otherwise.
  4. After the recursive pass, root_pid is updated atomically once via AtomicU64::store(Release).
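
The partition step (step 3) can be sketched as follows. The function and the separator convention (child i receives keys strictly below separators[i], the rightmost child the remainder) are illustrative assumptions, not AxiomDB's actual code:

```rust
// Illustrative sketch of partitioning a pre-sorted delete-key set by an
// internal node's separator keys, so each child sees only its own range.
fn partition<'a>(keys: &'a [Vec<u8>], separators: &[Vec<u8>]) -> Vec<&'a [Vec<u8>]> {
    let mut parts = Vec::with_capacity(separators.len() + 1);
    let mut start = 0;
    for sep in separators {
        // Binary search: first key >= separator bounds this child's slice.
        let end = start + keys[start..].partition_point(|k| k < sep);
        parts.push(&keys[start..end]);
        start = end;
    }
    parts.push(&keys[start..]); // rightmost child gets the remainder
    parts
}

fn main() {
    let keys: Vec<Vec<u8>> = [b"a", b"c", b"m", b"x"].iter().map(|k| k.to_vec()).collect();
    let seps: Vec<Vec<u8>> = [b"h", b"p"].iter().map(|k| k.to_vec()).collect();
    let parts = partition(&keys, &seps);
    assert_eq!(parts[0].len(), 2); // "a", "c" fall below "h"
    assert_eq!(parts[1].len(), 1); // "m" falls in ["h", "p")
    assert_eq!(parts[2].len(), 1); // "x" falls at or above "p"
}
```

Because the keys are sorted, each separator advances a cursor monotonically, which is what makes the whole pass a single left-to-right traversal.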

Invariants preserved:

  • The tree stays balanced (all leaves remain at uniform depth after the pass).
  • In-place fast path from 5.17 is reused: leaf and internal rewrites skip page alloc/free when the node fits in the same page.
  • Root is persisted exactly once per delete_many_in call regardless of how many keys were removed.
🚀
Performance Advantage A `DELETE WHERE` touching N rows previously called `BTree::delete_in` N times, descending from the root on each call — O(N log N) page reads and writes total. `delete_many_in` descends the tree once, partitioning the sorted key set at each internal node, yielding O(N + H·B) work where H is tree height and B is the branching factor. At 5,000 rows this eliminates 5,000 separate root descents per index. InnoDB defers this cost via its change buffer; AxiomDB eliminates it upfront with a single sorted pass — no background merge worker required.
⚙️
Design Decision range_in returns Vec<(RecordId, Vec<u8>)> rather than an iterator to avoid a lifetime conflict: an iterator would hold a borrow of storage for its whole lifetime, clashing with the caller's existing `&mut storage` borrow. Heap reads happen after the range scan completes, so the caller needs full ownership of the results.

Order-Preserving Key Encoding (Phase 6.1b)

Secondary index keys are encoded as byte slices in axiomdb-sql/src/key_encoding.rs such that encode(a) < encode(b) iff a < b under SQL comparison semantics. Each Value variant is prefixed with a 1-byte type tag:

| Type | Tag | Payload | Order property |
|---|---|---|---|
| NULL | 0x00 | none | Sorts before all non-NULL |
| Bool | 0x01 | 1 byte | false < true |
| Int(i32) | 0x02 | 8 BE bytes after n ^ i64::MIN | Negative < positive |
| BigInt(i64) | 0x03 | 8 BE bytes after n ^ i64::MIN | Negative < positive |
| Real(f64) | 0x04 | 8 bytes (NaN=0, pos=MSB set, neg=all flipped) | IEEE order |
| Decimal(i128, u8) | 0x05 | 1 (scale) + 16 BE bytes after sign-flip | |
| Date(i32) | 0x06 | 8 BE bytes after sign-flip | |
| Timestamp(i64) | 0x07 | 8 BE bytes after sign-flip | Older < newer |
| Text | 0x08 | NUL-terminated UTF-8, 0x00 escaped as [0xFF, 0x00] | Lexicographic |
| Bytes | 0x09 | NUL-terminated, same escape | Lexicographic |
| Uuid | 0x0A | 16 raw bytes | Lexicographic |

For composite keys the encodings are concatenated — the first column has the most significant sort influence.

NULL handling: NULL values are not inserted into secondary index B-Trees. This is consistent with SQL semantics (NULL ≠ NULL) and avoids DuplicateKey errors when multiple NULLs appear in a UNIQUE index. WHERE col = NULL always falls through to a full scan.

Maximum key length: 768 bytes. Inserting a longer key returns DbError::IndexKeyTooLong; during CREATE INDEX such keys are silently skipped rather than aborting the build.

⚙️
Design Decision Integer sign-flip (`n ^ i64::MIN`) converts a signed two's-complement integer into an unsigned value whose byte-wise order matches the signed numeric order. This is a standard technique in order-preserving key encodings (CockroachDB's key encoding uses the same trick); it is correct, branch-free, and O(1).
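
A minimal sketch verifying the order-preservation property of the sign-flip trick (`encode_i64` is an illustrative name, not the actual key_encoding.rs API):

```rust
// XOR with i64::MIN flips the sign bit, so two's-complement numeric order
// becomes plain unsigned order; big-endian bytes then sort lexicographically.
fn encode_i64(n: i64) -> [u8; 8] {
    ((n ^ i64::MIN) as u64).to_be_bytes()
}

fn main() {
    let vals = [i64::MIN, -1, 0, 1, i64::MAX];
    let encoded: Vec<[u8; 8]> = vals.iter().map(|&n| encode_i64(n)).collect();
    // Byte-wise order of the encodings matches numeric order of the inputs.
    assert!(encoded.windows(2).all(|w| w[0] < w[1]));
}
```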