Design Decisions
This page documents the most consequential architectural choices made during AxiomDB’s design. Each entry explains the alternatives considered, the reasoning, and the trade-offs accepted.
Query Languages
SQL + AxiomQL dual-language strategy
| Aspect | Decision |
|---|---|
| Chosen | Two query languages sharing one AST and executor |
| Alternatives | SQL only; AxiomQL only; SQL-to-AxiomQL transpiler |
| Phase | Phase 12+ (post wire protocol) |
SQL is the primary language. Full MySQL/PostgreSQL wire protocol compatibility. All ORMs, clients, and tools work without changes. Nothing breaks for anyone.
AxiomQL is an optional alternative — a method-chain query language for developers who prefer modern, readable syntax. It compiles to the same Stmt AST as SQL, so there is zero executor overhead and every SQL feature is automatically available in AxiomQL.
SQL ──────┐
├──► AST ──► Optimizer ──► Executor
AxiomQL ───┘
AxiomQL syntax reads top-to-bottom in the logical order of execution:
users
.filter(active, age > 18)
.join(orders)
.group(country, total: count())
.sort(total.desc)
.take(10)
This is already familiar to any developer who uses .filter().map().sort() in JavaScript, Python, Rust, or C#. The learning curve is ~10 minutes.
Why not SQL-only: SQL’s evaluation order (SELECT before FROM, HAVING separate from WHERE) is a 50-year-old quirk that confuses new users. AxiomQL removes the confusion without removing SQL.
Why not AxiomQL-only: Breaking compatibility with every MySQL client, ORM, and tool in existence would be unacceptable. SQL stays.
No existing database has this combination: ORMs like ActiveRecord and Eloquent are application-layer libraries, not native DB languages. PRQL compiles to SQL externally. EdgeQL is native but a different syntax family. AxiomQL would be the first native method-chain language that coexists with SQL in the same engine.
Storage
mmap over a Custom Buffer Pool
| Aspect | Decision |
|---|---|
| Chosen | memmap2::MmapMut — OS-managed page cache |
| Alternatives | Custom buffer pool (like InnoDB), io_uring direct I/O |
| Phase | Phase 1 (Storage Engine) |
Why mmap:
- The OS page cache provides LRU eviction, readahead prefetching, and dirty page write-back for free. Implementing these correctly in user space takes months of engineering work.
- Pages returned by
read_page()are&Pagereferences directly into the mapped memory — zero copy from kernel to application. - MySQL InnoDB maintains a separate buffer pool on top of the OS page cache. The same physical page lives in RAM twice (once in the kernel page cache, once in the buffer pool). mmap eliminates the second copy.
msync(MS_SYNC)provides the same durability guarantee asfsyncfor WAL and checkpoint flushes.
Trade-offs accepted:
- No fine-grained control over eviction policy (OS uses LRU; a custom pool could use clock-sweep with hot/cold zones).
- On 32-bit systems, mmap is limited by the address space. Not a concern for a modern 64-bit server database.
- mmap I/O errors manifest as
SIGBUSrather thanErr(...). These are handled with a signal handler that convertsSIGBUStoDbError::Io.
16 KB Page Size
| Aspect | Decision |
|---|---|
| Chosen | 16,384 bytes (16 KB) |
| Alternatives | 4 KB (SQLite), 8 KB (PostgreSQL), 8 KB (original db.md spec) |
| Phase | Phase 1 |
Why 16 KB:
- The B+ Tree ORDER constants (ORDER_INTERNAL = 223, ORDER_LEAF = 217) yield a highly efficient fan-out with 16 KB pages. At 4 KB, the order would be ~54 for internal nodes — requiring 4× more page reads for the same number of keys.
- At 16 KB, a tree covering 1 billion rows has depth 4. At 4 KB, depth 5 (25% more I/O for every lookup).
- OS readahead typically prefetches 128–512 KB, making 16 KB the sweet spot: small enough that random access is not wasteful, large enough for sequential workloads.
- 64-byte header leaves 16,320 bytes for the body — a natural fit for the
bytemuck::Podstructs that avoid alignment issues.
Indexing
Copy-on-Write B+ Tree
| Aspect | Decision |
|---|---|
| Chosen | CoW B+ Tree with AtomicU64 root swap |
| Alternatives | Traditional B+ Tree with read-write locks; LSM-tree (like RocksDB); Fractal tree |
| Phase | Phase 2 (B+ Tree) |
Why CoW B+ Tree:
- Readers are completely lock-free. A
SELECTon a billion-row table never blocks any concurrentINSERT,UPDATE, orDELETE. - MVCC is “built in” — readers hold a pointer to the old root and see a consistent snapshot of the tree, exactly as MVCC requires.
- No deadlocks are possible during tree traversal (locks are never held during reads).
- Writes amplify by O(log n) page copies, but at depth 4 this is 4 × 16 KB = 64 KB per insert — acceptable for the target workload (OLTP, not write-heavy OLAP).
Why not LSM:
- LSM-trees have superior write throughput (sequential I/O only) but inferior read performance (must check multiple levels). AxiomDB’s target is OLTP with read-heavy workloads. A B+ Tree point lookup is O(log n) I/Os; an LSM lookup is O(L) compaction levels, each potentially requiring a disk seek.
- Compaction in LSM introduces unpredictable write amplification spikes that are difficult to tune for latency-sensitive OLTP.
next_leaf Not Used in Range Scans
| Aspect | Decision |
|---|---|
| Chosen | Re-traverse from root to find the next leaf on each boundary crossing |
| Alternatives | Keep the next_leaf linked list consistent under CoW |
| Phase | Phase 2 |
Why: Under CoW, next_leaf pointers in old leaf pages point to other old pages
that may have been freed. Maintaining a consistent linked list under CoW requires
copying the previous leaf on every insert near a boundary — but the previous leaf’s
page_id is not known during a top-down write path without additional bookkeeping.
The cost of the adopted solution (O(log n) per leaf boundary) is acceptable: for a 10,000-row range scan across ~47 leaves (217 rows/leaf), there are 46 boundary crossings, each costing 4 page reads = 184 extra page reads. At a measured scan time of 0.61 ms for 10,000 rows, this is within the 45 ms budget by a factor of 73.
Durability
WAL Without Double-Write Buffer
| Aspect | Decision |
|---|---|
| Chosen | WAL with per-page CRC32c; no double-write buffer |
| Alternatives | Double-write buffer (MySQL InnoDB); full page WAL images (PostgreSQL) |
| Phase | Phase 3 (WAL) |
Why no double-write:
- MySQL writes each page twice: once to the doublewrite buffer and once to the actual position. The doublewrite buffer protects against torn writes (partial page writes due to power failure mid-write).
- AxiomDB protects against torn writes with a CRC32c checksum per page. If a page has an invalid checksum on startup, it is reconstructed from the WAL. This requires the WAL to contain the information needed for reconstruction — which it does (the WAL records the full new_value for each UPDATE/INSERT).
- Eliminating the double-write buffer halves the disk writes for every dirty page flush.
Trade-off: Recovery requires reading more WAL data. If many pages are corrupted (e.g., a full power failure after a long write batch), recovery replays more WAL entries. In practice, with modern UPS and filesystem journaling, full-file corruption is rare. The WAL’s CRC32c catches partial writes reliably.
Physical WAL (not Logical WAL)
| Aspect | Decision |
|---|---|
| Chosen | Physical WAL: records (page_id, slot_id, old_bytes, new_bytes) |
| Alternatives | Logical WAL: records SQL-level operations (INSERT INTO t VALUES…) |
| Phase | Phase 3 |
Why physical:
- Recovery is redo-only: replay each committed WAL entry at its exact physical location. No UNDO pass required (uncommitted changes are simply ignored).
- Physical location (page_id, slot_id) allows direct seek to the affected page — O(1) per WAL entry, not O(log n) B+ Tree traversal.
- The WAL key encodes
page_id:8 + slot_id:2in 10 bytes, making the physical location self-contained in the WAL record.
Trade-off: Physical WAL entries are larger than logical ones (they contain the full encoded row bytes, not a SQL expression). For a row with 100 bytes of data, the WAL entry is ~100 + 43 bytes overhead = ~143 bytes. A logical WAL entry might be smaller for simple inserts. However, the simplicity and speed of redo-only physical recovery outweighs the size difference.
SQL Processing
logos for Lexing
| Aspect | Decision |
|---|---|
| Chosen | logos crate — compiled DFA |
| Alternatives | nom combinators; pest PEG; hand-written lexer; lalrpop |
| Phase | Phase 4.2 (SQL Lexer) |
Why logos:
- logos compiles all token patterns (keywords, identifiers, literals) into a single DFA at build time. Runtime cost per character is a table lookup — 1–3 CPU instructions.
- The
ignore(ascii_case)attribute makes keyword matching case-insensitive with no runtime cost (the DFA is built with both cases folded). - Zero-copy:
Ident(&'src str)slices into the input without heap allocation. - Measured throughput: 9–17× faster than sqlparser-rs for the same inputs.
nom is an excellent choice for context-free parsing with backtracking but is over-engineered for a lexer: a lexer is a regular language (no backtracking needed), and DFA is the optimal algorithm for it.
Zero-Copy Tokens
| Aspect | Decision |
|---|---|
| Chosen | Token::Ident(&'src str) — lifetime-tied reference into the input |
| Alternatives | Token::Ident(String) — owned heap allocation; Token::Ident(Arc<str>) |
| Phase | Phase 4.2 |
Why zero-copy:
- Heap allocation per identifier would cost ~30 ns on modern hardware (involving a
malloccall). For a query with 20 identifiers, that is 600 ns of allocation overhead. - At 2M queries/s (the target throughput), 600 ns per query consumes 1.2 s per second of CPU time in allocations — impossible to sustain.
- Zero-copy tokens require the input string to outlive the token stream, which is a natural constraint: the input is always available until the query finishes.
MVCC Implementation
RowHeader in Heap Pages (not Undo Tablespace)
| Aspect | Decision |
|---|---|
| Chosen | MVCC metadata (xmin, xmax, deleted) in each heap row |
| Alternatives | Separate undo tablespace (MySQL InnoDB); version chain in B+ Tree (PostgreSQL MVCC heap) |
| Phase | Phase 3 (TxnManager) |
Why inline RowHeader:
- A historical row version is visible in its original heap location. No additional I/O is needed to read old versions — they are in the same page as the current version.
- MySQL’s undo tablespace (
ibdata1) requires additional I/O for reads that need old row versions (the reader follows a pointer chain from the clustered index into the undo tablespace). - Inline metadata is simpler to implement and audit.
Trade-offs:
- Dead rows occupy space in the heap until
VACUUM(Phase 9) cleans them up. - The
RowHeaderadds 24 bytes overhead per row. For a table with 50-byte average rows, this is 32% overhead. Acceptable for the generality it provides.
Collation
UCA Root as Default Collation
| Aspect | Decision |
|---|---|
| Chosen | Unicode Collation Algorithm (UCA) root for string comparison |
| Alternatives | ASCII byte order; locale-specific collation; C locale (PostgreSQL default) |
| Phase | Phase 4 (Types) |
Why UCA root:
- ASCII byte order (
strcmp) gives incorrect ordering for most non-English text: ‘ä’ sorts after ‘z’ in ASCII, but should sort near ‘a’. - UCA root is locale-neutral (deterministic across any server environment) while still correct for most languages.
- MySQL’s default collation (utf8mb4_general_ci) is not standards-compliant.
- UCA root is implemented by the
icucrate — same algorithm used by modern browsers forIntl.Collator.
WAL Optimization
Per-Page WAL Entries (PageWrite) vs Per-Row WAL Entries
| Aspect | Decision |
|---|---|
| Chosen | EntryType::PageWrite = 9: one WAL entry per heap page for bulk inserts |
| Alternatives | Per-row Insert entries (original approach); full redo log (PostgreSQL WAL) |
| Phase | Phase 3.18 |
Why per-page:
For bulk inserts (INSERT INTO t VALUES (r1),(r2),...), the per-row approach writes one
WAL entry per row: 10,000 rows = 10,000 serialize_into() calls + 10,000 CRC32c
computations. Per-page replaces these with ~42 entries (one per 16 KB page, holding ~240
rows each) — 238× fewer serializations and a 30% smaller WAL file.
The PageWrite entry format stores:
- The full post-modification page bytes (
new_value[0..PAGE_SIZE]) — available for future REDO-based power-failure recovery (Phase 3.8b). - The inserted slot IDs (
new_value[PAGE_SIZE+2..]) — used by crash recovery to undo uncommittedPageWriteentries by marking each slot dead, identical in effect to undoing N individualInsertentries.
Trade-offs accepted:
- Each
PageWriteentry is ~16 KB vs ~100 bytes for anInsertentry. For sparse inserts (a few rows per page),PageWriteis larger. The optimization only applies toinsert_rows_batch()(multi-row INSERT) — single-row inserts still useInsertentries. - Crash recovery must parse the embedded slot list instead of simply reading a single physical location. The parsing is O(num_slots) per entry — still O(N) total, identical asymptotic cost.
Why not a full redo log (like PostgreSQL WAL):
PostgreSQL writes a physical page image + logical redo records for every page modification.
Our PageWrite is a simplified version: we write only the post-image (for bulk inserts)
and rely on the existing in-memory undo log for rollback. Full redo would require per-page
LSNs and a replay pass on startup — reserved for Phase 3.8b.
txn_id_created == crashed_txn_id. We rejected this because it requires reading the page from storage during the crash recovery scan — before the undo phase even begins. Embedding the slot IDs in the PageWrite entry keeps crash recovery a pure WAL read pass: no storage I/O needed to determine what to undo.
Content-Addressed BLOB Storage (Planned Phase 6)
| Aspect | Decision |
|---|---|
| Planned | SHA-256 content address as the BLOB key in a dedicated BLOB store |
| Alternatives | Inline BLOB in the heap (PostgreSQL TOAST); external file reference |
| Phase | Phase 6 |
Why content-addressed:
- Two rows storing the same attachment (e.g., a company logo in every invoice) share exactly one copy on disk. Deduplication is automatic and requires no extra schema.
- The BLOB store is append-only with immutable entries — no locking on BLOB reads.
- Deletion is handled by reference counting: when the last row referencing a BLOB is deleted, the BLOB can be garbage collected.