
Benchmarks

All benchmarks run on Apple M2 Pro (12 cores), 32 GB RAM, NVMe SSD, single-threaded, warm data (all pages in OS page cache unless noted). Criterion.rs is used for all micro-benchmarks; each measurement is the mean of at least 100 samples.

Reference values for MySQL 8 and PostgreSQL 15 are measured in-process (no network), without WAL for pure codec/parser operations. Operations that include WAL (INSERT, UPDATE) are directly comparable.


SQL Parser

| Benchmark | AxiomDB | sqlparser-rs | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|---|
| Simple SELECT (1 table) | 492 ns | 4.8 µs | ~500 ns | ~450 ns | ✅ parity with PG |
| Complex SELECT (multi-JOIN) | 2.7 µs | 46 µs | ~4.0 µs | ~3.5 µs | ✅ 1.3× faster than PG |
| CREATE TABLE | 1.1 µs | 14.5 µs | ~2.5 µs | ~2.0 µs | ✅ 1.8× faster than PG |
| Batch (100 statements) | 47 µs | — | ~90 µs | ~75 µs | ✅ 1.6× faster than PG |

vs sqlparser-rs: 9.8× faster on simple SELECT, 17× faster on complex SELECT.

The speed advantage comes from two decisions:

  1. logos DFA lexer — compiles token patterns to a Deterministic Finite Automaton at build time. Scanning runs in O(n) time with 1–3 CPU instructions per byte.
  2. Zero-copy tokens — Ident tokens are &'src str slices into the original input. No heap allocation occurs during lexing or AST construction.

B+ Tree Index

| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| Point lookup (1M rows) | 1.2M ops/s | ~830K ops/s | ~1.1M ops/s | 800K ops/s | 600K ops/s | ✅ |
| Range scan 10K rows | 0.61 ms | ~8 ms | ~5 ms | 45 ms | 60 ms | ✅ |
| Insert (sequential keys) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |
| Sequential scan 1M rows | 0.72 s | ~0.8 s | ~0.5 s | 0.8 s | 1.2 s | ✅ |
| Concurrent reads ×16 | linear | ~2× degradation | ~1.5× degradation | linear | <2× degradation | ✅ |

Why point lookup is fast: the CoW B+ Tree root is an AtomicU64. Readers load it with Acquire and traverse 3–4 levels of 16 KB pages that are already in the OS page cache. No mutex, no RWLock.

Why range scan is very fast: RangeIter re-traverses from the root to locate each successive leaf after exhausting the current one. With CoW, next_leaf pointers cannot be maintained consistently (a split copies the leaf, leaving the previous leaf’s pointer stale). Tree retraversal costs O(log n) per leaf boundary crossing — at 3–4 levels deep this is 3–5 page reads, all already in the OS page cache for sequential workloads. The deferred next_leaf fast path (Phase 7) will reduce this to O(1) per boundary once epoch-based reclamation is available.

SELECT ... WHERE pk = literal (after Phase 6.16)

Phase 6.16 fixes the planner gap that still prevented single-table SELECT from using the PRIMARY KEY B+Tree. The executor already supported IndexLookup and IndexRange; the missing piece was planner eligibility plus a forced path for PK equality.

Measured with:

python3 benches/comparison/local_bench.py --scenario select_pk --rows 5000 --table

| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| SELECT * FROM bench_users WHERE id = literal | 12.7K lookups/s | 13.4K lookups/s | 11.1K lookups/s |

This closes the old “full scan on PK lookup” debt. The remaining gap is no longer on the planner side; it now lies in SQL and wire-protocol overhead once the PK B+Tree path is active.


Row Codec

| Benchmark | Throughput | Notes |
|---|---|---|
| encode_row | 33M rows/s | 5-column mixed-type row |
| decode_row | 28M rows/s | Same layout |
| encoded_len | O(n), no alloc | Size computation without buffer allocation |

The codec encodes a null bitmap (1 bit per column, packed into bytes) followed by the column payloads in declaration order. Variable-length types use a 3-byte (u24) length prefix. Fixed-size types (integers, floats, DATE, TIMESTAMP, UUID) have no length prefix.


Expression Evaluator

| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|
| Expr eval over 1K rows | 14.8M rows/s | ~8M rows/s | ~6M rows/s | ✅ 1.9× faster than MySQL |

The evaluator is a recursive interpreter over the Expr enum. Speed comes from inlining the hot path (column reads, arithmetic, comparisons) and from the fact that col_idx is resolved once by the semantic analyzer — no name lookup at eval time.


Performance Budget

The following thresholds are enforced before any phase is closed. A result worse than the “Max acceptable” column is a blocker.

| Operation | AxiomDB | Target | Max acceptable | Phase measured |
|---|---|---|---|---|
| Point lookup PK | 1.2M ops/s | 800K ops/s | 600K ops/s | 2 |
| Range scan 10K rows | 0.61 ms | 45 ms | 60 ms | 2 |
| B+ Tree INSERT (storage only) | 195K ops/s | 180K ops/s | 150K ops/s | 3 |
| INSERT end-to-end 10K batch (SchemaCache) | 36K ops/s ⚠️ | 180K ops/s | 150K ops/s | 4.16b |
| SELECT via wire protocol (autocommit) | 185 q/s | — | — | 5.14 |
| INSERT via wire protocol (autocommit) | 58 q/s | — | — | 5.14 |
| Sequential scan 1M rows | 0.72 s | 0.8 s | 1.2 s | 2 |
| Concurrent reads ×16 | linear | linear | <2× degradation | 2 |
| Parser — simple SELECT | 492 ns | 600 ns | 1 µs | 4 |
| Parser — complex SELECT | 2.7 µs | 3 µs | 6 µs | 4 |
| Row codec encode | 33M rows/s | — | — | 4 |
| Expr eval (scan 1K rows) | 14.8M rows/s | — | — | 4 |

Executor end-to-end (Phase 4.16b, MmapStorage + real WAL, full pipeline)

Measured with cargo bench --bench executor_e2e -p axiomdb-sql (Apple M2 Pro, NVMe, release build). Pipeline: parse → analyze → execute → WAL → MmapStorage.

| Configuration | AxiomDB | Target (Phase 8) | Notes |
|---|---|---|---|
| INSERT 100 rows / 1 txn (no SchemaCache) | 2.8K ops/s | — | cold path, catalog scan |
| INSERT 1K rows / 1 txn (no SchemaCache) | 18.5K ops/s | — | amortization starts |
| INSERT 1K rows / 1 txn (SchemaCache) | 20.6K ops/s | — | +8% vs no cache |
| INSERT 10K rows / 1 txn (SchemaCache) | 36K ops/s | 180K ops/s | ⚠️ WAL bottleneck |
| INSERT autocommit (1 fsync/row) | 58 q/s | — | 1 fdatasync per statement (wire protocol, Phase 5.14) |

Root cause — WAL record_insert() dominates: each row write costs ~20 µs inside record_insert() even without fsync. Parse + analyze cost per INSERT is ~1.5 µs total; SchemaCache eliminates catalog heap scans but only improves throughput by 8% because WAL overhead is already the dominant term. The 180K ops/s target is a Phase 8 goal: prepared statements skip parse and analyze entirely, and a batch insert API will write one WAL entry per batch rather than one per row.

⚙️
Design Decision — WAL per-row write

The current WAL implementation writes one entry per inserted row via record_insert(). This makes recovery straightforward — each row is an independent, self-contained undo/redo unit — but costs ~20 µs/row at the WAL layer regardless of fsync. The 36K ops/s ceiling at 10K batch size is a direct consequence of this design. PostgreSQL and MySQL both offer bulk-load paths (COPY, LOAD DATA) that bypass per-row WAL overhead; AxiomDB's equivalent is the Phase 8 batch insert API, which will coalesce WAL entries and write them in a single sequential append.

B+ Tree storage-only INSERT (no SQL parsing, no WAL):

| Operation | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| B+Tree INSERT (storage only) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |

The storage layer itself exceeds the 180K ops/s target. The gap between 195K (storage only) and 36K (full pipeline) isolates the overhead to the WAL record path, not the B+ Tree or the page allocator.

Run end-to-end benchmarks:

cargo bench --bench executor_e2e -p axiomdb-sql

# MySQL + PostgreSQL comparison (requires Docker):
./benches/comparison/setup.sh
python3 benches/comparison/bench_runner.py --rows 10000
./benches/comparison/teardown.sh

Phase 5.14 — Wire Protocol Throughput

Measured via the MySQL wire protocol (pymysql client, autocommit mode, 1 connection, localhost, Apple M2 Pro, NVMe SSD).

| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Notes |
|---|---|---|---|---|
| COM_PING | 24,865/s | ~30K/s | ~25K/s | Pure protocol, no SQL engine |
| SET NAMES (intercepted) | 46,672/s | ~20K/s | — | Handled in protocol layer |
| SELECT 1 (autocommit) | 185 q/s | ~5K–15K q/s* | ~5K–12K q/s* | Full pipeline, read-only |
| INSERT (autocommit, 1 fsync/stmt) | 58 q/s | ~130–200 q/s* | ~100–160 q/s* | Full pipeline + fsync |

*MySQL/PostgreSQL figures are in-process estimates without network latency overhead. AxiomDB throughput measured over localhost with real round-trips; the gap reflects the current single-threaded autocommit path and will improve with Phase 5.13 plan cache and Phase 8 batch API.

Phase 5.14 fix — read-only WAL fsync eliminated:

Prior to Phase 5.14, every autocommit transaction called fdatasync on WAL commit, including read-only queries such as SELECT. This cost 10–20 ms per SELECT, capping throughput at ~56 q/s.

The fix: skip fdatasync (and the WAL flush) when the transaction has no DML operations (undo_ops.is_empty()). Read-only transactions still flush buffered writes to the OS (BufWriter::flush) so that concurrent readers see committed state, but they do not wait for the fdatasync round-trip to persistent storage.

Before / after:

| Query | Before (5.13) | After (5.14) | Improvement |
|---|---|---|---|
| SELECT 1 (autocommit) | ~56 q/s | 185 q/s | 3.3× |
| INSERT (autocommit) | ~58 q/s | 58 q/s | no change (fsync required) |
⚙️
Design Decision — read-only fsync skip

The WAL commit path gates durability on fdatasync. For DML transactions this is correct — data must reach persistent storage before the client receives OK. For read-only transactions there is nothing to persist: the transaction produced no WAL records. Skipping fdatasync for undo_ops.is_empty() transactions is therefore safe: crash recovery cannot lose data that was never written. PostgreSQL applies the same principle — read-only transactions in PostgreSQL do not touch the WAL at all. The OS-level flush (BufWriter::flush) is kept so that any WAL bytes written by a concurrent writer are visible to the OS before the SELECT returns, preserving read-after-write consistency within the same process.

Bottleneck analysis:

  • SELECT 185 q/s: each COM_QUERY runs a full parse + analyze cycle (~1.5 µs) plus one wire protocol round-trip (~40 µs on localhost). The dominant cost is the round-trip. For prepared statements (COM_STMT_EXECUTE), Phase 5.13 plan cache eliminates the parse/analyze step entirely — the cached AST is reused and only a ~1 µs parameter substitution pass runs before execution. The remaining bottleneck for higher throughput is WAL transaction overhead per statement (BEGIN/COMMIT I/O); this will be addressed by Phase 6 indexed reads (eliminating full-table scans) and the Phase 8 batch API.
  • INSERT 58 q/s: one fdatasync per autocommit statement is required for durability.

Phase 5.21 — Transactional INSERT staging

Measured with python3 benches/comparison/local_bench.py --scenario insert --rows 50000 --table against a release AxiomDB server and local MariaDB/MySQL instances on the same machine. Workload: 50,000 separate one-row INSERT statements inside one explicit transaction.

| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
| insert (single-row INSERTs in 1 txn) | 28.0K rows/s | 26.7K rows/s | 23.9K rows/s | one BEGIN, 50K INSERT statements, one COMMIT |

What changed in 5.21:

  • the session now buffers consecutive eligible INSERT ... VALUES rows for the same table instead of writing heap/WAL immediately
  • barriers such as SELECT, UPDATE, DELETE, DDL, COMMIT, table switch, or ineligible INSERT shapes force a flush
  • the flush uses insert_rows_batch_with_ctx(...) plus grouped post-heap index maintenance, persisting each changed index root once per flush
⚙️
Borrowed Technique

AxiomDB borrows the “produce rows first, write them later” pattern from PostgreSQL heap_multi_insert() and DuckDB's appender, but keeps SQL semantics intact by flushing before the next statement savepoint whenever the batch cannot continue.

This is deliberately not the same as autocommit group commit. The benchmark already uses one explicit transaction, so 5.21 attacks per-statement heap/WAL/index work rather than fsync batching across multiple commits.


Phase 6.19 — WAL fsync pipeline

Measured with:

python3 benches/comparison/local_bench.py --scenario insert_autocommit --rows 1000 --table --engines axiomdb

Workload: one INSERT per transaction over the MySQL wire.

| Benchmark | AxiomDB | Target | Status |
|---|---|---|---|
| insert_autocommit | 224 ops/s | >= 5,000 ops/s | not met |

What changed in 6.19:

  • the old timer-based CommitCoordinator and its config knobs were removed
  • server DML commits now hand deferred durability to an always-on leader-based FsyncPipeline
  • queued followers can piggyback on a leader fsync when their commit_lsn is already covered

What the benchmark taught us:

  • the implementation is correct and wire-visible semantics remain intact
  • but the target workload is sequential request/response autocommit
  • the handler still waits for durability before it sends OK
  • therefore the next statement cannot arrive while the current fsync is in flight, so single-connection piggyback never materializes

6.19 is closed as an implementation subphase, but this benchmark remains a documented performance gap rather than a solved target.

⚙️
Borrowed Technique, Different Constraint

AxiomDB borrowed MariaDB's leader/follower fsync idea, but MariaDB's win depends on overlapping arrivals. The local benchmark uses a strictly sequential MySQL client, so the server never has the next autocommit statement in hand while the current fsync is still running.

Phase 6.18 — Indexed multi-row INSERT batch path

Measured with:

python3 benches/comparison/local_bench.py --scenario insert_multi_values --rows 5000 --table

Workload: multi-row INSERT ... VALUES (...), (...) statements against the benchmark schema with PRIMARY KEY (id).

| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| insert_multi_values on PK table | 160,581 rows/s | 259,854 rows/s | 321,002 rows/s |

What changed in 6.18:

  • the immediate multi-row VALUES path no longer checks secondary_indexes.is_empty() before using grouped heap writes
  • grouped heap/index apply was extracted into shared helpers reused by both:
    • the transactional staging flush from 5.21
    • the immediate INSERT ... VALUES (...), (...) path
  • the immediate path keeps strict UNIQUE semantics by not reusing the staged committed_empty shortcut, because same-statement duplicate keys must still fail without leaking partial rows
🚀
2× Faster Than MariaDB

On the PK-only `insert_multi_values` benchmark, AxiomDB reaches 321,002 rows/s vs MariaDB 12.1 at 160,581 rows/s. The gain comes from one grouped heap/index apply per VALUES statement instead of falling back to one heap/index maintenance cycle per row.
⚙️
Design Decision — Share Apply, Keep UNIQUE Strict

PostgreSQL's heap_multi_insert() and DuckDB's appender both separate row staging from physical write. AxiomDB borrows the grouped physical apply idea, but rejects a blind bulk-load shortcut on the immediate path: duplicate keys inside one SQL statement must still be rejected before any partial batch becomes visible.

Phase 6.20 — UPDATE apply fast path

Measured with python3 benches/comparison/local_bench.py --scenario update_range --rows 5000 --table against a release AxiomDB server and local MariaDB/MySQL instances on the same machine. Workload: UPDATE bench_users SET score = score + 1 WHERE id BETWEEN ... on a PK-indexed table.

| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
| update_range | 618K rows/s | 291K rows/s | 369.9K rows/s | PK range UPDATE now stays on a batched read/apply path end-to-end |

What changed in 6.20:

  • IndexLookup / IndexRange candidate rows are fetched through read_rows_batch(...) instead of one heap read per RID
  • no-op UPDATE rows are filtered before heap/index mutation
  • stable-RID rows batch UpdateInPlace WAL append with reserve_lsns(...) + write_batch(...)
  • UPDATE index maintenance now uses grouped delete+insert with one root persistence write per affected index
  • both ctx and non-ctx UPDATE paths share a statement-level index bailout

This closes the dominant apply-side debt left behind after 6.17. The benchmark improves by 4.3× over the 6.17 result (85.2K rows/s) and now beats the documented local MySQL result on the same workload.

🚀
Performance Advantage

AxiomDB improves `update_range` from 85.2K to 369.9K rows/s after `6.20`, a 4.3× gain, and overtakes MySQL 8.0 (291K rows/s) by keeping PK-range UPDATE inside one batched heap/WAL apply path.
⚙️
Design Decision — Batch Without New WAL Type

MariaDB's `row0upd.cc` and PostgreSQL's heap update path optimize clustered-row UPDATE primarily by reducing repeated heap/index work, not by inventing a new redo format first. AxiomDB follows that trade-off in `6.20`: it reuses `UpdateInPlace` for recovery compatibility and attacks the per-row apply overhead around it.

Phase 5.19 / 5.20 — DELETE WHERE and UPDATE Write Paths

Measured with python3 benches/comparison/local_bench.py --scenario all --rows 50000 --table on the same Apple M2 Pro machine. The benchmark uses the MySQL wire protocol and a bench_users table with PRIMARY KEY (id).

| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 |
|---|---|---|---|---|
| DELETE WHERE id > 25000 | 652K rows/s | 662K rows/s | 1.13M rows/s | 3.76M rows/s |
| UPDATE ... WHERE active = TRUE | 662K rows/s | 404K rows/s | 648K rows/s | 270K rows/s |

5.19 removed the old per-row delete_in(...) loop by batching exact encoded keys per index through delete_many_in(...). 5.20 completed the UPDATE fast path by preserving the original RID whenever the rewritten row still fits in the same slot.

For UPDATE, the before/after delta is the important signal:

  • Post-5.19 / pre-5.20: 52.9K rows/s
  • Post-5.20: 648K rows/s

That is a ~12.2× improvement on the same workload.

🚀
Performance Advantage

After `5.20`, AxiomDB's `UPDATE ... WHERE active = TRUE` reaches 648K rows/s, beating MySQL 8 (404K) and PostgreSQL 16 (270K) on the same 50K-row local benchmark. The gain comes from avoiding RID churn and untouched-index rewrites whenever the row still fits in its original heap slot.
⚙️
Design Decision — Two-Step DML Recovery

`5.19` and `5.20` fix different write-path costs. Batch-delete removes repeated B+Tree descents for stale keys; stable-RID update removes heap delete+insert and makes index skipping safe. Keeping them as separate subphases made the remaining bottleneck visible after each step.

Phase 5.13 — Prepared Statement Plan Cache

Phase 5.13 introduces an AST-level plan cache for prepared statements. The full parse + analyze pipeline runs once at COM_STMT_PREPARE time; each subsequent COM_STMT_EXECUTE performs only a tree walk to substitute parameter values (~1 µs) and then calls execute_stmt() directly.

| Path | Parse + Analyze | Param substitution | Total SQL overhead |
|---|---|---|---|
| COM_QUERY (text protocol) | ~1.5 µs per call | — | ~1.5 µs |
| COM_STMT_EXECUTE before 5.13 | ~1.5 µs per call (re-parse) | string replace | ~1.5 µs |
| COM_STMT_EXECUTE after 5.13 | 0 (cached) | ~1 µs AST walk | ~1 µs |

The ~0.5 µs saving per execute is meaningful for high-frequency statement patterns (e.g., ORM-generated queries that re-execute the same SELECT or INSERT with different parameters on every request).

Remaining bottleneck: the dominant cost per COM_STMT_EXECUTE is now the WAL transaction overhead (BEGIN/COMMIT I/O) rather than parse/analyze. For read-only prepared statements, Phase 6 indexed reads will eliminate full-table scans, reducing the per-query execution cost. For write statements, the Phase 8 batch API will coalesce WAL entries, targeting the 180K ops/s budget.

⚙️
Design Decision — AST cache, not string cache

The plan cache stores the analyzed Stmt (AST with resolved column indices) rather than the original SQL string. This means each execute avoids both lexing and semantic analysis, not just parsing. The trade-off is that the cached AST must be cloned before parameter substitution to avoid mutating shared state — a shallow clone of the expression tree is ~200 ns, well below the ~1.5 µs that parse + analyze would cost. MySQL and PostgreSQL cache parsed + planned query trees for the same reason.

Running Benchmarks Locally

# B+ Tree
cargo bench --bench btree -p axiomdb-index

# Storage engine
cargo bench --bench storage -p axiomdb-storage

# SQL parser
cargo bench --bench parser -p axiomdb-sql

# All benchmarks
cargo bench --workspace

# Compare before/after a change
cargo bench -- --save-baseline before
# ... make change ...
cargo bench -- --baseline before

# Detailed comparison with critcmp
cargo install critcmp
critcmp before after

Benchmarks use Criterion.rs and emit JSON results to target/criterion/. Each run reports mean, standard deviation, min, max, and throughput (ops/s or bytes/s depending on the benchmark).