Benchmarks
All benchmarks run on Apple M2 Pro (12 cores), 32 GB RAM, NVMe SSD, single-threaded, warm data (all pages in OS page cache unless noted). Criterion.rs is used for all micro-benchmarks; each measurement is the mean of at least 100 samples.
Reference values for MySQL 8 and PostgreSQL 15 are measured in-process (no network), without WAL for pure codec/parser operations. Operations that include WAL (INSERT, UPDATE) are directly comparable.
SQL Parser
| Benchmark | AxiomDB | sqlparser-rs | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|---|
| Simple SELECT (1 table) | 492 ns | 4.8 µs | ~500 ns | ~450 ns | ✅ parity with PG |
| Complex SELECT (multi-JOIN) | 2.7 µs | 46 µs | ~4.0 µs | ~3.5 µs | ✅ 1.3× faster than PG |
| CREATE TABLE | 1.1 µs | 14.5 µs | ~2.5 µs | ~2.0 µs | ✅ 1.8× faster than PG |
| Batch (100 statements) | 47 µs | — | ~90 µs | ~75 µs | ✅ 1.6× faster than PG |
vs sqlparser-rs: 9.8× faster on simple SELECT, 17× faster on complex SELECT.
The speed advantage comes from two decisions:
- logos DFA lexer — compiles token patterns to a Deterministic Finite Automaton at build time. Scanning runs in O(n) time with 1–3 CPU instructions per byte.
- Zero-copy tokens —
Identtokens are&'src strslices into the original input. No heap allocation occurs during lexing or AST construction.
B+ Tree Index
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| Point lookup (1M rows) | 1.2M ops/s | ~830K ops/s | ~1.1M ops/s | 800K ops/s | 600K ops/s | ✅ |
| Range scan 10K rows | 0.61 ms | ~8 ms | ~5 ms | 45 ms | 60 ms | ✅ |
| Insert (sequential keys) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |
| Sequential scan 1M rows | 0.72 s | ~0.8 s | ~0.5 s | 0.8 s | 1.2 s | ✅ |
| Concurrent reads ×16 | linear | ~2× degradation | ~1.5× degradation | linear | <2× degradation | ✅ |
Why point lookup is fast: the CoW B+ Tree root is an AtomicU64. Readers load it
with Acquire and traverse 3–4 levels of 16 KB pages that are already in the OS page
cache. No mutex, no RWLock.
Why range scan is very fast: RangeIter re-traverses from the root to locate
each successive leaf after exhausting the current one. With CoW, next_leaf pointers
cannot be maintained consistently (a split copies the leaf, leaving the previous leaf’s
pointer stale). Tree retraversal costs O(log n) per leaf boundary crossing — at 3–4
levels deep this is 3–5 page reads, all already in the OS page cache for sequential
workloads. The deferred next_leaf fast path (Phase 7) will reduce this to O(1) per
boundary once epoch-based reclamation is available.
SELECT ... WHERE pk = literal After 6.16
Phase 6.16 fixes the planner gap that still prevented single-table SELECT
from using the PRIMARY KEY B+Tree. The executor already supported IndexLookup
and IndexRange; the missing piece was planner eligibility plus a forced path
for PK equality.
Measured with:
python3 benches/comparison/local_bench.py --scenario select_pk --rows 5000 --table
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
SELECT * FROM bench_users WHERE id = literal | 12.7K lookups/s | 13.4K lookups/s | 11.1K lookups/s |
This closes the old “full scan on PK lookup” debt. The remaining gap is no longer planner-side; it is now in SQL/wire overhead after the PK B+Tree path is already active.
Row Codec
| Benchmark | Throughput | Notes |
|---|---|---|
encode_row | 33M rows/s | 5-column mixed-type row |
decode_row | 28M rows/s | Same layout |
encoded_len | O(n), no alloc | Size computation without buffer allocation |
The codec encodes a null bitmap (1 bit per column, packed into bytes) followed by the column payloads in declaration order. Variable-length types use a 3-byte (u24) length prefix. Fixed-size types (integers, floats, DATE, TIMESTAMP, UUID) have no length prefix.
Expression Evaluator
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|
| Expr eval over 1K rows | 14.8M rows/s | ~8M rows/s | ~6M rows/s | ✅ 1.9× faster than MySQL |
The evaluator is a recursive interpreter over the Expr enum. Speed comes from
inlining the hot path (column reads, arithmetic, comparisons) and from the fact
that col_idx is resolved once by the semantic analyzer — no name lookup at eval time.
Performance Budget
The following thresholds are enforced before any phase is closed. A result below the “Max acceptable” column is a blocker.
| Operation | AxiomDB | Target | Max acceptable | Phase measured |
|---|---|---|---|---|
| Point lookup PK | 1.2M ops/s ✅ | 800K ops/s | 600K ops/s | 2 |
| Range scan 10K rows | 0.61 ms ✅ | 45 ms | 60 ms | 2 |
| B+ Tree INSERT (storage only) | 195K ops/s ✅ | 180K ops/s | 150K ops/s | 3 |
| INSERT end-to-end 10K batch (SchemaCache) | 36K ops/s ⚠️ | 180K ops/s | 150K ops/s | 4.16b |
| SELECT via wire protocol (autocommit) | 185 q/s ✅ | — | — | 5.14 |
| INSERT via wire protocol (autocommit) | 58 q/s — | — | — | 5.14 |
| Sequential scan 1M rows | 0.72 s ✅ | 0.8 s | 1.2 s | 2 |
| Concurrent reads ×16 | linear ✅ | linear | <2× degradation | 2 |
| Parser — simple SELECT | 492 ns ✅ | 600 ns | 1 µs | 4 |
| Parser — complex SELECT | 2.7 µs ✅ | 3 µs | 6 µs | 4 |
| Row codec encode | 33M rows/s ✅ | — | — | 4 |
| Expr eval (scan 1K rows) | 14.8M rows/s ✅ | — | — | 4 |
Executor end-to-end (Phase 4.16b, MmapStorage + real WAL, full pipeline)
Measured with cargo bench --bench executor_e2e -p axiomdb-sql (Apple M2 Pro, NVMe,
release build). Pipeline: parse → analyze → execute → WAL → MmapStorage.
| Configuration | AxiomDB | Target (Phase 8) | Notes |
|---|---|---|---|
| INSERT 100 rows / 1 txn (no SchemaCache) | 2.8K ops/s | — | cold path, catalog scan |
| INSERT 1K rows / 1 txn (no SchemaCache) | 18.5K ops/s | — | amortization starts |
| INSERT 1K rows / 1 txn (SchemaCache) | 20.6K ops/s | — | +8% vs no cache |
| INSERT 10K rows / 1 txn (SchemaCache) | 36K ops/s | 180K ops/s | ⚠️ WAL bottleneck |
| INSERT autocommit (1 fsync/row) | 58 q/s | — | 1 fdatasync per statement (wire protocol, Phase 5.14) |
Root cause — WAL record_insert() dominates: each row write costs ~20 µs inside
record_insert() even without fsync. Parse + analyze cost per INSERT is ~1.5 µs total;
SchemaCache eliminates catalog heap scans but only improves throughput by 8% because WAL
overhead is already the dominant term. The 180K ops/s target is a Phase 8 goal: prepared
statements skip parse and analyze entirely, and a batch insert API will write one WAL entry
per batch rather than one per row.
record_insert(). This makes recovery straightforward — each row is an
independent, self-contained undo/redo unit — but costs ~20 µs/row at the WAL layer
regardless of fsync. The 36K ops/s ceiling at 10K batch size is a direct consequence of
this design. PostgreSQL and MySQL both offer bulk-load paths (COPY, LOAD DATA) that bypass
per-row WAL overhead; AxiomDB's equivalent is the Phase 8 batch insert API, which will
coalesce WAL entries and write them in a single sequential append.
B+ Tree storage-only INSERT (no SQL parsing, no WAL):
| Operation | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| B+Tree INSERT (storage only) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |
The storage layer itself exceeds the 180K ops/s target. The gap between 195K (storage only) and 36K (full pipeline) isolates the overhead to the WAL record path, not the B+ Tree or the page allocator.
Run end-to-end benchmarks:
cargo bench --bench executor_e2e -p axiomdb-sql
# MySQL + PostgreSQL comparison (requires Docker):
./benches/comparison/setup.sh
python3 benches/comparison/bench_runner.py --rows 10000
./benches/comparison/teardown.sh
Phase 5.14 — Wire Protocol Throughput
Measured via the MySQL wire protocol (pymysql client, autocommit mode, 1 connection, localhost, Apple M2 Pro, NVMe SSD).
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Notes |
|---|---|---|---|---|
| COM_PING | 24,865/s | ~30K/s | ~25K/s | Pure protocol, no SQL engine |
| SET NAMES (intercepted) | 46,672/s | ~20K/s | — | Handled in protocol layer |
| SELECT 1 (autocommit) | 185 q/s | ~5K–15K q/s* | ~5K–12K q/s* | Full pipeline, read-only |
| INSERT (autocommit, 1 fsync/stmt) | 58 q/s | ~130–200 q/s* | ~100–160 q/s* | Full pipeline + fsync |
*MySQL/PostgreSQL figures are in-process estimates without network latency overhead. AxiomDB throughput measured over localhost with real round-trips; the gap reflects the current single-threaded autocommit path and will improve with Phase 5.13 plan cache and Phase 8 batch API.
Phase 5.14 fix — read-only WAL fsync eliminated:
Prior to Phase 5.14, every autocommit transaction called fdatasync on WAL commit,
including read-only queries such as SELECT. This cost 10–20 ms per SELECT, capping
throughput at ~56 q/s.
The fix: skip fdatasync (and the WAL flush) when the transaction has no DML operations
(undo_ops.is_empty()). Read-only transactions still flush buffered writes to the OS
(BufWriter::flush) so that concurrent readers see committed state, but they do not
wait for the fdatasync round-trip to persistent storage.
Before / after:
| Query | Before (5.13) | After (5.14) | Improvement |
|---|---|---|---|
| SELECT 1 (autocommit) | ~56 q/s | 185 q/s | 3.3× |
| INSERT (autocommit) | ~58 q/s | 58 q/s | no change (fsync required) |
fdatasync. For DML transactions
this is correct — data must reach persistent storage before the client receives OK.
For read-only transactions there is nothing to persist: the transaction produced no WAL
records. Skipping fdatasync for undo_ops.is_empty() transactions
is therefore safe: crash recovery cannot lose data that was never written. PostgreSQL applies
the same principle — read-only transactions in PostgreSQL do not touch the WAL at all.
The OS-level flush (BufWriter::flush) is kept so that any WAL bytes written by
a concurrent writer are visible to the OS before the SELECT returns, preserving read-after-write
consistency within the same process.
Bottleneck analysis:
- SELECT 185 q/s: each
COM_QUERYruns a full parse + analyze cycle (~1.5 µs) plus one wire protocol round-trip (~40 µs on localhost). The dominant cost is the round-trip. For prepared statements (COM_STMT_EXECUTE), Phase 5.13 plan cache eliminates the parse/analyze step entirely — the cached AST is reused and only a ~1 µs parameter substitution pass runs before execution. The remaining bottleneck for higher throughput is WAL transaction overhead per statement (BEGIN/COMMIT I/O); this will be addressed by Phase 6 indexed reads (eliminating full-table scans) and the Phase 8 batch API. - INSERT 58 q/s: one
fdatasyncper autocommit statement is required for durability.
Phase 5.21 — Transactional INSERT staging
Measured with python3 benches/comparison/local_bench.py --scenario insert --rows 50000 --table
against a release AxiomDB server and local MariaDB/MySQL instances on the same
machine. Workload: 50,000 separate one-row INSERT statements inside one
explicit transaction.
| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
insert (single-row INSERTs in 1 txn) | 28.0K rows/s | 26.7K rows/s | 23.9K rows/s | one BEGIN, 50K INSERT statements, one COMMIT |
What changed in 5.21:
- the session now buffers consecutive eligible
INSERT ... VALUESrows for the same table instead of writing heap/WAL immediately - barriers such as
SELECT,UPDATE,DELETE, DDL,COMMIT, table switch, or ineligible INSERT shapes force a flush - the flush uses
insert_rows_batch_with_ctx(...)plus grouped post-heap index maintenance, persisting each changed index root once per flush
heap_multi_insert() and DuckDB's appender, but keeps SQL semantics intact by
flushing before the next statement savepoint whenever the batch cannot continue.
This is deliberately not the same as autocommit group commit. The benchmark
already uses one explicit transaction, so 5.21 attacks per-statement heap/WAL/index
work rather than fsync batching across multiple commits.
Phase 6.19 — WAL fsync pipeline
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_autocommit --rows 1000 --table --engines axiomdb
Workload: one INSERT per transaction over the MySQL wire.
| Benchmark | AxiomDB | Target | Status |
|---|---|---|---|
insert_autocommit | 224 ops/s | >= 5,000 ops/s | ❌ |
What changed in 6.19:
- the old timer-based
CommitCoordinatorand its config knobs were removed - server DML commits now hand deferred durability to an always-on
leader-based
FsyncPipeline - queued followers can piggyback on a leader fsync when their
commit_lsnis already covered
What the benchmark taught us:
- the implementation is correct and wire-visible semantics remain intact
- but the target workload is sequential request/response autocommit
- the handler still waits for durability before it sends
OK - therefore the next statement cannot arrive while the current fsync is in flight, so single-connection piggyback never materializes
6.19 is closed as an implementation subphase, but this benchmark remains a
documented performance gap rather than a solved target.
Phase 6.18 — Indexed multi-row INSERT batch path
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_multi_values --rows 5000 --table
Workload: multi-row INSERT ... VALUES (...), (... ) statements against the
benchmark schema with PRIMARY KEY (id).
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
insert_multi_values on PK table | 160,581 rows/s | 259,854 rows/s | 321,002 rows/s |
What changed in 6.18:
- the immediate multi-row
VALUESpath no longer checkssecondary_indexes.is_empty()before using grouped heap writes - grouped heap/index apply was extracted into shared helpers reused by both:
- the transactional staging flush from
5.21 - the immediate
INSERT ... VALUES (...), (... )path
- the transactional staging flush from
- the immediate path keeps strict UNIQUE semantics by not reusing the staged
committed_emptyshortcut, because same-statement duplicate keys must still fail without leaking partial rows
heap_multi_insert() and DuckDB's appender both separate row staging from physical write. AxiomDB borrows the grouped physical apply idea, but rejects a blind bulk-load shortcut on the immediate path: duplicate keys inside one SQL statement must still be rejected before any partial batch becomes visible.
Phase 6.20 — UPDATE apply fast path
Measured with python3 benches/comparison/local_bench.py --scenario update_range --rows 5000 --table
against a release AxiomDB server and local MariaDB/MySQL instances on the same
machine. Workload: UPDATE bench_users SET score = score + 1 WHERE id BETWEEN ...
on a PK-indexed table.
| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
update_range | 618K rows/s | 291K rows/s | 369.9K rows/s | PK range UPDATE now stays on a batched read/apply path end-to-end |
What changed in 6.20:
IndexLookup/IndexRangecandidate rows are fetched throughread_rows_batch(...)instead of one heap read per RID- no-op UPDATE rows are filtered before heap/index mutation
- stable-RID rows batch
UpdateInPlaceWAL append withreserve_lsns(...) + write_batch(...) - UPDATE index maintenance now uses grouped delete+insert with one root persistence write per affected index
- both ctx and non-ctx UPDATE paths share a statement-level index bailout
This closes the dominant apply-side debt left behind after 6.17. The benchmark
improves by 4.3x over the 6.17 result (85.2K rows/s) and now beats the
documented local MySQL result on the same workload.
Phase 5.19 / 5.20 — DELETE WHERE and UPDATE Write Paths
Measured with python3 benches/comparison/local_bench.py --scenario all --rows 50000 --table
on the same Apple M2 Pro machine. The benchmark uses the MySQL wire protocol and a
bench_users table with PRIMARY KEY (id).
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 |
|---|---|---|---|---|
DELETE WHERE id > 25000 | 652K rows/s | 662K rows/s | 1.13M rows/s | 3.76M rows/s |
UPDATE ... WHERE active = TRUE | 662K rows/s | 404K rows/s | 648K rows/s | 270K rows/s |
5.19 removed the old per-row delete_in(...) loop by batching exact encoded keys
per index through delete_many_in(...). 5.20 finished the UPDATE recovery by
preserving the original RID whenever the rewritten row still fits in the same slot.
For UPDATE, the before/after delta is the important signal:
- Post-
5.19/ pre-5.20:52.9K rows/s - Post-
5.20:648K rows/s
That is a ~12.2× improvement on the same workload.
Phase 5.13 — Prepared Statement Plan Cache
Phase 5.13 introduces an AST-level plan cache for prepared statements. The full parse +
analyze pipeline runs once at COM_STMT_PREPARE time; each subsequent
COM_STMT_EXECUTE performs only a tree walk to substitute parameter values (~1 µs)
and then calls execute_stmt() directly.
| Path | Parse + Analyze | Param substitution | Total SQL overhead |
|---|---|---|---|
COM_QUERY (text protocol) | ~1.5 µs per call | — | ~1.5 µs |
COM_STMT_EXECUTE before 5.13 | ~1.5 µs per call (re-parse) | string replace | ~1.5 µs |
COM_STMT_EXECUTE after 5.13 | 0 (cached) | ~1 µs AST walk | ~1 µs |
The ~0.5 µs saving per execute is meaningful for high-frequency statement patterns (e.g., ORM-generated queries that re-execute the same SELECT or INSERT with different parameters on every request).
Remaining bottleneck: the dominant cost per COM_STMT_EXECUTE is now the WAL
transaction overhead (BEGIN/COMMIT I/O) rather than parse/analyze. For read-only
prepared statements, Phase 6 indexed reads will eliminate full-table scans, reducing
the per-query execution cost. For write statements, the Phase 8 batch API will coalesce
WAL entries, targeting the 180K ops/s budget.
Stmt (AST with resolved column indices)
rather than the original SQL string. This means each execute avoids both lexing and
semantic analysis, not just parsing. The trade-off is that the cached AST must be
cloned before parameter substitution to avoid mutating shared state — a shallow clone
of the expression tree is ~200 ns, well below the ~1.5 µs that parse + analyze would
cost. MySQL and PostgreSQL cache parsed + planned query trees for the same reason.
Running Benchmarks Locally
# B+ Tree
cargo bench --bench btree -p axiomdb-index
# Storage engine
cargo bench --bench storage -p axiomdb-storage
# SQL parser
cargo bench --bench parser -p axiomdb-sql
# All benchmarks
cargo bench --workspace
# Compare before/after a change
cargo bench -- --save-baseline before
# ... make change ...
cargo bench -- --baseline before
# Detailed comparison with critcmp
cargo install critcmp
critcmp before after
Benchmarks use Criterion.rs and emit JSON results to target/criterion/. Each
run reports mean, standard deviation, min, max, and throughput (ops/s or bytes/s
depending on the benchmark).