AxiomDB
AxiomDB is a database engine written in Rust, designed to be fast, correct, and modern — while remaining compatible with the MySQL wire protocol so existing applications can connect without driver changes.
Goals
| Goal | How |
|---|---|
| Faster than MySQL for read-heavy workloads | Copy-on-Write B+ Tree with lock-free readers |
| Crash-safe without the MySQL double-write buffer overhead | Append-only WAL, no double-write |
| Drop-in compatible with MySQL clients | MySQL wire protocol on port 3306 |
| Embeddable like SQLite | C FFI, no daemon required (Phase 10) |
| Modern SQL out of the box | Unicode-correct collation, strict mode, structured errors |
Two Usage Modes
┌─────────────────────┐ ┌──────────────────────────┐
│ SERVER MODE │ │ EMBEDDED MODE │
│ │ │ │
│ TCP :3306 │ │ Direct function call │
│ MySQL wire proto │ │ C FFI / Rust API │
│ PHP, Python, Node │ │ No network, no daemon │
└─────────────────────┘ └──────────────────────────┘
└─────────────────┬─────────────────┘
│
Same Rust engine
Current Status
AxiomDB is under active development. Phases 1–6 are substantially complete:
- ✅ Storage engine — mmap-based 16 KB pages, freelist, heap pages, CRC32c checksums
- ✅ B+ Tree — Copy-on-Write, lock-free readers, prefix compression, range scan
- ✅ WAL — append-only, crash recovery, Group Commit, PageWrite bulk optimization
- ✅ Catalog — schema management, DDL change notifications, MVCC-consistent reads
- ✅ SQL layer — full DDL + DML parser, expression evaluator, semantic analyzer
- ✅ Executor — SELECT/INSERT/UPDATE/DELETE, JOIN, GROUP BY + aggregates, ORDER BY, subqueries, CASE WHEN, DISTINCT, TRUNCATE, ALTER TABLE
- ✅ Secondary indexes — CREATE INDEX, UNIQUE, query planner (index lookup + range)
- ✅ MySQL wire protocol — port 3306, COM_QUERY, prepared statements, pymysql compatible
Current concurrency model: read-only queries run concurrently, but mutating
statements are still serialized through a database-wide Arc<RwLock<Database>>
write guard. Row-level locking and true concurrent writers are planned for
Phase 13.7+.
Performance highlights
| Operation | AxiomDB | vs competition |
|---|---|---|
| Bulk INSERT (multi-row, 10K rows) | 211K rows/s | 1.5× faster than MariaDB 12.1 |
| Full-table DELETE (10K rows) | 1M rows/s | 3× faster than MariaDB, 40× faster than MySQL 8.0 |
| Full scan SELECT (10K rows) | 212K rows/s | ≈ MySQL 8.0 |
| Simple SELECT parse | 492 ns | parity with MySQL |
| Range scan 10K rows | 0.61 ms | ~74× faster than the 45 ms MySQL target |
What Makes AxiomDB Different
1. No double-write buffer
MySQL InnoDB uses a double-write buffer to protect against partial page writes, adding significant write overhead. AxiomDB uses a WAL-first architecture — pages are protected by the write-ahead log, eliminating this overhead entirely.
2. Lock-free read path
The B+ Tree uses Copy-on-Write semantics with an atomic root pointer, so the
storage layer itself does not need per-page read latches. In the current server
runtime, read-only queries execute concurrently, while mutating statements are
still serialized by a database-wide RwLock write guard. Row-level write
concurrency is the next planned step.
3. Smart collation out of the box
Most databases require explicit COLLATE declarations for correct Unicode sorting. AxiomDB defaults to UCA root collation (language-neutral Unicode ordering) and can be configured to behave like MySQL or PostgreSQL for migrations.
4. Strict mode always on
AxiomDB rejects data truncation, invalid dates (0000-00-00), and silent type coercions that MySQL allows by default. With SET AXIOM_COMPAT = 'mysql', lenient behavior is restored for migration scenarios.
5. Structured error messages
Inspired by the Rust compiler, every error includes: what went wrong, which table/column was involved, the offending value, and a hint for how to fix it.
Parser Performance
AxiomDB’s SQL parser is 9–17× faster than sqlparser-rs (the production standard used by Apache Arrow DataFusion and Delta Lake):
| Query type | AxiomDB | sqlparser-rs | Speedup |
|---|---|---|---|
| Simple SELECT | 492 ns | 4.38 µs | 8.9× |
| Complex SELECT (multi-JOIN) | 2.74 µs | 27.0 µs | 9.8× |
| CREATE TABLE | 824 ns | 14.5 µs | 16.6× |
This is achieved through a zero-copy lexer (identifiers are &str slices into the input — no heap allocations) combined with a hand-written recursive descent parser.
sqlparser-rs is used by Apache Arrow DataFusion, Delta Lake, and InfluxDB — widely considered the production standard for Rust SQL parsing. The 9–17× speedup is measured single-threaded, parse-only. At 2M simple queries/s, parsing is never the bottleneck for any realistic OLTP workload.
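The zero-copy tokenization strategy can be illustrated with a toy lexer. This is a sketch of the idea only, not AxiomDB's implementation — AxiomDB's Rust lexer returns &str slices into the input, whereas Python slicing copies, so here tokens are (kind, start, end) index triples to make the "no allocation during scanning" point explicit:

```python
# Toy single-pass SQL lexer: each token is (kind, start, end), indices into
# the input string. No substring is materialized while scanning -- the same
# idea as returning &str slices in Rust. Illustrative sketch, not real code.

KEYWORDS = {"SELECT", "FROM", "WHERE"}

def lex(sql: str):
    tokens, i, n = [], 0, len(sql)
    while i < n:
        c = sql[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == "_":
            start = i
            while i < n and (sql[i].isalnum() or sql[i] == "_"):
                i += 1
            kind = "KW" if sql[start:i].upper() in KEYWORDS else "IDENT"
            tokens.append((kind, start, i))
        elif c.isdigit():
            start = i
            while i < n and sql[i].isdigit():
                i += 1
            tokens.append(("NUM", start, i))
        else:
            tokens.append(("SYM", i, i + 1))
            i += 1
    return tokens

def text(sql: str, tok) -> str:
    # Materialize a token's text only when actually needed.
    return sql[tok[1]:tok[2]]

sql = "SELECT id, name FROM users WHERE age > 20"
toks = lex(sql)
print([(k, text(sql, (k, s, e))) for k, s, e in toks])
```

A hand-written recursive descent parser would then consume this token stream directly, which is where the single-pass design pays off.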
Getting Started
AxiomDB is a relational database engine written in Rust. It supports standard SQL, ACID transactions, a Write-Ahead Log for crash recovery, and a Copy-on-Write B+ Tree for lock-free concurrent reads. This guide walks you through connecting to AxiomDB, choosing a usage mode, and running your first queries.
Choosing a Usage Mode
AxiomDB operates in two distinct modes that share the exact same engine code.
Server Mode
The engine runs as a standalone daemon that speaks the MySQL wire protocol on TCP port 3306 (configurable). Any MySQL-compatible client connects without installing custom drivers.
Application (PHP / Python / Node.js)
│
│ TCP :3306 (MySQL wire protocol)
▼
axiomdb-server process
│
▼
axiomdb.db axiomdb.wal
When to use server mode:
- Web applications with REST or GraphQL APIs
- Microservices where multiple processes share a database
- Any environment where you would normally use MySQL
Embedded Mode
The engine is compiled into your process as a shared library (.so / .dylib / .dll).
There is no daemon, no network, and no port. Calls go directly to Rust code with
microsecond latency.
Your Application (Rust / C++ / Python / Electron)
│
│ direct function call (C FFI / Rust crate)
▼
AxiomDB engine (in-process)
│
▼
axiomdb.db axiomdb.wal (local files)
When to use embedded mode:
- Desktop applications (Qt, Electron, Tauri)
- CLI tools that need a local database
- Python scripts that need fast local storage without a daemon
- Any context where SQLite would be considered
Mode Comparison
| Feature | Server Mode | Embedded Mode |
|---|---|---|
| Latency | ~0.1 ms (TCP loopback) | ~1 µs (in-process) |
| Multiple processes | Yes | No (one process) |
| Installation | Binary + port | Library only |
| Compatible clients | Any MySQL client | Rust crate / C FFI |
| Ideal for | Web, APIs, microservices | Desktop, CLI, scripts |
Interactive Shell (CLI)
The axiomdb-cli binary connects directly to a database file — no server needed.
It works like sqlite3 or psql:
# Open an existing database (or create a new one)
axiomdb-cli ./mydb.db
# Pipe SQL from a file
axiomdb-cli ./mydb.db < migration.sql
# One-liner
echo "SELECT COUNT(*) FROM users;" | axiomdb-cli ./mydb.db
Inside the shell:
AxiomDB 0.1.0 — interactive shell
Type SQL ending with ; to execute. Type .help for commands.
axiomdb> CREATE TABLE users (id INT, name TEXT);
OK (1ms)
axiomdb> INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
2 rows affected (0ms)
axiomdb> SELECT * FROM users;
+----+-------+
| id | name |
+----+-------+
| 1 | Alice |
| 2 | Bob |
+----+-------+
2 rows (0ms)
axiomdb> .tables
users
axiomdb> .schema users
Table: users
id INT NOT NULL
name TEXT nullable
axiomdb> .quit
Bye.
Dot commands: .help · .tables · .schema [table] · .open <path> · .quit
Keyboard shortcuts (interactive mode): ↑ / ↓ history · Tab SQL completion · Ctrl-R reverse search · Ctrl-C cancel line · Ctrl-D exit. History is saved to ~/.axiomdb_history between sessions.
Server Mode — Connecting
Starting the Server
# Default: stores data in ./data, listens on port 3306
axiomdb-server
# Legacy env vars
AXIOMDB_DATA=/var/lib/axiomdb AXIOMDB_PORT=3307 axiomdb-server
# DSN bootstrap (Phase 5.15)
AXIOMDB_URL='axiomdb://0.0.0.0:3307/axiomdb?data_dir=/var/lib/axiomdb' axiomdb-server
The server is ready when you see:
INFO axiomdb_server: listening on 0.0.0.0:3306
AXIOMDB_URL is normalized in shared core code first; the server then accepts only the fields it actually supports in Phase 5.15, rather than silently inventing meanings for extra options.
In Phase 5.15, AXIOMDB_URL supports axiomdb://, mysql://,
postgres://, and postgresql:// URI syntax. The alias schemes are parse
aliases only: axiomdb-server still speaks the MySQL wire protocol only.
Supported server DSN fields:
- host and port from the URI authority
- data_dir from the query string
Unsupported query params are rejected explicitly instead of being ignored.
Connecting with the mysql CLI
mysql -h 127.0.0.1 -P 3306 -u root
No password is required in Phase 5. Any username from the allowlist (root, axiomdb,
admin) is accepted. See the Authentication section below for details.
Connecting with Python (PyMySQL)
import pymysql
conn = pymysql.connect(
host='127.0.0.1',
port=3306,
user='root',
db='axiomdb',
charset='utf8mb4',
)
with conn.cursor() as cursor:
# CREATE TABLE with AUTO_INCREMENT
cursor.execute("""
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
email TEXT NOT NULL
)
""")
# INSERT — last_insert_id is returned in the OK packet
cursor.execute("INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com')")
print("inserted id:", cursor.lastrowid)
# SELECT
cursor.execute("SELECT id, name FROM users")
for row in cursor.fetchall():
print(row)
conn.close()
If you issue many consecutive INSERT statements, wrap them in
an explicit BEGIN ... COMMIT. Phase 5.21 stages consecutive
INSERT ... VALUES statements in one transaction and flushes them together,
which is much faster than committing each row independently.
Parameterized Queries and ORMs (Prepared Statements)
When you pass parameters to cursor.execute(), PyMySQL (and any MySQL-compatible
driver) automatically uses COM_STMT_PREPARE / COM_STMT_EXECUTE — the MySQL
binary prepared statement protocol. AxiomDB supports this natively from Phase 5.10.
import pymysql
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', db='axiomdb')
with conn.cursor() as cursor:
cursor.execute("""
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
price DOUBLE NOT NULL,
active BOOL NOT NULL DEFAULT TRUE
)
""")
conn.commit()
# Parameterized INSERT — uses COM_STMT_PREPARE/EXECUTE automatically
cursor.execute(
"INSERT INTO products (name, price, active) VALUES (%s, %s, %s)",
('Wireless Keyboard', 49.99, True),
)
# NULL parameters work transparently
cursor.execute(
"INSERT INTO products (name, price, active) VALUES (%s, %s, %s)",
('USB-C Hub', 29.99, None),
)
# Parameterized SELECT
cursor.execute("SELECT id, name, price FROM products WHERE price < %s", (50.0,))
for row in cursor.fetchall():
print(row)
# Boolean column comparison works with integer literals (MySQL-compatible)
cursor.execute("SELECT name FROM products WHERE active = %s", (1,))
for row in cursor.fetchall():
print(row)
conn.close()
ORMs such as SQLAlchemy use parameterized queries for all data-bearing operations. Connecting through the MySQL dialect works without any additional configuration:
from sqlalchemy import create_engine, text
engine = create_engine("mysql+pymysql://root@127.0.0.1:3306/axiomdb")
with engine.connect() as conn:
result = conn.execute(
text("SELECT id, name FROM products WHERE price < :max_price"),
{"max_price": 40.0},
)
for row in result:
print(row)
Under the hood, cursor.execute(sql, params) sends a COM_STMT_PREPARE
to parse the SQL and register a statement ID, followed by COM_STMT_EXECUTE
with the binary-encoded parameters. The statement is cached per connection in AxiomDB
and released with COM_STMT_CLOSE when the cursor closes. This matches the
behavior expected by PyMySQL, mysqlclient, and SQLAlchemy's MySQL dialect.
Connecting with PHP (PDO)
<?php
$pdo = new PDO(
'mysql:host=127.0.0.1;port=3306;dbname=axiomdb',
'root',
'',
[PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);
$stmt = $pdo->query('SELECT id, name FROM users LIMIT 5');
foreach ($stmt as $row) {
echo $row['id'] . ': ' . $row['name'] . "\n";
}
Connecting with any MySQL GUI
Point MySQL Workbench, DBeaver, or TablePlus to 127.0.0.1:3306. No driver
installation is required — the MySQL wire protocol is fully compatible.
Charset and collation
AxiomDB negotiates charset and collation at the MySQL handshake boundary. The client
sends its preferred collation id in the HandshakeResponse41 packet; the server reads
it and configures the session accordingly.
Supported charsets:
| Charset | Collation ids | Notes |
|---|---|---|
| utf8mb4 | 45 (0900_ai_ci), 46 (0900_as_cs), 255 (0900_ai_ci) | Default for new connections |
| utf8 / utf8mb3 | 33 (general_ci), 83 (bin) | BMP-only; 4-byte code points (emoji) rejected |
| latin1 | 8 (swedish_ci), 47 (bin) | MySQL latin1 = Windows-1252 (0x80 = ‘€’, not ISO-8859-1) |
| binary | 63 | Raw bytes, no transcoding |
You can change the session charset at any time:
SET NAMES utf8mb4; -- sets client + connection + results
SET NAMES latin1 COLLATE latin1_bin; -- with explicit collation
SET character_set_results = utf8mb4; -- results charset only
Recommended: set charset='utf8mb4' in your client connection string. The AxiomDB
engine stores everything as UTF-8; utf8mb4 requires zero transcoding overhead and
supports the full Unicode range including emoji. Latin1 connections are supported
for legacy PHP/MySQL applications.
Authentication
AxiomDB Phase 5 uses permissive authentication: the server accepts any password
for usernames in the allowlist (root, axiomdb, admin, and the empty string).
Both of the most common MySQL authentication plugins are supported with no client-side
configuration required:
| Plugin | Clients | Notes |
|---|---|---|
| mysql_native_password | MySQL 5.x clients, older PyMySQL, mysql2 < 0.5 | 3-packet handshake (greeting → response → OK) |
| caching_sha2_password | MySQL 8.0+ default, PyMySQL >= 1.0, MySQL Connector/Python | 5-packet handshake (greeting → response → fast_auth_success → ack → OK) |
If your client connects with MySQL 8.0+ defaults and you see silent connection drops,
your client is using caching_sha2_password — AxiomDB handles this automatically.
No --default-auth flag or authPlugin option is needed.
Full password enforcement with stored credentials is planned for Phase 13 (Security).
Many MySQL clients and ORMs issue housekeeping queries on connect (SET NAMES,
SELECT @@version, SHOW DATABASES, etc.). AxiomDB intercepts and stubs these
automatically — no configuration needed.
Monitoring with SHOW STATUS
Monitoring tools, proxy servers, and health checks can query live server counters
using the standard MySQL SHOW STATUS syntax:
SHOW STATUS
SHOW GLOBAL STATUS
SHOW SESSION STATUS
SHOW STATUS LIKE 'Threads%'
SHOW GLOBAL STATUS LIKE 'Com_%'
Available variables:
| Variable | Scope | Description |
|---|---|---|
| Uptime | Global | Seconds since server start |
| Threads_connected | Global | Currently authenticated connections |
| Threads_running | Global | Connections actively executing a command |
| Questions | Session + Global | Total statements executed |
| Bytes_received | Session + Global | Bytes received from clients |
| Bytes_sent | Session + Global | Bytes sent to clients |
| Com_select | Session + Global | SELECT statement count |
| Com_insert | Session + Global | INSERT statement count |
| Innodb_buffer_pool_read_requests | Global | Storage read requests (compatibility) |
| Innodb_buffer_pool_reads | Global | Physical page reads (compatibility) |
Session scope (SHOW STATUS, SHOW SESSION STATUS, SHOW LOCAL STATUS) returns
per-connection values. Global scope (SHOW GLOBAL STATUS) returns server-wide totals.
Session counters reset when a connection is closed or COM_RESET_CONNECTION is issued.
Connection Timeout Variables
AxiomDB exposes the same timeout variables that MySQL clients expect at the session level:
SET wait_timeout = 30;
SET interactive_timeout = 300;
SET net_read_timeout = 60;
SET net_write_timeout = 60;
SELECT @@wait_timeout;
SELECT @@interactive_timeout;
SELECT @@net_read_timeout;
SELECT @@net_write_timeout;
Rules:
- wait_timeout applies while a non-interactive connection is idle between commands.
- interactive_timeout applies instead when the client connected with CLIENT_INTERACTIVE.
- net_write_timeout bounds packet writes once a command is already executing.
- net_read_timeout is reserved for future in-flight protocol reads and is already validated/stored as a real session variable.
- COM_RESET_CONNECTION resets all four variables back to their defaults.
Trying to set one of these variables to 0 or to a non-integer value returns an
error:
SET wait_timeout = 0;
-- ERROR ... wait_timeout must be a positive integer, got '0'
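The validation rule above can be mimicked in a few lines. This is a hypothetical helper, not AxiomDB code — the variable names and error wording simply follow the documented behavior:

```python
# Sketch of the session-variable validation rule described above:
# timeout variables must be positive integers; anything else is rejected.
TIMEOUT_VARS = {"wait_timeout", "interactive_timeout",
                "net_read_timeout", "net_write_timeout"}

def set_timeout(session: dict, name: str, value) -> None:
    if name not in TIMEOUT_VARS:
        raise ValueError(f"unknown system variable '{name}'")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got '{value}'")
    session[name] = value

session = {}
set_timeout(session, "wait_timeout", 30)
print(session)  # {'wait_timeout': 30}
```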
Embedded Mode — Rust API
Add AxiomDB to your Cargo.toml:
[dependencies]
axiomdb-embedded = { path = "../axiomdb/crates/axiomdb-embedded" }
Open a Database
use axiomdb_embedded::Db;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut db = Db::open("./axiomdb.db")?;
let mut db2 = Db::open_dsn("file:/tmp/axiomdb.db")?;
let mut db3 = Db::open_dsn("axiomdb:///tmp/axiomdb")?;
db.execute("CREATE TABLE users (id INT, name TEXT, age INT)")?;
db.execute("INSERT INTO users VALUES (1, 'Alice', 30)")?;
db.execute("INSERT INTO users VALUES (2, 'Bob', 25)")?;
let (columns, rows) = db.query_with_columns(
"SELECT id, name, age FROM users WHERE age > 20 ORDER BY name"
)?;
println!("{columns:?}");
for row in rows {
println!("{row:?}");
}
Ok(())
}
Db::open_dsn(...) accepts only local DSNs in Phase 5.15. Remote
wire-endpoint DSNs such as postgres://... parse successfully in the shared
parser but are rejected by the embedded API.
Explicit Transactions
#![allow(unused)]
fn main() {
let mut db = axiomdb_embedded::Db::open("./axiomdb.db")?;
db.begin()?;
db.execute("INSERT INTO accounts VALUES (1, 'Alice', 1000.0)")?;
db.execute("INSERT INTO accounts VALUES (2, 'Bob', 500.0)")?;
db.commit()?;
}
Embedded Mode — C FFI
For C, C++, Qt, or Java (JNI):
#include "axiomdb.h"
int main(void) {
AxiomDb* db = axiomdb_open("./axiomdb.db");
AxiomDb* db2 = axiomdb_open_dsn("file:/tmp/axiomdb.db");
if (!db) { fprintf(stderr, "failed to open\n"); return 1; }
axiomdb_execute(db, "CREATE TABLE users (id INT, name TEXT)");
axiomdb_execute(db, "INSERT INTO users VALUES (1, 'Alice')");
axiomdb_close(db);
axiomdb_close(db2);
return 0;
}
Python via ctypes
import ctypes
lib = ctypes.CDLL("./libaxiomdb.dylib")
lib.axiomdb_open.restype = ctypes.c_void_p
lib.axiomdb_open_dsn.restype = ctypes.c_void_p
lib.axiomdb_close.argtypes = [ctypes.c_void_p]
lib.axiomdb_execute.restype = ctypes.c_longlong
db = lib.axiomdb_open(b"./axiomdb.db")
db2 = lib.axiomdb_open_dsn(b"file:/tmp/axiomdb.db")
lib.axiomdb_execute(db, b"CREATE TABLE t (id INT)")
lib.axiomdb_close(db)
lib.axiomdb_close(db2)
Your First Schema — End to End
The following example creates a minimal e-commerce schema, inserts sample data, and runs a join query — all within embedded mode.
-- Create tables
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
price DECIMAL NOT NULL,
stock INT NOT NULL DEFAULT 0
);
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
product_id BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
quantity INT NOT NULL,
placed_at TIMESTAMP NOT NULL
);
CREATE INDEX idx_orders_product ON orders (product_id);
-- Insert data
INSERT INTO products (name, price, stock) VALUES
('Wireless Keyboard', 49.99, 200),
('USB-C Hub', 29.99, 500),
('Mechanical Mouse', 39.99, 150);
INSERT INTO orders (product_id, quantity, placed_at) VALUES
(1, 2, '2026-03-01 10:00:00'),
(2, 1, '2026-03-02 14:30:00'),
(1, 1, '2026-03-03 09:15:00');
-- Query with JOIN
SELECT
p.name,
o.quantity,
p.price * o.quantity AS line_total,
o.placed_at
FROM orders o
JOIN products p ON p.id = o.product_id
ORDER BY o.placed_at;
Expected output:
| name | quantity | line_total | placed_at |
|---|---|---|---|
| Wireless Keyboard | 2 | 99.98 | 2026-03-01 10:00:00 |
| USB-C Hub | 1 | 29.99 | 2026-03-02 14:30:00 |
| Wireless Keyboard | 1 | 49.99 | 2026-03-03 09:15:00 |
Bulk Insert — Best Practices
The way you issue INSERT statements has a large impact on throughput. AxiomDB is optimized for the multi-row VALUES form — one SQL string with all N rows:
-- Fast: one SQL string, all rows in one VALUES clause (~211K rows/s for 10K rows)
INSERT INTO products (name, price, stock) VALUES
('Widget A', 9.99, 100),
('Widget B', 14.99, 50),
('Widget C', 4.99, 200);
# Python — build one multi-row string, one execute() call
rows = [(f"product_{i}", i * 1.5, i * 10) for i in range(10_000)]
placeholders = ", ".join("(%s, %s, %s)" for _ in rows)
flat_values = [v for row in rows for v in row]
cursor.execute(f"INSERT INTO products (name, price, stock) VALUES {placeholders}",
flat_values)
conn.commit()
Why this matters: issuing N separate INSERT statements each pays its own parse + analyze overhead (~20 µs per string). A single multi-row string pays that cost once for all rows.
| Approach | Throughput |
|---|---|
| Multi-row VALUES (1 string, N rows) | 211K rows/s — recommended |
| N separate INSERT strings (1 txn) | ~35K rows/s — 6× slower |
| N separate autocommit INSERTs | ~58 q/s — 1 fsync per row |
For very large imports, split the rows into batches and wrap each batch in an
explicit BEGIN … COMMIT block. This limits WAL growth per transaction while
keeping throughput high. See Transactions for Group Commit configuration,
which further improves concurrent write throughput.
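The batching advice can be sketched as a generator that turns a row list into multi-row INSERT statements of bounded size. This is an illustrative helper (the table and column names are examples); each (sql, params) pair would be passed to cursor.execute() inside its own transaction:

```python
def batched_inserts(table, columns, rows, batch_size=10_000):
    """Yield (sql, flat_params) pairs, one multi-row INSERT per batch."""
    row_tpl = "(" + ", ".join(["%s"] * len(columns)) + ")"
    cols = ", ".join(columns)
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        sql = (f"INSERT INTO {table} ({cols}) VALUES "
               + ", ".join([row_tpl] * len(chunk)))
        # Flatten the row tuples into one parameter list for the driver.
        yield sql, [v for row in chunk for v in row]

rows = [(f"product_{i}", i * 1.5, i * 10) for i in range(25_000)]
batches = list(batched_inserts("products", ["name", "price", "stock"], rows))
print(len(batches))  # 3 batches: 10K + 10K + 5K rows
```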
Next Steps
- SQL Reference — Data Types — full type system
- SQL Reference — DDL — CREATE TABLE, indexes, constraints
- SQL Reference — DML — SELECT, INSERT, UPDATE, DELETE
- Transactions — BEGIN, COMMIT, ROLLBACK, MVCC
- Performance — benchmark numbers and tuning tips
SQL Reference
This section covers the complete SQL dialect supported by AxiomDB.
- Data Types — all supported column types with storage sizes and usage examples
- DDL — Schema Definition — CREATE TABLE, CREATE INDEX, DROP TABLE, DROP INDEX, constraints
- DML — Queries & Mutations — SELECT, INSERT, UPDATE, DELETE with full clause reference
- Expressions & Operators — operators, functions, NULL semantics, LIKE, IN, BETWEEN
Data Types
AxiomDB implements a rich type system that covers the common SQL standard types as well as several extensions for modern workloads (UUID, JSON, VECTOR for AI embeddings, RANGE types for temporal and numeric overlaps).
Integer Types
| SQL Type | Aliases | Storage | Rust type | Range |
|---|---|---|---|---|
| BOOL | BOOLEAN | 1 byte | bool | TRUE / FALSE |
| TINYINT | INT1 | 1 byte | i8 | -128 to 127 |
| UTINYINT | UINT1 | 1 byte | u8 | 0 to 255 |
| SMALLINT | INT2 | 2 bytes | i16 | -32,768 to 32,767 |
| USMALLINT | UINT2 | 2 bytes | u16 | 0 to 65,535 |
| INT | INTEGER, INT4 | 4 bytes | i32 | -2,147,483,648 to 2,147,483,647 |
| UINT | UINT4 | 4 bytes | u32 | 0 to 4,294,967,295 |
| BIGINT | INT8 | 8 bytes | i64 | -9.2 × 10¹⁸ to 9.2 × 10¹⁸ |
| UBIGINT | UINT8 | 8 bytes | u64 | 0 to 18.4 × 10¹⁸ (used for LSN, page_id) |
| HUGEINT | INT16 | 16 bytes | i128 | ±1.7 × 10³⁸ (cryptography, checksums) |
-- Typical primary key
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
age SMALLINT NOT NULL
);
-- Unsigned counter that never goes negative
CREATE TABLE page_views (
page_id INT NOT NULL,
views UINT NOT NULL DEFAULT 0
);
Floating-Point Types
| SQL Type | Aliases | Storage | Rust type | Notes |
|---|---|---|---|---|
| REAL | FLOAT4, FLOAT | 4 bytes | f32 | Coordinates, ratings, embeddings |
| DOUBLE | FLOAT8, DOUBLE PRECISION | 8 bytes | f64 | Scientific calculations |
NaN is forbidden. The row codec rejects NaN values at encode time. IEEE 754 infinities are also not accepted by default.
-- Geospatial coordinates (4-byte precision is sufficient)
CREATE TABLE locations (
id INT PRIMARY KEY,
lat REAL NOT NULL,
lon REAL NOT NULL
);
-- Scientific measurements requiring high precision
CREATE TABLE experiments (
id INT PRIMARY KEY,
result DOUBLE NOT NULL
);
Exact Numeric — DECIMAL
| SQL Type | Aliases | Storage | Rust type | Notes |
|---|---|---|---|---|
| DECIMAL(p, s) | NUMERIC(p, s) | 17 bytes | i128 + u8 scale | Exact arithmetic, no float error |
Always use DECIMAL for money. Floating-point types cannot represent
0.1 + 0.2 exactly; DECIMAL always can.
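The float-error claim is easy to verify with Python's decimal module, which uses the same exact-arithmetic idea as a DECIMAL column:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
print(0.1 + 0.2)                            # 0.30000000000000004
# Exact decimal arithmetic never drifts.
print(Decimal("0.1") + Decimal("0.2"))      # 0.3
print(Decimal("199.99") * Decimal("0.19"))  # 37.9981, exactly
```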
CREATE TABLE invoices (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
subtotal DECIMAL NOT NULL, -- DECIMAL without precision = DECIMAL(38,0)
tax_rate DECIMAL NOT NULL,
total DECIMAL NOT NULL
);
-- Insert with exact values
INSERT INTO invoices (subtotal, tax_rate, total)
VALUES (199.99, 0.19, 237.99);
-- Arithmetic is always exact
SELECT subtotal * tax_rate AS computed_tax FROM invoices WHERE id = 1;
-- Returns: 37.9981 (never 37.99809999999...)
The internal codec stores DECIMAL as a 16-byte little-endian i128 mantissa followed
by a 1-byte scale (total 17 bytes per non-NULL value).
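The 17-byte layout described above can be sketched as follows — an illustrative encoder based on the description, not the actual codec:

```python
def encode_decimal(mantissa: int, scale: int) -> bytes:
    # 16-byte little-endian signed i128 mantissa followed by a 1-byte scale,
    # as described above: 17 bytes total per non-NULL value.
    return mantissa.to_bytes(16, "little", signed=True) + bytes([scale])

def decode_decimal(buf: bytes):
    return int.from_bytes(buf[:16], "little", signed=True), buf[16]

# 199.99 == mantissa 19999 at scale 2
buf = encode_decimal(19999, 2)
print(len(buf), decode_decimal(buf))  # 17 (19999, 2)
```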
Text Types
| SQL Type | Aliases | Max length | Rust type | Notes |
|---|---|---|---|---|
| CHAR(n) | — | n bytes (fixed) | [u8; n] | Right-padded with spaces |
| VARCHAR(n) | — | n bytes (max) | String | Variable-length, UTF-8 |
| TEXT | — | 16,777,215 bytes | String | Unlimited (TOAST if >16 KB) |
| CITEXT | — | 16,777,215 bytes | String | Case-insensitive comparison |
The codec encodes TEXT and VARCHAR with a 3-byte (u24) length prefix followed by
raw UTF-8 bytes. This limits inline storage to 16,777,215 bytes; values larger than a
page use TOAST (planned Phase 6).
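A sketch of that length-prefixed layout (illustrative only, not the real codec; the prefix byte order is an assumption):

```python
MAX_INLINE = (1 << 24) - 1  # 16,777,215 -- the u24 limit quoted above

def encode_text(s: str) -> bytes:
    raw = s.encode("utf-8")
    if len(raw) > MAX_INLINE:
        raise ValueError("value exceeds u24 inline limit (would go to TOAST)")
    # 3-byte (u24) length prefix, then raw UTF-8 bytes.
    # Byte order of the prefix is an assumption for illustration.
    return len(raw).to_bytes(3, "little") + raw

def decode_text(buf: bytes) -> str:
    n = int.from_bytes(buf[:3], "little")
    return buf[3:3 + n].decode("utf-8")

buf = encode_text("héllo")
print(len(buf), decode_text(buf))  # 9 héllo  ('é' takes 2 UTF-8 bytes)
```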
-- Fixed-length codes (ISO country, state abbreviations)
CREATE TABLE countries (
code CHAR(2) PRIMARY KEY, -- 'US', 'DE', 'JP'
name VARCHAR(128) NOT NULL
);
-- Unlimited text content
CREATE TABLE blog_posts (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
title VARCHAR(512) NOT NULL,
body TEXT NOT NULL
);
-- Case-insensitive email lookup
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email CITEXT NOT NULL UNIQUE
);
-- SELECT * FROM users WHERE email = 'ALICE@EXAMPLE.COM'
-- matches rows where email = 'alice@example.com'
Binary Type
| SQL Type | Aliases | Max length | Rust type | Notes |
|---|---|---|---|---|
| BYTEA | BLOB, BYTES | 16,777,215 bytes | Vec<u8> | Raw bytes, hex display |
CREATE TABLE attachments (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
content BYTEA NOT NULL
);
-- Insert binary with hex literal
INSERT INTO attachments (name, content) VALUES ('icon.png', X'89504e47');
-- Display as hex
SELECT name, encode(content, 'hex') FROM attachments;
Date and Time Types
| SQL Type | Storage | Internal repr | Notes |
|---|---|---|---|
| DATE | 4 bytes | i32 days since 1970-01-01 | No time component |
| TIME | 8 bytes | i64 µs since midnight | No timezone |
| TIMETZ | 12 bytes | i64 µs + i32 offset | Time with timezone offset |
| TIMESTAMP | 8 bytes | i64 µs since UTC epoch | Without timezone (ambiguous) |
| TIMESTAMPTZ | 8 bytes | i64 µs UTC | Recommended. Always UTC internally |
| INTERVAL | 16 bytes | i32 months + i32 days + i64 µs | Correct calendar arithmetic |
Prefer TIMESTAMPTZ over TIMESTAMP. Without a timezone, there is no way to determine the absolute instant when the server and client are in different timezones. TIMESTAMPTZ stores everything as UTC and converts on display.
CREATE TABLE events (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
title TEXT NOT NULL,
starts_at TIMESTAMPTZ NOT NULL,
ends_at TIMESTAMPTZ NOT NULL,
duration INTERVAL
);
INSERT INTO events (title, starts_at, ends_at, duration)
VALUES (
'Team meeting',
'2026-03-21 10:00:00+00',
'2026-03-21 11:00:00+00',
'1 hour'
);
INTERVAL — Calendar-Correct Arithmetic
INTERVAL separates months, days, and microseconds because they are not fixed durations:
- “1 month” added to January 31 gives February 28 (or 29).
- “1 day” during a DST transition can be 23 or 25 hours.
-- Add 1 month to a date (calendar-aware)
SELECT '2026-01-31'::DATE + INTERVAL '1 month'; -- 2026-02-28
-- Add 30 days (fixed)
SELECT '2026-01-31'::DATE + INTERVAL '30 days'; -- 2026-03-02
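The clamping behavior shown in the SQL above can be reproduced with the standard library. A sketch of the month-addition rule, using calendar.monthrange to find the target month's length:

```python
import calendar
from datetime import date, timedelta

def add_months(d: date, months: int) -> date:
    # Calendar-aware: clamp the day to the target month's length,
    # matching the INTERVAL '1 month' behavior shown above.
    m = d.month - 1 + months
    year, month = d.year + m // 12, m % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

print(add_months(date(2026, 1, 31), 1))        # 2026-02-28 (clamped)
print(date(2026, 1, 31) + timedelta(days=30))  # 2026-03-02 (fixed 30 days)
```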
UUID
| SQL Type | Storage | Notes |
|---|---|---|
| UUID | 16 bytes | Stored as raw 16 bytes, displayed as hex |
CREATE TABLE sessions (
id UUID PRIMARY KEY DEFAULT gen_uuid_v7(),
user_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL
);
UUID v7 vs v4 as Primary Key:
| Strategy | Insert rate (1M rows) | Reason |
|---|---|---|
| UUID v4 | ~150k inserts/s | Random → many B+ Tree page splits |
| UUID v7 | ~250k inserts/s | Time-ordered prefix → nearly sequential |
| BIGINT | ~280k inserts/s | Fully sequential |
For new schemas, prefer UUID v7 (time-sortable) or BIGINT AUTO_INCREMENT.
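The v7 advantage comes entirely from its time-ordered prefix. A minimal v7-style generator following the RFC 9562 layout — a sketch for illustration, not AxiomDB's gen_uuid_v7:

```python
import os
import time
import uuid

def uuid_v7() -> uuid.UUID:
    # 48-bit big-endian unix-ms timestamp, then version/variant bits,
    # then randomness -- so later UUIDs compare higher, like the table shows.
    ms = time.time_ns() // 1_000_000
    b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
    b[6] = (b[6] & 0x0F) | 0x70  # version 7
    b[8] = (b[8] & 0x3F) | 0x80  # RFC 4122/9562 variant
    return uuid.UUID(bytes=bytes(b))

a = uuid_v7()
time.sleep(0.002)
b = uuid_v7()
print(a < b)  # True: time-ordered, so B+ Tree inserts stay nearly sequential
```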
Network Types
| SQL Type | Storage | Notes |
|---|---|---|
| INET | 16 bytes | IPv4 or IPv6 address |
| CIDR | 17 bytes | IP network with prefix mask |
| MACADDR | 6 bytes | MAC address |
CREATE TABLE access_log (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
client_ip INET NOT NULL,
network CIDR,
mac MACADDR
);
JSON / JSONB
| SQL Type | Aliases | Notes |
|---|---|---|
| JSON | JSONB | Stored as serialized JSON; TOAST if > 2 KB |
CREATE TABLE api_responses (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
endpoint TEXT NOT NULL,
payload JSON NOT NULL
);
INSERT INTO api_responses (endpoint, payload)
VALUES ('/users', '{"count": 42, "items": []}');
VECTOR — AI Embeddings
| SQL Type | Storage | Notes |
|---|---|---|
| VECTOR(n) | 4n bytes | Array of n 32-bit floats (f32) |
-- Store sentence embeddings from an AI model
CREATE TABLE documents (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
content TEXT NOT NULL,
embedding VECTOR(384) NOT NULL -- e.g. all-MiniLM-L6-v2 output
);
-- Approximate nearest-neighbor search (ANN index required)
SELECT id, content
FROM documents
ORDER BY embedding <-> '[0.12, 0.34, ...]'::vector
LIMIT 10;
RANGE Types
RANGE types represent a continuous span of a base type, with inclusive/exclusive
bounds. They support containment (@>), overlapping (&&), and
exclusion constraints.
| SQL Type | Base type | Example |
|---|---|---|
| INT4RANGE | INT | [1, 100) |
| INT8RANGE | BIGINT | [1000, 9999] |
| DATERANGE | DATE | [2026-01-01, 2026-12-31] |
| TSRANGE | TIMESTAMP | [2026-01-01 09:00, ...) |
| TSTZRANGE | TIMESTAMPTZ | timezone-aware variant |
-- Prevent overlapping reservations using an exclusion constraint
CREATE TABLE room_reservations (
room_id INT NOT NULL,
period TSRANGE NOT NULL,
EXCLUDE USING gist(room_id WITH =, period WITH &&)
);
INSERT INTO room_reservations VALUES (1, '[2026-03-21 09:00, 2026-03-21 11:00)');
-- This next insert fails: the period overlaps with the existing row
INSERT INTO room_reservations VALUES (1, '[2026-03-21 10:00, 2026-03-21 12:00)');
-- ERROR: exclusion constraint violation
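The overlap test behind && can be expressed for half-open [start, end) ranges — a sketch of the semantics, not engine code (ISO-formatted timestamp strings compare correctly as plain strings, which keeps the example self-contained):

```python
def overlaps(a, b) -> bool:
    # Half-open [start, end) ranges overlap iff each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def contains(r, point) -> bool:
    # The @> containment operator, applied to a single point.
    return r[0] <= point < r[1]

existing = ("2026-03-21 09:00", "2026-03-21 11:00")
incoming = ("2026-03-21 10:00", "2026-03-21 12:00")
print(overlaps(existing, incoming))  # True -> the exclusion constraint rejects it
# Touching bounds do NOT overlap with half-open ranges:
print(overlaps(existing, ("2026-03-21 11:00", "2026-03-21 12:00")))  # False
```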
NULL in Every Type
Every column of every type can hold NULL unless declared NOT NULL. The row codec
stores a compact null bitmap at the start of each row (1 bit per column), so NULL
costs only 1 bit of overhead regardless of the underlying type size.
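The bitmap described above can be sketched like this — illustrative only; the LSB-first bit order is an assumption:

```python
def null_bitmap(row) -> bytes:
    # One bit per column (bit set => NULL), packed LSB-first within each
    # byte. Bit order is an assumption for illustration.
    nbytes = (len(row) + 7) // 8
    bitmap = bytearray(nbytes)
    for i, v in enumerate(row):
        if v is None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

row = [1, None, "Alice", None, 3.5]
print(null_bitmap(row).hex())  # 0a -> bits 1 and 3 set, one byte for 5 columns
```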
SELECT NULL + 5; -- NULL (any arithmetic with NULL propagates NULL)
SELECT NULL = NULL; -- NULL (not TRUE — use IS NULL instead)
SELECT NULL IS NULL; -- TRUE
SELECT COALESCE(NULL, 0); -- 0 (return first non-NULL argument)
See Expressions & Operators for the full NULL semantics table.
DDL — Schema Definition Language
DDL statements define and modify the structure of the database: tables, columns, constraints, and indexes. All DDL operations are transactional in AxiomDB — a failed DDL statement is automatically rolled back.
CREATE DATABASE
Creates a new logical database in the persisted catalog.
Syntax
CREATE DATABASE database_name;
Example
CREATE DATABASE analytics;
SHOW DATABASES;
Expected output includes:
| Database |
|---|
| analytics |
| axiomdb |
CREATE DATABASE fails if the name already exists:
CREATE DATABASE analytics;
-- ERROR 1007 (HY000): Can't create database 'analytics'; database exists
DROP DATABASE
Removes a logical database from the catalog.
Syntax
DROP DATABASE database_name;
DROP DATABASE IF EXISTS database_name;
Behavior
- Removing a database also removes the tables it owns from SQL/catalog lookup.
- IF EXISTS suppresses the error for a missing database.
- The current connection cannot drop the database it has selected with USE.
DROP DATABASE analytics;
DROP DATABASE IF EXISTS scratch;
USE analytics;
DROP DATABASE analytics;
-- ERROR 1105 (HY000): Can't drop database 'analytics'; database is currently selected
CREATE DATABASE and DROP DATABASE are catalog-backed today, but
cross-database queries such as other_db.public.users are still deferred to the
next multi-database subphase.
CREATE TABLE
Basic Syntax
CREATE TABLE [IF NOT EXISTS] table_name (
column_name data_type [column_constraints...],
...
[table_constraints...]
);
Column Constraints
NOT NULL
Rejects any attempt to insert or update a row with a NULL value in this column.
CREATE TABLE employees (
id BIGINT NOT NULL,
name TEXT NOT NULL,
dept TEXT -- nullable: dept may be unassigned
);
DEFAULT
Provides a value when the column is omitted from INSERT.
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
status TEXT NOT NULL DEFAULT 'pending',
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
priority INT NOT NULL DEFAULT 0
);
-- Default values are used automatically
INSERT INTO orders (status) VALUES ('shipped');
-- Row: id=<auto>, status='shipped', created_at=<now>, priority=0
PRIMARY KEY
Declares a column (or set of columns) as the primary key. A primary key:
- Implies NOT NULL
- Creates a unique B+ Tree index automatically
- Is the target of REFERENCES in foreign keys
-- Single-column primary key
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL
);
-- Composite primary key (declared as table constraint)
CREATE TABLE order_items (
order_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
quantity INT NOT NULL,
PRIMARY KEY (order_id, product_id)
);
UNIQUE
Guarantees no two rows share the same value in this column (or set of columns). NULL values are excluded from uniqueness checks — multiple NULLs are allowed.
CREATE TABLE accounts (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email TEXT NOT NULL UNIQUE,
username TEXT NOT NULL UNIQUE
);
AUTO_INCREMENT / SERIAL
Automatically generates a monotonically increasing integer for each new row. The counter starts at 1 and increments by 1 for each inserted row. The following spellings are accepted:
-- MySQL-style
id BIGINT PRIMARY KEY AUTO_INCREMENT
-- PostgreSQL-style shorthand (SERIAL = INT AUTO_INCREMENT, BIGSERIAL = BIGINT AUTO_INCREMENT)
id SERIAL PRIMARY KEY
id BIGSERIAL PRIMARY KEY
Behavior:
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL
);
-- Omit the AUTO_INCREMENT column — the engine generates the value
INSERT INTO users (name) VALUES ('Alice'); -- id = 1
INSERT INTO users (name) VALUES ('Bob'); -- id = 2
-- Retrieve the last generated ID (current session only)
SELECT LAST_INSERT_ID(); -- returns 2
SELECT lastval(); -- PostgreSQL alias — same result
-- Multi-row INSERT: LAST_INSERT_ID() returns the ID of the FIRST row in the batch
INSERT INTO users (name) VALUES ('Carol'), ('Dave'); -- ids: 3, 4
SELECT LAST_INSERT_ID(); -- returns 3
-- Explicit non-NULL value bypasses the sequence and does NOT advance it
INSERT INTO users (id, name) VALUES (100, 'Eve');
-- id=100; sequence remains at 4; next auto id will be 5
LAST_INSERT_ID() returns 0 if no auto-increment INSERT has been performed
in the current session. See LAST_INSERT_ID() in expressions
for the full function reference.
TRUNCATE resets the counter:
TRUNCATE TABLE users;
INSERT INTO users (name) VALUES ('Frank'); -- id = 1 (reset by TRUNCATE)
REFERENCES — Foreign Keys
Declares a foreign key relationship to another table’s primary key.
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
product_id BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
placed_at TIMESTAMP NOT NULL
);
ON DELETE actions:
| Action | Behavior when the referenced row is deleted |
|---|---|
| RESTRICT | Reject the DELETE if any referencing row exists (default) |
| CASCADE | Delete all referencing rows automatically |
| SET NULL | Set the foreign key column to NULL |
| SET DEFAULT | Set the foreign key column to its DEFAULT value |
| NO ACTION | Same as RESTRICT but deferred to end of statement |
ON UPDATE actions: Same options as ON DELETE — apply when the referenced primary key is updated.
Current limitation: Only ON UPDATE RESTRICT (the default) is enforced. ON UPDATE CASCADE and ON UPDATE SET NULL return NotImplemented and are planned for Phase 6.10. Write ON UPDATE RESTRICT or omit the clause entirely for correct behaviour today.
CREATE TABLE order_items (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
order_id BIGINT NOT NULL
REFERENCES orders(id)
ON DELETE CASCADE
    ON UPDATE CASCADE,   -- note: currently returns NotImplemented (see limitation above)
product_id BIGINT NOT NULL
REFERENCES products(id)
ON DELETE RESTRICT
ON UPDATE RESTRICT,
quantity INT NOT NULL,
unit_price DECIMAL NOT NULL
);
CHECK
Validates that a condition is TRUE for every row. A row where the CHECK condition evaluates to FALSE or NULL is rejected.
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
price DECIMAL NOT NULL CHECK (price > 0),
stock INT NOT NULL CHECK (stock >= 0),
rating REAL CHECK (rating IS NULL OR (rating >= 1.0 AND rating <= 5.0))
);
Table-Level Constraints
Table constraints apply to multiple columns and are declared after all column definitions.
CREATE TABLE shipments (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
order_id BIGINT NOT NULL,
warehouse_id INT NOT NULL,
shipped_at TIMESTAMP,
delivered_at TIMESTAMP,
-- Named constraints (recommended for meaningful error messages)
CONSTRAINT fk_shipment_order
FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
CONSTRAINT chk_delivery_after_shipment
CHECK (delivered_at IS NULL OR delivered_at >= shipped_at),
CONSTRAINT uq_one_active_shipment
UNIQUE (order_id, warehouse_id)
);
IF NOT EXISTS
Suppresses the error when the table already exists. Useful in migration scripts.
CREATE TABLE IF NOT EXISTS config (
key TEXT NOT NULL UNIQUE,
value TEXT NOT NULL
);
Full Example — E-commerce Schema
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email TEXT NOT NULL UNIQUE,
name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
deleted_at TIMESTAMPTZ
);
CREATE TABLE categories (
id INT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
category_id INT NOT NULL REFERENCES categories(id),
name TEXT NOT NULL,
description TEXT,
price DECIMAL NOT NULL CHECK (price > 0),
stock INT NOT NULL DEFAULT 0 CHECK (stock >= 0),
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id BIGINT NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
total DECIMAL NOT NULL CHECK (total >= 0),
status TEXT NOT NULL DEFAULT 'pending',
placed_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
shipped_at TIMESTAMPTZ,
CONSTRAINT chk_order_status CHECK (
status IN ('pending', 'paid', 'shipped', 'delivered', 'cancelled')
)
);
CREATE TABLE order_items (
order_id BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
product_id BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price DECIMAL NOT NULL CHECK (unit_price > 0),
PRIMARY KEY (order_id, product_id)
);
CREATE INDEX
Indexes accelerate lookups and range scans. AxiomDB automatically creates a unique B+
Tree index for every PRIMARY KEY and UNIQUE constraint. Additional indexes are created
explicitly. CREATE INDEX works on both heap tables and clustered (PRIMARY KEY) tables.
Basic Syntax
CREATE [UNIQUE] INDEX [IF NOT EXISTS] index_name
ON table_name (column [ASC|DESC], ...)
[WITH (fillfactor = N)]
[WHERE condition];
fillfactor controls how full a B-Tree leaf page gets before splitting (10–100,
default 90). Lower values leave room for future inserts without triggering splits.
See Fill Factor for details.
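As arithmetic, a fillfactor of N means a leaf is allowed to fill to N% of the page before it splits. A rough sketch of the threshold for 16 KB pages (illustrative only: per-page header overhead is ignored, and the exact accounting inside AxiomDB may differ):

```python
PAGE_SIZE = 16 * 1024  # AxiomDB uses 16 KB pages

def split_threshold(fillfactor: int) -> int:
    """Used bytes at which a leaf page splits, given the fillfactor setting."""
    assert 10 <= fillfactor <= 100
    return PAGE_SIZE * fillfactor // 100

assert split_threshold(100) == 16384  # pack pages completely
assert split_threshold(90) == 14745   # default: ~1.6 KB of headroom
assert split_threshold(70) == 11468   # append-heavy: 30% headroom for inserts
```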
Examples
-- Standard index
CREATE INDEX idx_users_email ON users (email);
-- Composite index: queries filtering by (user_id, placed_at) benefit
CREATE INDEX idx_orders_user_date ON orders (user_id, placed_at DESC);
-- Unique index (equivalent to UNIQUE column constraint)
CREATE UNIQUE INDEX uq_products_sku ON products (sku);
-- Partial index: index only active products (reduces index size)
CREATE INDEX idx_active_products ON products (category_id)
WHERE deleted_at IS NULL;
-- Fill factor: append-heavy time-series table (leaves 30% free for inserts)
CREATE INDEX idx_ts ON events(created_at) WITH (fillfactor = 70);
-- Fill factor + partial index combined
CREATE UNIQUE INDEX uq_active_email ON users(email)
WITH (fillfactor = 80)
WHERE deleted_at IS NULL;
-- The WITH clause can appear before or after WHERE (both are accepted)
When to Add an Index
- Columns appearing in WHERE, JOIN ON, or ORDER BY clauses on large tables
- Foreign key columns (AxiomDB does not auto-index FK columns — add them explicitly)
- Columns used in range queries (BETWEEN, >, <)
See Indexes for the query planner interaction and composite index column ordering rules.
DROP TABLE
Removes a table and all its data permanently.
DROP TABLE [IF EXISTS] table_name [CASCADE | RESTRICT];
| Option | Behavior |
|---|---|
| RESTRICT | Fail if any other table has a foreign key referencing this table (default) |
| CASCADE | Also drop all foreign key constraints that reference this table |
-- Safe drop: fails if referenced by other tables
DROP TABLE products;
-- Drop without error if already gone
DROP TABLE IF EXISTS temp_import;
-- Drop even if referenced (removes FK constraints first)
DROP TABLE categories CASCADE;
Dropping a table is immediate and permanent. There is no RECYCLE BIN. Make sure you have a backup or are inside a transaction if you need to recover.
DROP INDEX
Removes an index. The table and its data are not affected.
DROP INDEX [IF EXISTS] index_name;
DROP INDEX idx_users_email;
DROP INDEX IF EXISTS idx_old_lookup;
ALTER TABLE
Modifies the structure of an existing table. All ALTER TABLE forms are blocking operations — no concurrent DDL is allowed while an ALTER TABLE is in progress.
Add Column
Adds a new column at the end of the column list. If existing rows are present,
they are rewritten to include the default value for the new column. If no
DEFAULT clause is given, existing rows receive NULL for that column.
ALTER TABLE table_name ADD COLUMN column_name data_type [NOT NULL] [DEFAULT expr];
-- Add a nullable column (existing rows get NULL)
ALTER TABLE users ADD COLUMN phone TEXT;
-- Add a NOT NULL column with a default (existing rows get 0)
ALTER TABLE orders ADD COLUMN priority INT NOT NULL DEFAULT 0;
-- Add a column with a string default
ALTER TABLE products ADD COLUMN status TEXT NOT NULL DEFAULT 'active';
A column with NOT NULL and no DEFAULT cannot be added to a non-empty table — existing rows would have no value to fill in and would violate the constraint. Provide a DEFAULT value, or add the column as nullable first and back-fill the data before adding the constraint.
Drop Column
Removes a column from the table. All existing rows are rewritten without the
dropped column’s value. The column name must exist unless IF EXISTS is used.
ALTER TABLE table_name DROP COLUMN column_name [IF EXISTS];
-- Remove a column (fails if the column does not exist)
ALTER TABLE users DROP COLUMN phone;
-- Remove a column only if it exists (idempotent, safe in migrations)
ALTER TABLE users DROP COLUMN phone IF EXISTS;
Dropping a column is permanent. The data stored in that column is discarded when rows are rewritten and cannot be recovered without a backup.
Dropping a column that is part of a UNIQUE index or a FOREIGN KEY is rejected with an error. Drop the index or constraint first, then drop the column. Dropping a PRIMARY KEY column is not allowed on clustered tables (the PK is the physical storage key).
Modify Column
Changes the data type or nullability of an existing column. All existing rows are rewritten, coercing their stored values to the new type.
ALTER TABLE table_name MODIFY COLUMN column_name new_type [NOT NULL];
-- Widen an integer column to 64 bits (existing values preserved)
ALTER TABLE events MODIFY COLUMN count BIGINT;
-- Convert integers to text (always safe, values become their decimal string)
ALTER TABLE codes MODIFY COLUMN code TEXT;
-- Add a NOT NULL constraint (fails if any row has NULL in that column)
ALTER TABLE orders MODIFY COLUMN status TEXT NOT NULL;
Rules and restrictions:
- Narrowing casts (e.g. BIGINT → INT, TEXT → INT) are applied with strict coercion. If any existing value cannot be represented in the new type, the statement fails and no rows are changed.
- A column that is part of a secondary index (UNIQUE or otherwise) cannot have its type changed. Drop the index first, modify the column, then recreate the index.
- The PRIMARY KEY column’s type cannot be changed on a clustered table.
- Changing nullability from nullable to NOT NULL is allowed only when every existing row has a non-NULL value for that column.
Rename Column
Renames an existing column. This is a catalog-only operation — no rows are rewritten because the positional encoding is not affected by column names.
ALTER TABLE table_name RENAME COLUMN old_name TO new_name;
-- Rename a column
ALTER TABLE users RENAME COLUMN full_name TO display_name;
-- Rename to fix a typo
ALTER TABLE orders RENAME COLUMN shiped_at TO shipped_at;
Rename Table
Renames the table itself. This is a catalog-only operation.
ALTER TABLE old_name RENAME TO new_name;
-- Rename during a refactoring
ALTER TABLE user_profiles RENAME TO profiles;
-- Rename a staging table after a migration
ALTER TABLE orders_import RENAME TO orders;
Rebuild To Clustered
Migrates a legacy heap table that already has PRIMARY KEY metadata into clustered storage.
ALTER TABLE table_name REBUILD;
Example:
-- After opening an older AxiomDB database where `users` is still heap-backed
ALTER TABLE users REBUILD;
Behavior:
- walks the existing PRIMARY KEY index in logical key order
- rebuilds the table into a clustered PRIMARY KEY tree
- rebuilds every non-primary index so it stores clustered PK bookmarks instead
of heap
RecordIds - swaps the catalog metadata atomically at the end of the statement
Common errors:
ALTER TABLE logs REBUILD;
-- ERROR 1105 (HY000): ALTER TABLE REBUILD requires a PRIMARY KEY on 'logs'
ALTER TABLE users REBUILD;
-- ERROR 1105 (HY000): table 'users' is already clustered
The rebuild follows the same pattern as PostgreSQL's CLUSTER and InnoDB's sorted-rebuild ideas: build the new clustered roots first, then swap catalog metadata. AxiomDB adds deferred free of the old heap/index pages so the metadata swap never races with page reclamation.
Not Yet Supported
The following ALTER TABLE forms are planned for Phase 4.22b and later:
- MODIFY COLUMN / ALTER COLUMN — changing a column’s data type
- ADD CONSTRAINT — adding a CHECK, UNIQUE, or FOREIGN KEY after table creation
- DROP CONSTRAINT — removing a named constraint
- Dropping columns that participate in a constraint
TRUNCATE TABLE
Removes all rows from a table without dropping its structure, and resets the
AUTO_INCREMENT counter to 1. The table schema, indexes, and constraints are
preserved.
TRUNCATE TABLE table_name;
-- Wipe a staging table before re-importing
TRUNCATE TABLE import_staging;
-- AUTO_INCREMENT is always reset after TRUNCATE
CREATE TABLE log_events (id INT AUTO_INCREMENT PRIMARY KEY, msg TEXT);
INSERT INTO log_events (msg) VALUES ('start'), ('end'); -- ids: 1, 2
TRUNCATE TABLE log_events;
INSERT INTO log_events (msg) VALUES ('restart'); -- id: 1
TRUNCATE returns Affected { count: 0 } (MySQL convention). See also
TRUNCATE TABLE in the DML reference for a comparison
with DELETE FROM table.
ANALYZE
Refreshes per-column statistics used by the query planner to choose between an index scan and a full table scan.
ANALYZE; -- all tables in the current schema
ANALYZE TABLE table_name; -- specific table, all indexed columns
ANALYZE TABLE table_name (col); -- specific table, one column only
ANALYZE computes exact row_count and NDV (number of distinct non-NULL
values) for each target column by scanning the full table. Results are stored
in the axiom_stats system catalog and are immediately available to the planner.
-- After a bulk import, refresh stats so the planner uses correct selectivity:
INSERT INTO products SELECT * FROM products_staging;
ANALYZE TABLE products;
-- Check a single column after targeted inserts:
ANALYZE TABLE orders (status);
See Index Statistics for how NDV and row_count affect query planning decisions.
DML — Queries and Mutations
DML statements read and modify table data: SELECT, INSERT, UPDATE, and DELETE.
All DML operations participate in the current transaction and are subject to MVCC
isolation.
SELECT
Full Syntax
SELECT [DISTINCT] select_list
FROM table_ref [AS alias]
[JOIN ...]
[WHERE condition]
[GROUP BY column_list]
[HAVING condition]
[ORDER BY column_list [ASC|DESC] [NULLS FIRST|LAST]]
[LIMIT n [OFFSET m]];
Basic Projections
-- All columns
SELECT * FROM users;
-- Specific columns with aliases
SELECT id, email AS user_email, name AS full_name
FROM users;
-- Computed columns
SELECT
name,
price * 1.19 AS price_with_tax,
UPPER(name) AS name_upper
FROM products;
DISTINCT
Removes duplicate rows from the result. Two rows are duplicates if every selected column has the same value (NULL = NULL for this purpose only).
-- All distinct status values in the orders table
SELECT DISTINCT status FROM orders;
-- All distinct (category_id, status) pairs
SELECT DISTINCT category_id, status FROM products ORDER BY category_id;
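This NULL-deduplication rule is standard SQL and can be checked in any conforming engine. A sketch using Python's sqlite3 as a stand-in for AxiomDB, which follows the same rule:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE t (x INT)")
cur.executemany("INSERT INTO t VALUES (?)", [(None,), (None,), (1,)])

# Two NULLs collapse into one row under DISTINCT (NULL = NULL here only)
rows = cur.execute("SELECT DISTINCT x FROM t ORDER BY x").fetchall()
assert rows == [(None,), (1,)]
```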
FROM and JOIN
Simple FROM
SELECT * FROM products;
SELECT p.* FROM products AS p WHERE p.price > 50;
INNER JOIN
Returns only rows where the join condition matches in both tables.
SELECT
u.name,
o.id AS order_id,
o.total,
o.status
FROM users u
INNER JOIN orders o ON o.user_id = u.id
WHERE o.status = 'shipped'
ORDER BY o.placed_at DESC;
LEFT JOIN
Returns all rows from the left table; columns from the right table are NULL when there is no matching row.
-- All users, including those with no orders
SELECT
u.id,
u.name,
COUNT(o.id) AS total_orders
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name
ORDER BY total_orders DESC;
RIGHT JOIN
Returns all rows from the right table; left table columns are NULL on no match. Less common — most RIGHT JOINs can be rewritten as LEFT JOINs by swapping tables.
SELECT p.name, SUM(oi.quantity) AS total_sold
FROM order_items oi
RIGHT JOIN products p ON p.id = oi.product_id
GROUP BY p.id, p.name;
FULL OUTER JOIN
Returns all rows from both tables. Matched rows are joined normally.
Unmatched rows from either side are padded with NULL on the missing side.
FULL OUTER JOIN is an AxiomDB extension over the MySQL wire protocol: MySQL itself does not support
FULL OUTER JOIN. Clients connecting to AxiomDB via the MySQL wire protocol can use it, but standard MySQL clients may not send it.
-- Audit: find users with no orders AND orders with no valid user
SELECT
u.id AS user_id,
u.name AS user_name,
o.id AS order_id,
o.total
FROM users u
FULL OUTER JOIN orders o ON u.id = o.user_id
ORDER BY u.id, o.id;
| user_id | user_name | order_id | total |
|---|---|---|---|
| 1 | Alice | 10 | 100 |
| 1 | Alice | 11 | 200 |
| 2 | Bob | 12 | 50 |
| 3 | Carol | NULL | NULL |
| NULL | NULL | 13 | 300 |
Both FULL JOIN and FULL OUTER JOIN are accepted.
ON vs WHERE semantics:
- ON predicates are evaluated before null-extension. Rows that do not satisfy ON are treated as unmatched and receive NULLs.
- WHERE predicates run after the full join is materialized. Adding WHERE u.id IS NOT NULL removes unmatched right rows from the result.
-- ON vs WHERE: only keep rows where the user side is not NULL
SELECT u.id, o.id
FROM users u
FULL OUTER JOIN orders o ON u.id = o.user_id
WHERE u.id IS NOT NULL; -- removes the (NULL, 13) row
Nullability: In SELECT * over a FULL OUTER JOIN, all columns from both
tables are marked nullable even if the catalog defines them as NOT NULL,
because either side can be null-extended.
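For tooling that must stay within plain MySQL syntax, the classic workaround is to emulate FULL OUTER JOIN with a LEFT JOIN unioned with the right side's unmatched rows. A sketch of the audit query above, using Python's sqlite3 as a stand-in engine:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE users (id INT, name TEXT)")
cur.execute("CREATE TABLE orders (id INT, user_id INT, total INT)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob"), (3, "Carol")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 100), (11, 1, 200), (12, 2, 50), (13, 99, 300)])

rows = cur.execute("""
    SELECT u.id, u.name, o.id, o.total        -- all users, with matches
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
    UNION ALL
    SELECT u.id, u.name, o.id, o.total        -- orders with no valid user
    FROM orders o LEFT JOIN users u ON u.id = o.user_id
    WHERE u.id IS NULL
""").fetchall()

assert (3, "Carol", None, None) in rows   # user with no orders survives
assert (None, None, 13, 300) in rows      # orphaned order survives
assert len(rows) == 5
```

The result matches the FULL OUTER JOIN table above row for row.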
CROSS JOIN
Cartesian product — every row from the left table combined with every row from
the right table. Use with care: m × n rows.
-- Generate all combinations of size and color for a product grid
SELECT sizes.label AS size, colors.label AS color
FROM sizes
CROSS JOIN colors
ORDER BY sizes.sort_order, colors.sort_order;
Multi-Table JOIN
SELECT
u.name AS customer,
p.name AS product,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN users u ON u.id = o.user_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
WHERE o.status = 'delivered'
ORDER BY o.placed_at DESC, p.name;
WHERE
Filters rows before aggregation. Accepts any boolean expression.
-- Equality and comparison
SELECT * FROM products WHERE price > 100 AND stock > 0;
-- NULL check
SELECT * FROM users WHERE deleted_at IS NULL;
SELECT * FROM orders WHERE shipped_at IS NOT NULL;
-- BETWEEN (inclusive on both ends)
SELECT * FROM orders
WHERE placed_at BETWEEN '2026-01-01' AND '2026-03-31';
-- IN list
SELECT * FROM orders WHERE status IN ('pending', 'paid', 'shipped');
-- LIKE pattern matching (% = any sequence, _ = exactly one character)
SELECT * FROM users WHERE email LIKE '%@example.com';
SELECT * FROM products WHERE name LIKE 'USB-_';
-- NOT variants
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
SELECT * FROM products WHERE name NOT LIKE 'Test%';
Subqueries
A subquery is a SELECT statement nested inside another statement. AxiomDB supports
five subquery forms, each with full NULL semantics identical to PostgreSQL and MySQL.
Scalar Subqueries
A scalar subquery appears anywhere an expression is valid (SELECT list, WHERE, HAVING,
ORDER BY). It must return exactly one column. If it returns zero rows, the result is
NULL. If it returns more than one row, AxiomDB raises CardinalityViolation
(SQLSTATE 21000).
-- Compare each product price against the overall average
SELECT
name,
price,
price - (SELECT AVG(price) FROM products) AS diff_from_avg
FROM products
ORDER BY diff_from_avg DESC;
-- Find the most recently placed order date
SELECT * FROM orders
WHERE placed_at = (SELECT MAX(placed_at) FROM orders);
-- Use a scalar subquery in HAVING
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > (SELECT AVG(cnt) FROM (SELECT COUNT(*) AS cnt FROM orders GROUP BY user_id) AS sub);
If the subquery returns more than one row, AxiomDB raises:
ERROR 21000: subquery must return exactly one row, but returned 3 rows
Use LIMIT 1 or a unique WHERE predicate to guarantee a single row.
IN Subquery
expr [NOT] IN (SELECT col FROM ...) tests whether a value appears in the set of
values produced by the subquery.
-- Orders for users who have placed more than 5 orders total
SELECT * FROM orders
WHERE user_id IN (
SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 5
);
-- Products never sold
SELECT * FROM products
WHERE id NOT IN (
SELECT DISTINCT product_id FROM order_items
);
NULL semantics — fully consistent with the SQL standard:
| Value in outer expr | Subquery result | Result |
|---|---|---|
| 'Alice' | contains 'Alice' | TRUE |
| 'Alice' | does not contain 'Alice', no NULLs | FALSE |
| 'Alice' | does not contain 'Alice', contains NULL | NULL |
| NULL | any non-empty set | NULL |
| NULL | empty set | FALSE |
The third row is the subtle case: x NOT IN (subquery with NULLs) returns NULL,
not FALSE. This means NOT IN combined with a subquery that may produce NULLs
can silently exclude rows. A safe alternative is NOT EXISTS.
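Since these semantics follow the SQL standard, the trap can be reproduced in any conforming engine. A sketch using Python's sqlite3:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# 2 NOT IN (1, NULL): (2=1) OR (2=NULL) is NULL, so the negation is NULL too
assert cur.execute("SELECT 2 NOT IN (1, NULL)").fetchone()[0] is None

# A NULL in the set never hides a genuine match
assert cur.execute("SELECT 1 IN (1, NULL)").fetchone()[0] == 1

# A NULL result in WHERE filters the row out — the "silently excluded" case
assert cur.execute("SELECT 1 WHERE 2 NOT IN (1, NULL)").fetchone() is None
```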
EXISTS / NOT EXISTS
[NOT] EXISTS (SELECT ...) tests whether the subquery produces at least one row.
The result is always TRUE or FALSE — never NULL.
-- Users who have at least one paid order
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.user_id = u.id AND o.status = 'paid'
);
-- Products with no associated order items
SELECT * FROM products p
WHERE NOT EXISTS (
SELECT 1 FROM order_items oi WHERE oi.product_id = p.id
);
The select list inside an EXISTS subquery does not matter — SELECT 1, SELECT *,
and SELECT id all behave identically. The engine only checks for row existence.
Correlated Subqueries
A correlated subquery references columns from the outer query. AxiomDB re-executes the subquery for each outer row, substituting the current outer column values.
-- For each order, fetch the user's name (correlated scalar subquery in SELECT list)
SELECT
o.id,
o.total,
(SELECT u.name FROM users u WHERE u.id = o.user_id) AS customer_name
FROM orders o;
-- Orders whose total exceeds the average total for that user (correlated in WHERE)
SELECT * FROM orders o
WHERE o.total > (
SELECT AVG(total) FROM orders WHERE user_id = o.user_id
);
-- Active products with above-average stock in their category
SELECT * FROM products p
WHERE p.stock > (
SELECT AVG(stock) FROM products WHERE category_id = p.category_id
);
Correlated subqueries with large outer result sets can be slow (O(n) re-executions). For performance-critical paths, rewrite them as JOINs with aggregation.
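A quick way to validate such a rewrite is to run both forms against the same data and compare. A sketch using Python's sqlite3 as a stand-in engine, rewriting the per-user-average query above as a JOIN against a derived table:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE orders (id INT, user_id INT, total REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 10), (2, 1, 30), (3, 2, 5), (4, 2, 5)])

# Correlated form: the subquery runs once per outer row
correlated = cur.execute("""
    SELECT id FROM orders o
    WHERE o.total > (SELECT AVG(total) FROM orders WHERE user_id = o.user_id)
""").fetchall()

# Rewrite: compute all per-user averages once, then join
rewritten = cur.execute("""
    SELECT o.id
    FROM orders o
    JOIN (SELECT user_id, AVG(total) AS avg_total
          FROM orders GROUP BY user_id) a ON a.user_id = o.user_id
    WHERE o.total > a.avg_total
""").fetchall()

assert correlated == rewritten == [(2,)]  # only order 2 beats its user's average
```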
Derived Tables (FROM Subquery)
A subquery in the FROM clause is called a derived table. It must have an alias.
AxiomDB materializes the derived table result in memory before executing the outer query.
-- Top spenders, computed as a subquery and then filtered
SELECT customer_name, total_spent
FROM (
SELECT u.name AS customer_name, SUM(o.total) AS total_spent
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.status = 'delivered'
GROUP BY u.id, u.name
) AS spending
WHERE total_spent > 500
ORDER BY total_spent DESC;
-- Percentile bucketing: compute rank in a subquery, filter in outer
SELECT *
FROM (
SELECT
id,
name,
price,
RANK() OVER (ORDER BY price DESC) AS price_rank
FROM products
) AS ranked
WHERE price_rank <= 10;
AxiomDB resolves IN (subquery) exactly as PostgreSQL and MySQL do: a non-matching lookup against a set that contains NULL returns NULL, not FALSE. This matches ISO SQL:2016 and avoids the "missing row" trap that catches developers when NOT IN is used against a nullable foreign key column. Every subquery form (scalar, IN, EXISTS, correlated, derived table) follows the same rules as PostgreSQL 15.
GROUP BY and HAVING
GROUP BY collapses rows with the same values in the specified columns into a single
output row. Aggregate functions operate over each group.
The executor chooses its aggregation strategy automatically, using O(1) memory per group. Unlike PostgreSQL, which requires a separate GroupAggregate plan node, AxiomDB selects the strategy transparently at execution time.
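One plausible strategy is a hash aggregate: a dictionary keyed by the GROUP BY columns, holding one running accumulator per group. A minimal sketch of that idea (illustrative only, not AxiomDB's actual executor code):

```python
from collections import defaultdict

def group_by_sum(rows, key_col, val_col):
    """Hash aggregation: one accumulator per distinct key, single pass."""
    groups = defaultdict(lambda: [0, 0.0])  # key -> [COUNT(*), SUM(val)]
    for row in rows:
        acc = groups[row[key_col]]
        acc[0] += 1
        acc[1] += row[val_col]
    return dict(groups)

orders = [(1, 100.0), (1, 200.0), (2, 50.0)]  # (user_id, total)
assert group_by_sum(orders, 0, 1) == {1: [2, 300.0], 2: [1, 50.0]}
```

Memory grows with the number of distinct groups, not the number of input rows.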
-- Orders per user
SELECT user_id, COUNT(*) AS order_count, SUM(total) AS revenue
FROM orders
GROUP BY user_id
ORDER BY revenue DESC;
-- Monthly revenue
SELECT
DATE_TRUNC('month', placed_at) AS month,
COUNT(*) AS orders,
SUM(total) AS revenue,
AVG(total) AS avg_order_value
FROM orders
WHERE status != 'cancelled'
GROUP BY DATE_TRUNC('month', placed_at)
ORDER BY month;
HAVING filters groups after aggregation (analogous to WHERE for rows).
-- Only users with more than 5 orders
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5
ORDER BY order_count DESC;
-- Only categories with average price above 50
SELECT category_id, AVG(price) AS avg_price
FROM products
WHERE deleted_at IS NULL
GROUP BY category_id
HAVING AVG(price) > 50;
ORDER BY
Sorts the result. Multiple columns are sorted left to right.
-- Descending by total, then ascending by name as tiebreaker
SELECT user_id, SUM(total) AS revenue
FROM orders
GROUP BY user_id
ORDER BY revenue DESC, user_id ASC;
NULLS FIRST / NULLS LAST
Controls where NULL values appear in the sort order.
-- Show NULL shipped_at rows at the bottom (unshipped orders last)
SELECT id, total, shipped_at
FROM orders
ORDER BY shipped_at ASC NULLS LAST;
-- Show most recent shipments first; unshipped at top
SELECT id, total, shipped_at
FROM orders
ORDER BY shipped_at DESC NULLS FIRST;
Default behavior: ASC sorts NULL last; DESC sorts NULL first (same as PostgreSQL).
LIMIT and OFFSET
-- First 10 rows
SELECT * FROM products ORDER BY name LIMIT 10;
-- Rows 11-20 (page 2 with page size 10)
SELECT * FROM products ORDER BY name LIMIT 10 OFFSET 10;
-- Common pagination pattern
SELECT * FROM products
ORDER BY created_at DESC
LIMIT 20 OFFSET 40; -- page 3 (0-indexed) of 20 items per page
For large offsets (> 10,000), consider keyset pagination instead:
WHERE id > :last_seen_id ORDER BY id LIMIT 20
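Both styles return identical pages; keyset simply replaces the skip-and-discard with an index seek on the last key seen. A comparison sketch using Python's sqlite3 as a stand-in engine (page size 2 for brevity):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO products VALUES (?, ?)",
                [(i, f"p{i}") for i in range(1, 8)])

# OFFSET pagination: page 2 with page size 2 (scans and discards 2 rows)
offset_page = cur.execute(
    "SELECT id FROM products ORDER BY id LIMIT 2 OFFSET 2").fetchall()

# Keyset pagination: seek past the last id seen on page 1 (id = 2)
keyset_page = cur.execute(
    "SELECT id FROM products WHERE id > ? ORDER BY id LIMIT 2", (2,)).fetchall()

assert offset_page == keyset_page == [(3,), (4,)]
```

The discard cost of OFFSET grows linearly with the page number; the keyset seek stays constant.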
INSERT
Heap inserts locate the table's tail page through a cached hint (HeapAppendHint). Repeated INSERTs in the same session no longer walk the
full chain from the root page on every row — the tail is resolved in one page read
and self-healed on mismatch. This eliminates the O(N²) degradation seen at 100K+
rows in a single session.
INSERT … VALUES
Tables whose schema has an explicit PRIMARY KEY now use clustered storage for
SQL-visible INSERT, SELECT, UPDATE, and DELETE. The clustered SQL path now supports:
- single-row VALUES
- multi-row VALUES
- INSERT ... SELECT
- AUTO_INCREMENT
- explicit transactions and savepoints
- SELECT full scans over clustered leaves
- SELECT PK point lookups and PK range scans
- SELECT secondary lookups through PK bookmarks stored in the secondary key
- UPDATE in-place rewrite when the row still fits in the owning leaf
- UPDATE relocation fallback when the row grows and must be rewritten structurally
- UPDATE through PK predicates or secondary bookmark probes with transaction rollback/savepoint safety
- DELETE through PK predicates, PK ranges, secondary bookmark probes, or full clustered scans
- DELETE rollback/savepoint restore through exact clustered row images in WAL
Current clustered boundary after 39.18:
- clustered DELETE is still delete-mark first, and clustered VACUUM table performs the later physical purge
- clustered VACUUM table now frees overflow chains and dead secondary bookmark entries
- clustered child-table foreign-key enforcement still remains future work
The clustered layout tracks SQLite's WITHOUT ROWID tables more closely than a heap-first compatibility layer would.
When a table has an AUTO_INCREMENT column, omit it from the column list and
AxiomDB generates the next sequential ID automatically. Use LAST_INSERT_ID()
(or the PostgreSQL alias lastval()) immediately after the INSERT to retrieve
the generated value.
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL
);
-- Single row — id is generated automatically
INSERT INTO users (name) VALUES ('Alice');
-- id=1
SELECT LAST_INSERT_ID(); -- returns 1
For multi-row INSERT, LAST_INSERT_ID() returns the ID generated for the
first row of the batch (MySQL semantics). Subsequent rows receive
consecutive IDs.
For bulk loading with many consecutive INSERT ... VALUES statements,
wrap them in BEGIN ... COMMIT. AxiomDB stages consecutive INSERTs
for the same table inside the transaction and flushes them together on
COMMIT or the next barrier statement.
INSERT INTO users (name) VALUES ('Bob'), ('Carol'), ('Dave');
-- ids: 2, 3, 4
SELECT LAST_INSERT_ID(); -- returns 2 (first of the batch)
Supplying an explicit non-NULL value in the AUTO_INCREMENT column bypasses the sequence and does not advance it.
INSERT INTO users (id, name) VALUES (100, 'Eve');
-- id=100; sequence not advanced; next LAST_INSERT_ID() still returns 2
The same AUTO_INCREMENT contract now applies to clustered explicit-PK tables:
AxiomDB bootstraps the next value by scanning the clustered rows for the
current maximum instead of falling back to heap metadata.
See Expressions — Session Functions for
full LAST_INSERT_ID() / lastval() semantics.
-- Single row
INSERT INTO users (name, email, age)
VALUES ('Alice', 'alice@example.com', 30);
-- Multiple rows in one statement (more efficient than individual INSERTs)
INSERT INTO products (name, price, stock) VALUES
('Keyboard', 49.99, 100),
('Mouse', 29.99, 200),
('Monitor', 299.99, 50);
INSERT … DEFAULT VALUES
Inserts a single row using all column defaults. Useful when every column has a default.
CREATE TABLE audit_events (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
event_type TEXT NOT NULL DEFAULT 'unknown'
);
INSERT INTO audit_events DEFAULT VALUES;
-- Row: id=1, created_at=<now>, event_type='unknown'
INSERT … SELECT
Inserts rows generated by a SELECT statement. Useful for bulk copies and migrations.
-- Copy all active users to an archive table
INSERT INTO users_archive (id, email, name, created_at)
SELECT id, email, name, created_at
FROM users
WHERE deleted_at IS NOT NULL;
-- Compute and store aggregates
INSERT INTO monthly_revenue (month, total)
SELECT
DATE_TRUNC('month', placed_at),
SUM(total)
FROM orders
WHERE status = 'delivered'
GROUP BY 1;
UPDATE
Modifies existing rows. All matching rows are updated in a single statement.
UPDATE table_name
SET column = expression [, column = expression ...]
[WHERE condition];
-- Mark a specific order as shipped
UPDATE orders
SET status = 'shipped', shipped_at = CURRENT_TIMESTAMP
WHERE id = 42;
-- Apply a 10% discount to all products in a category
UPDATE products
SET price = price * 0.90
WHERE category_id = 5 AND deleted_at IS NULL;
-- Reset all pending orders older than 7 days to cancelled
UPDATE orders
SET status = 'cancelled'
WHERE status = 'pending'
AND placed_at < CURRENT_TIMESTAMP - INTERVAL '7 days';
An UPDATE without a WHERE clause updates every row in the table. This is rarely what you want. Always double-check before running unbounded updates in production.
DELETE
Removes rows from a table.
DELETE FROM table_name [WHERE condition];
-- Delete a specific row
DELETE FROM sessions WHERE id = 'abc123';
-- Delete all expired sessions
DELETE FROM sessions WHERE expires_at < CURRENT_TIMESTAMP;
-- Soft delete pattern (prefer UPDATE to mark rows inactive)
UPDATE users SET deleted_at = CURRENT_TIMESTAMP WHERE id = 7;
-- Then filter: SELECT * FROM users WHERE deleted_at IS NULL;
DELETE FROM t without a WHERE clause uses a root-rotation fast path
instead of per-row B-Tree deletes. New empty heap and index roots are allocated,
the catalog is updated atomically inside the transaction, and old pages are freed
only after WAL fsync confirms commit durability. This eliminates the 10,000× slowdown
that previously occurred when a table had any index (PK, UNIQUE, or secondary).
The operation is fully transactional: ROLLBACK restores original roots.
When parent FK references exist, DELETE FROM t keeps the row-by-row path so
RESTRICT/CASCADE/SET NULL FK enforcement still fires correctly.
DELETE ... WHERE col = value or WHERE col > lo uses an available index
to discover candidate rows instead of scanning the full heap. Unlike SELECT,
which may reject an index when selectivity is too low, the planner always
prefers the index for DELETE, since avoiding a heap scan pays off even when many rows match.
The full WHERE predicate is rechecked on fetched rows before deletion.
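The index-then-recheck pattern can be sketched as follows. This is an illustrative model of the behavior described above, not AxiomDB's executor code; the table, index, and function names are invented.

```python
# Sketch of the index-assisted DELETE path: a secondary index narrows the
# candidate set, then the full WHERE predicate is rechecked per row before
# the row is actually deleted.

rows = {  # row_id -> row
    1: {"status": "pending", "total": 10},
    2: {"status": "pending", "total": 500},
    3: {"status": "paid",    "total": 20},
}
# Secondary index on status: value -> set of row ids
status_index = {"pending": {1, 2}, "paid": {3}}

def delete_where(index_value, predicate):
    """DELETE ... WHERE status = index_value AND predicate(row)."""
    candidates = status_index.get(index_value, set())
    deleted = []
    for rid in sorted(candidates):
        if predicate(rows[rid]):      # recheck the full predicate
            deleted.append(rid)
    for rid in deleted:
        del rows[rid]
        status_index[index_value].discard(rid)
    return len(deleted)

# DELETE FROM orders WHERE status = 'pending' AND total > 100;
n = delete_where("pending", lambda r: r["status"] == "pending" and r["total"] > 100)
print(n)   # 1 — only row 2 satisfied the full predicate
```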
TRUNCATE TABLE
Removes all rows from a table and resets its AUTO_INCREMENT counter to 1.
The table structure, indexes, and constraints are preserved.
TRUNCATE TABLE table_name;
-- Empty a staging table before a fresh import
TRUNCATE TABLE import_staging;
-- After truncate, AUTO_INCREMENT restarts from 1
CREATE TABLE counters (id INT AUTO_INCREMENT PRIMARY KEY, label TEXT);
INSERT INTO counters (label) VALUES ('a'), ('b'); -- ids: 1, 2
TRUNCATE TABLE counters;
INSERT INTO counters (label) VALUES ('c'); -- id: 1 (reset)
TRUNCATE TABLE returns Affected { count: 0 }, matching MySQL convention.
TRUNCATE vs DELETE — when to use each:
| | DELETE FROM t | TRUNCATE TABLE t |
|---|---|---|
| Rows removed | All (without WHERE) | All |
| WHERE clause | Supported | Not supported |
| AUTO_INCREMENT | Not reset | Reset to 1 |
| Rows affected | Returns actual count | Returns 0 |
| FK parent table | Row-by-row (enforces FK) | Fails if child FKs exist |
| Typical use | Conditional deletes | Full table wipe |
TRUNCATE TABLE fails with an error if any FK constraint references the table as
the parent. Delete or truncate child tables first, then truncate the parent.
Both DELETE FROM t (no WHERE) and TRUNCATE TABLE t use the same bulk-empty
root-rotation machinery internally and are fully transactional.
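The root-rotation idea described above can be sketched as a pointer swap. This is a hypothetical illustration of the concept, with invented names; AxiomDB's actual storage code additionally journals the change in the WAL and defers freeing old pages until after commit.

```python
# Sketch of bulk-empty "root rotation": instead of deleting rows one by one,
# the table's heap and index roots are swapped to fresh empty roots inside
# the transaction; ROLLBACK simply restores the original roots.

class Table:
    def __init__(self):
        self.heap_root = {"rows": ["r1", "r2", "r3"]}
        self.index_root = {"keys": [1, 2, 3]}

def truncate(table):
    """Swap in empty roots; return an undo closure (COMMIT drops it)."""
    old_heap, old_index = table.heap_root, table.index_root
    table.heap_root = {"rows": []}     # allocate new empty roots
    table.index_root = {"keys": []}
    def rollback():
        table.heap_root, table.index_root = old_heap, old_index
    return rollback

t = Table()
undo = truncate(t)
print(len(t.heap_root["rows"]))   # 0 — table is empty inside the txn
undo()                            # ROLLBACK restores the original roots
print(len(t.heap_root["rows"]))   # 3
```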
Session Variables
Session variables hold connection-scoped state. Read them with SELECT @@name and
change them with SET name = value.
Reading session variables
SELECT @@autocommit; -- 1 (autocommit on) or 0 (autocommit off)
SELECT @@in_transaction; -- 1 inside an active transaction, 0 otherwise
SELECT @@version; -- '8.0.36-AxiomDB-0.1.0'
SELECT @@character_set_client; -- 'utf8mb4'
SELECT @@transaction_isolation; -- 'REPEATABLE-READ'
Supported variables
| Variable | Default | Description |
|---|---|---|
@@autocommit | 1 | 1 = each statement auto-committed; 0 = explicit COMMIT required |
@@axiom_compat | 'standard' | Compatibility mode — controls default session collation (see AXIOM_COMPAT) |
@@collation | 'binary' | Executor-visible text semantics — binary or es (see AXIOM_COMPAT) |
@@in_transaction | 0 | 1 when inside an active transaction, 0 otherwise |
@@on_error | 'rollback_statement' | How statement errors affect the transaction (see ON_ERROR) |
@@version | '8.0.36-AxiomDB-0.1.0' | Server version (MySQL 8 compatible format) |
@@version_comment | 'AxiomDB' | Server variant |
@@character_set_client | 'utf8mb4' | Client character set |
@@character_set_results | 'utf8mb4' | Result character set |
@@collation_connection | 'utf8mb4_general_ci' | Connection collation |
@@max_allowed_packet | 67108864 | Maximum packet size (64 MB) |
@@sql_mode | 'STRICT_TRANS_TABLES' | Active SQL mode (see Strict Mode) |
@@strict_mode | 'ON' | AxiomDB strict coercion flag (alias for STRICT_TRANS_TABLES in sql_mode) |
@@transaction_isolation | 'REPEATABLE-READ' | Isolation level |
Changing session variables
-- Switch to manual transaction mode (used by SQLAlchemy, Django ORM, etc.)
SET autocommit = 0;
SET autocommit = 1; -- restore
-- Character set (accepted for ORM compatibility, utf8mb4 is always used internally)
SET NAMES 'utf8mb4';
SET character_set_client = 'utf8mb4';
-- Control coercion strictness (see Strict Mode below)
SET strict_mode = OFF;
SET sql_mode = '';
@@in_transaction — transaction state check
SELECT @@in_transaction; -- 0 — no transaction active
INSERT INTO t VALUES (1); -- starts implicit txn when autocommit=0
SELECT @@in_transaction; -- 1 — inside transaction
COMMIT;
SELECT @@in_transaction; -- 0 — transaction closed
Use @@in_transaction to verify transaction state before issuing a COMMIT or
ROLLBACK. This avoids the warning generated when COMMIT is called with no
active transaction.
AXIOM_COMPAT and collation
@@axiom_compat controls the high-level compatibility behavior of the session.
@@collation controls how text values are compared, sorted, and grouped.
SET AXIOM_COMPAT = 'mysql'; -- CI+AI text semantics (default collation = 'es')
SET AXIOM_COMPAT = 'postgresql'; -- exact binary text semantics
SET AXIOM_COMPAT = 'standard'; -- default AxiomDB behavior (binary)
SET AXIOM_COMPAT = DEFAULT; -- reset to 'standard'
SET collation = 'es'; -- explicit CI+AI fold for this session
SET collation = 'binary'; -- explicit exact byte order
SET collation = DEFAULT; -- restore compat-derived default
binary collation (default)
Exact byte-order string comparison — current AxiomDB default:
- 'a' != 'A', 'a' != 'á'
- LIKE is case-sensitive and accent-sensitive
- GROUP BY, DISTINCT, ORDER BY, and MIN/MAX(TEXT) all use raw byte order
es collation — CI+AI fold
A lightweight session-level CI+AI fold: NFC normalize → lowercase → strip combining accent marks. No ICU / CLDR dependency.
- 'Jose' = 'JOSE' = 'José' compare equal
- LIKE 'jos%' matches José
- GROUP BY, DISTINCT, and COUNT(DISTINCT ...) collapse accent/case variants into one group
- ORDER BY sorts by folded text first, with raw text as a tie-break for determinism
- MIN/MAX(TEXT) and GROUP_CONCAT(DISTINCT/ORDER BY ...) respect the fold
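The fold pipeline described above (NFC normalize → lowercase → strip combining accent marks) can be reproduced with Python's standard library. This is a sketch of the semantics, not AxiomDB's Rust implementation.

```python
# CI+AI fold as described for the 'es' collation: NFC normalize, lowercase,
# decompose to NFD so accents become separate combining marks, drop the marks.
import unicodedata

def es_fold(s: str) -> str:
    s = unicodedata.normalize("NFC", s).lower()
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(es_fold("José"))                      # 'jose'
print(es_fold("Jose") == es_fold("JOSÉ"))   # True

# GROUP BY under the fold collapses case/accent variants into one group:
names = ["José", "jose", "JOSE"]
groups = {es_fold(n) for n in names}
print(groups)                               # {'jose'}
```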
-- Binary (default): José and jose are different rows
SELECT name FROM users GROUP BY name;
-- → 'José', 'jose', 'JOSE'
-- Es: all three fold to "jose" — one group
SET AXIOM_COMPAT = 'mysql';
SELECT name FROM users GROUP BY name;
-- → 'José' (or whichever variant appears first)
-- Explicit collation independent of compat mode:
SET collation = 'es';
SELECT * FROM products WHERE name = 'widget'; -- matches Widget, WIDGET, wídget
Index safety: When @@collation = 'es', AxiomDB automatically falls back from text
index lookups to full table scans for correctness. Binary-ordered B-Tree keys do not match
es-folded predicates, so using the index would silently miss rows. Non-text indexes
(INT, BIGINT, DATE, etc.) are unaffected.
Note:
@@collation and @@collation_connection are separate variables. @@collation_connection is the transport charset (set during handshake or via SET NAMES). @@collation is the executor-visible text-comparison behavior added by AXIOM_COMPAT.
Full layered collation (per-database, per-column, ICU locale) is planned for Phase 13.13.
ON_ERROR
@@on_error controls what happens to the current transaction when a statement
fails. It applies to all pipeline stages: parse errors, semantic errors, and
executor errors.
SET on_error = 'rollback_statement'; -- default
SET on_error = 'rollback_transaction';
SET on_error = 'savepoint';
SET on_error = 'ignore';
SET on_error = DEFAULT; -- reset to rollback_statement
Both quoted strings and bare identifiers are accepted:
SET on_error = rollback_statement; -- same as 'rollback_statement'
Modes
rollback_statement (default) — When a statement fails inside an active
transaction, only that statement’s writes are rolled back. The transaction stays
open. This matches MySQL’s statement-level rollback behavior.
BEGIN;
INSERT INTO t VALUES (1); -- ok
INSERT INTO t VALUES (1); -- ERROR: duplicate key
-- transaction still active, id=1 is the only write that will commit
INSERT INTO t VALUES (2); -- ok
COMMIT; -- commits id=1 and id=2
rollback_transaction — When any statement fails inside an active transaction,
the entire transaction is rolled back immediately. @@in_transaction becomes 0.
SET on_error = 'rollback_transaction';
BEGIN;
INSERT INTO t VALUES (1); -- ok
INSERT INTO t VALUES (1); -- ERROR: duplicate key → whole txn rolled back
SELECT @@in_transaction; -- 0 — transaction is gone
PostgreSQL, by contrast, keeps the failed transaction open and returns ERROR: current transaction is aborted until the client sends ROLLBACK. AxiomDB's rollback_transaction uses eager rollback instead: the transaction is closed immediately on error, so the client starts fresh without needing an explicit ROLLBACK.
savepoint — Same as rollback_statement when a transaction is already
active. When autocommit = 0, the key difference appears on the first DML
in an implicit transaction: savepoint preserves the implicit transaction after
a failing first DML, while rollback_statement closes it.
SET autocommit = 0;
SET on_error = 'savepoint';
INSERT INTO t VALUES (999); -- fails (dup key)
SELECT @@in_transaction; -- 1 — implicit txn stays open
INSERT INTO t VALUES (1); -- ok, continues in the same txn
COMMIT;
ignore — Ignorable SQL errors (parse errors, semantic errors, constraint
violations, type mismatches) are converted to session warnings and the statement
is reported as success. Non-ignorable errors (I/O failures, WAL errors, storage
corruption) still return ERR; if one happens inside an active transaction,
AxiomDB eagerly rolls that transaction back before returning the error.
SET on_error = 'ignore';
INSERT INTO t VALUES (1); -- ok
INSERT INTO t VALUES (1); -- duplicate key → silently ignored
SHOW WARNINGS; -- shows code 1062 + original message
INSERT INTO t VALUES (2); -- ok, continues
COMMIT; -- commits id=1 and id=2
In a multi-statement COM_QUERY, ignore continues executing later statements
after an ignored error.
-- Single COM_QUERY with three statements:
INSERT INTO t VALUES (1); INSERT INTO t VALUES (1); INSERT INTO t VALUES (2);
-- First succeeds, second is ignored (dup), third succeeds.
-- Only the ignored statement's OK packet carries warning_count > 0.
Inspecting the current mode
SELECT @@on_error; -- 'rollback_statement'
SELECT @@session.on_error; -- same
SHOW VARIABLES LIKE 'on_error'; -- on_error | rollback_statement
COM_RESET_CONNECTION resets @@on_error to rollback_statement.
Strict Mode
AxiomDB operates in strict mode by default. In strict mode, an INSERT or
UPDATE that cannot coerce a value to the column’s declared type returns an error
immediately (SQLSTATE 22018). This prevents silent data corruption.
CREATE TABLE products (name TEXT, stock INT);
-- Strict mode (default): error on bad coercion
INSERT INTO products VALUES ('Widget', 'abc');
-- ERROR 22018: cannot coerce 'abc' (Text) to INT
To enable permissive mode, disable strict mode for the session:
SET strict_mode = OFF;
-- or equivalently:
SET sql_mode = '';
In permissive mode, AxiomDB first tries the strict coercion. If it fails, it
falls back to a best-effort conversion (e.g. '42abc' → 42, 'abc' → 0),
stores the result, and emits warning 1265 instead of returning an error:
SET strict_mode = OFF;
CREATE TABLE products (name TEXT, stock INT);
INSERT INTO products VALUES ('Widget', '99abc');
-- Succeeds — stock stored as 99; warning emitted
SHOW WARNINGS;
-- Level Code Message
-- ─────────────────────────────────────────────────────────────────────
-- Warning 1265 Data truncated for column 'stock' at row 1
For multi-row INSERT, the row number in warning 1265 is 1-based and identifies the specific row that triggered the fallback:
INSERT INTO products VALUES ('A', '10'), ('B', '99x'), ('C', '30');
SHOW WARNINGS;
-- Warning 1265 Data truncated for column 'stock' at row 2
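The strict-then-fallback coercion described above can be sketched in Python. This is an illustrative model only — the regex-based prefix parse and the function name are assumptions, not AxiomDB's actual coercion code.

```python
# Strict mode: a bad coercion is an immediate error (SQLSTATE 22018).
# Permissive mode: fall back to a best-effort numeric-prefix conversion
# ('42abc' -> 42, 'abc' -> 0) and attach a 1265-style warning.
import re

def coerce_int(text, strict=True):
    """Return (value, warning); raise ValueError in strict mode."""
    try:
        return int(text), None                      # strict coercion
    except ValueError:
        if strict:
            raise ValueError(f"22018: cannot coerce {text!r} to INT")
        m = re.match(r"\s*[+-]?\d+", text)          # best-effort prefix
        value = int(m.group()) if m else 0
        return value, f"1265: Data truncated ({text!r} -> {value})"

print(coerce_int("42"))                   # (42, None)
print(coerce_int("99abc", strict=False))  # (99, '1265: ...')
print(coerce_int("abc", strict=False))    # (0, '1265: ...')
```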
Re-enable strict mode at any time:
SET strict_mode = ON;
-- or equivalently:
SET sql_mode = 'STRICT_TRANS_TABLES';
SET strict_mode = DEFAULT also restores the server default (ON).
Some MySQL clients and ORMs set sql_mode = '' at connection time to get
MySQL 5 permissive behavior. AxiomDB supports this pattern:
SET sql_mode = '' disables strict mode for that connection. Use
SHOW WARNINGS after bulk loads to audit truncated values.
SHOW WARNINGS
After any statement that completes with warnings, query the warning list:
-- Warning from no-op COMMIT
COMMIT; -- no active transaction — emits warning 1592
SHOW WARNINGS;
-- Level Code Message
-- ───────────────────────────────────────────────
-- Warning 1592 There is no active transaction
-- Warning from permissive coercion (strict_mode = OFF)
SET strict_mode = OFF;
INSERT INTO products VALUES ('Widget', '99abc');
SHOW WARNINGS;
-- Level Code Message
-- ─────────────────────────────────────────────────────────────────────
-- Warning 1265 Data truncated for column 'stock' at row 1
SHOW WARNINGS returns the warnings from the most recent statement only. The
list is cleared before each new statement executes.
| Warning Code | Condition |
|---|---|
1265 | Permissive coercion fallback: value was truncated/converted to fit the column type |
1592 | COMMIT or ROLLBACK issued with no active transaction |
SHOW TABLES
Lists all tables in the current schema (or a named schema).
SHOW TABLES;
SHOW TABLES FROM schema_name;
The result set has a single column named Tables_in_<schema>:
SHOW TABLES;
-- Tables_in_public
-- ────────────────
-- users
-- orders
-- products
-- order_items
SHOW COLUMNS / DESCRIBE
Returns the column definitions of a table.
SHOW COLUMNS FROM table_name;
DESCRIBE table_name;
DESC table_name; -- shorthand
All three forms are equivalent. The result has six columns:
| Column | Description |
|---|---|
Field | Column name |
Type | Data type as declared in CREATE TABLE |
Null | YES if the column accepts NULL, NO otherwise |
Key | PRI for primary key columns; empty otherwise (stub) |
Default | Default expression, or NULL if none (stub) |
Extra | auto_increment for AUTO_INCREMENT columns; empty otherwise |
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
bio TEXT
);
DESCRIBE users;
-- Field Type Null Key Default Extra
-- ─────────────────────────────────────────────────
-- id BIGINT NO PRI NULL auto_increment
-- name TEXT NO NULL
-- bio TEXT YES NULL
The Key and Default columns are stubs in the current release and do not yet reflect all constraints or computed defaults. Full metadata is tracked internally in the catalog and will be exposed in a future release.
Practical Examples — E-commerce Queries
Checkout: Atomic Order Placement
BEGIN;
-- Verify stock before committing
SELECT stock FROM products WHERE id = 1 AND stock >= 2;
-- If no row returned, rollback
INSERT INTO orders (user_id, total, status)
VALUES (99, 99.98, 'paid');
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (LAST_INSERT_ID(), 1, 2, 49.99);
UPDATE products SET stock = stock - 2 WHERE id = 1;
COMMIT;
Revenue Report — Last 30 Days
SELECT
p.name AS product,
SUM(oi.quantity) AS units_sold,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o ON o.id = oi.order_id
JOIN products p ON p.id = oi.product_id
WHERE o.placed_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
AND o.status IN ('paid', 'shipped', 'delivered')
GROUP BY p.id, p.name
ORDER BY revenue DESC
LIMIT 10;
User Activity Summary
SELECT
u.id,
u.name,
u.email,
COUNT(o.id) AS total_orders,
SUM(o.total) AS lifetime_value,
MAX(o.placed_at) AS last_order
FROM users u
LEFT JOIN orders o ON o.user_id = u.id AND o.status != 'cancelled'
WHERE u.deleted_at IS NULL
GROUP BY u.id, u.name, u.email
ORDER BY lifetime_value DESC NULLS LAST;
Multi-Statement Queries
AxiomDB accepts multiple SQL statements separated by ; in a single COM_QUERY
call. Each statement executes sequentially, and the client receives one result set
per statement.
-- Three statements in one call
CREATE TABLE IF NOT EXISTS sessions (
id UUID NOT NULL,
user_id INT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO sessions (id, user_id) VALUES (gen_random_uuid(), 42);
SELECT COUNT(*) FROM sessions WHERE user_id = 42;
How it works (protocol):
Each intermediate result set is sent with the SERVER_MORE_RESULTS_EXISTS flag
(0x0008) set in the EOF/OK status bytes, telling the client to read the next
result set. The final result set has the flag cleared.
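The flag check above is a simple bitwise test. A minimal sketch of what a client does with the status bytes (illustrative; the variable values are invented example flags):

```python
# A client keeps reading result sets while SERVER_MORE_RESULTS_EXISTS
# (0x0008) is set in the OK/EOF status flags; the final result set clears it.
SERVER_MORE_RESULTS_EXISTS = 0x0008

def more_results(status_flags: int) -> bool:
    return bool(status_flags & SERVER_MORE_RESULTS_EXISTS)

# Three statements in one COM_QUERY: the first two packets carry the flag,
# the last one clears it (0x0002 here is SERVER_STATUS_AUTOCOMMIT).
statuses = [0x000A, 0x0008, 0x0002]
print([more_results(s) for s in statuses])   # [True, True, False]
```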
Behavior on error:
If any statement fails, execution stops at that point and an error packet is sent. Statements after the failing one are not executed.
-- If INSERT fails (e.g. UNIQUE violation), SELECT is not executed
INSERT INTO users (email) VALUES ('duplicate@example.com');
SELECT * FROM users WHERE email = 'duplicate@example.com';
The mysql CLI, pymysql, and most ORMs handle multi-statement results
automatically once the client capability flag CLIENT_MULTI_STATEMENTS is set;
the mysql CLI enables it by default, while some libraries require opting in at
connect time.
ALTER TABLE — Constraints
ADD CONSTRAINT UNIQUE
-- Named unique constraint (recommended for DROP CONSTRAINT later)
ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);
-- Anonymous unique constraint (auto-named)
ALTER TABLE users ADD UNIQUE (username);
ADD CONSTRAINT UNIQUE creates a unique index internally. Fails with
IndexAlreadyExists if a constraint/index with that name already exists on the table,
or UniqueViolation if the column already has duplicate values.
ADD CONSTRAINT CHECK
ALTER TABLE orders ADD CONSTRAINT chk_positive_amount CHECK (amount > 0);
ALTER TABLE products ADD CONSTRAINT chk_stock CHECK (stock >= 0);
The CHECK expression is validated against all existing rows at the time of the
ALTER TABLE. If any row fails the check, the statement returns CheckViolation.
After the constraint is added, every subsequent INSERT and UPDATE on the table
evaluates the expression.
DROP CONSTRAINT
-- Drop by name (works for both UNIQUE and CHECK constraints)
ALTER TABLE users DROP CONSTRAINT uq_users_email;
-- Silent no-op if the constraint does not exist
ALTER TABLE users DROP CONSTRAINT IF EXISTS uq_users_old;
DROP CONSTRAINT searches first in indexes (for UNIQUE constraints), then in the
named constraint catalog (for CHECK constraints).
ADD CONSTRAINT FOREIGN KEY (Phase 6.5)
Adds a foreign key constraint after the table is created. Validates all existing rows before persisting — fails if any existing value violates the new constraint.
ALTER TABLE orders
ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;
Fails if any existing user_id value has no matching row in users.
Limitations
-- Not yet supported:
ALTER TABLE users ADD CONSTRAINT pk_users PRIMARY KEY (id);
-- → NotImplemented: ADD CONSTRAINT PRIMARY KEY — requires full table rewrite
Prepared Statements — Binary Protocol
AxiomDB supports the full MySQL binary prepared statement protocol, including
large parameter transmission via COM_STMT_SEND_LONG_DATA.
Large parameters (BLOB / TEXT)
When a parameter value is too large to send in a single COM_STMT_EXECUTE
packet, client libraries split it into multiple COM_STMT_SEND_LONG_DATA
chunks before execute. AxiomDB buffers all chunks and assembles the final value
at execute time.
Python (PyMySQL):
import pymysql, os
conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", db="test")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS files (id INT, data LONGBLOB)")
# PyMySQL automatically uses COM_STMT_SEND_LONG_DATA for values > 8 KB
large_blob = os.urandom(64 * 1024) # 64 KB binary data
cur.execute("INSERT INTO files VALUES (%s, %s)", (1, large_blob))
conn.commit()
Binary parameters (BLOB, LONGBLOB, MEDIUMBLOB, TINYBLOB) are stored
as raw bytes — 0x00 bytes and non-UTF-8 sequences are preserved exactly.
Text parameters (VARCHAR, TEXT, LONGTEXT) are decoded with the
connection’s character_set_client after all chunks are assembled, so multibyte
characters split across chunk boundaries are reconstructed correctly.
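The buffer-then-decode behavior described above is what makes chunk-boundary splits safe. A small Python demonstration (illustrative; the payload and chunk sizes are invented):

```python
# Long-data chunks are buffered as raw bytes and decoded only once, at
# execute time, so a multibyte character split across chunk boundaries
# is reconstructed correctly.
payload = "héllo wörld".encode("utf-8")
# Split mid-character: byte index 2 falls inside the two-byte 'é' sequence.
chunks = [payload[:2], payload[2:7], payload[7:]]

buffer = bytearray()
for chunk in chunks:              # one COM_STMT_SEND_LONG_DATA per chunk
    buffer.extend(chunk)

# Decoding each chunk separately would fail on the split 'é';
# decoding the assembled buffer succeeds:
print(bytes(buffer).decode("utf-8"))   # 'héllo wörld'
```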
Parameter type mapping
| MySQL type | AxiomDB type | Notes |
|---|---|---|
MYSQL_TYPE_STRING / VAR_STRING / VARCHAR | TEXT | UTF-8 decoded |
MYSQL_TYPE_BLOB / TINY_BLOB / MEDIUM_BLOB / LONG_BLOB | BYTES | Raw bytes, no charset |
MYSQL_TYPE_LONG / LONGLONG | INT / BIGINT | |
MYSQL_TYPE_FLOAT / DOUBLE | REAL | |
MYSQL_TYPE_DATE | DATE | |
MYSQL_TYPE_DATETIME | TIMESTAMP |
COM_STMT_RESET
Calling mysql_stmt_reset() (or the equivalent in any MySQL driver) clears any
pending long-data buffers for that statement without deallocating the prepared
statement itself. The statement can then be re-executed with fresh parameters.
SHOW STATUS counter
SHOW STATUS LIKE 'Com_stmt_send_long_data' reports how many long-data chunks
have been received by the current session (session scope) or by the server since
startup (global scope).
SHOW STATUS LIKE 'Com_stmt_send_long_data';
-- Variable_name | Value
-- Com_stmt_send_long_data | 3
Expressions and Operators
An expression is any construct that evaluates to a value. Expressions appear in SELECT projections, WHERE conditions, ORDER BY clauses, CHECK constraints, and DEFAULT values.
Operator Precedence
From highest to lowest binding (higher = evaluated first):
| Level | Operators | Associativity |
|---|---|---|
| 1 | () parentheses | — |
| 2 | Unary -, NOT | Right |
| 3 | *, /, % | Left |
| 4 | +, - | Left |
| 5 | =, <>, !=, <, <=, >, >= | — |
| 6 | IS NULL, IS NOT NULL, BETWEEN, LIKE, IN | — |
| 7 | AND | Left |
| 8 | OR | Left |
Use parentheses to make complex expressions explicit:
-- Without parens: AND binds tighter than OR
SELECT * FROM orders WHERE status = 'paid' OR status = 'shipped' AND total > 100;
-- Parsed as: status = 'paid' OR (status = 'shipped' AND total > 100)
-- Explicit grouping
SELECT * FROM orders WHERE (status = 'paid' OR status = 'shipped') AND total > 100;
Arithmetic Operators
| Operator | Meaning | Example | Result |
|---|---|---|---|
+ | Addition | price + tax | — |
- | Subtraction | stock - sold | — |
* | Multiplication | quantity * unit_price | — |
/ | Division | total / 1.19 | — |
% | Modulo | id % 10 | 0–9 |
Integer division truncates toward zero: 7 / 2 = 3.
Division by zero raises a runtime error (22012 division_by_zero).
SELECT
price,
price * 0.19 AS tax,
price * 1.19 AS price_with_tax,
ROUND(price, 2) AS rounded
FROM products;
Comparison Operators
| Operator | Meaning | NULL behavior |
|---|---|---|
= | Equal | Returns NULL if either operand is NULL |
<>, != | Not equal | Returns NULL if either operand is NULL |
< | Less than | Returns NULL if either operand is NULL |
<= | Less than or equal | Returns NULL if either operand is NULL |
> | Greater than | Returns NULL if either operand is NULL |
>= | Greater than or equal | Returns NULL if either operand is NULL |
SELECT * FROM products WHERE price = 49.99;
SELECT * FROM products WHERE stock <> 0;
SELECT * FROM orders WHERE total >= 100;
Boolean Operators
| Operator | Meaning |
|---|---|
AND | TRUE only if both operands are TRUE |
OR | TRUE if at least one operand is TRUE |
NOT | Negates a boolean value |
NULL Semantics — Three-Valued Logic
AxiomDB implements SQL three-valued logic: every boolean expression evaluates to TRUE, FALSE, or UNKNOWN (which SQL represents as NULL in boolean context). The rules below are critical for writing correct WHERE clauses.
AND truth table
| AND | TRUE | FALSE | UNKNOWN |
|---|---|---|---|
| TRUE | TRUE | FALSE | UNKNOWN |
| FALSE | FALSE | FALSE | FALSE |
| UNKNOWN | UNKNOWN | FALSE | UNKNOWN |
OR truth table
| OR | TRUE | FALSE | UNKNOWN |
|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE |
| FALSE | TRUE | FALSE | UNKNOWN |
| UNKNOWN | TRUE | UNKNOWN | UNKNOWN |
NOT truth table
| NOT | Result |
|---|---|
| TRUE | FALSE |
| FALSE | TRUE |
| UNKNOWN | UNKNOWN |
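The three truth tables above can be encoded directly, with Python's None standing in for UNKNOWN. This is a sketch of the semantics, not AxiomDB's evaluator:

```python
# SQL three-valued logic: TRUE/FALSE/UNKNOWN, with None as UNKNOWN.
def sql_and(a, b):
    if a is False or b is False:
        return False              # FALSE dominates AND
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    if a is True or b is True:
        return True               # TRUE dominates OR
    if a is None or b is None:
        return None
    return False

def sql_not(a):
    return None if a is None else (not a)

print(sql_and(True, None))    # None  (UNKNOWN)
print(sql_and(False, None))   # False
print(sql_or(True, None))     # True
print(sql_or(False, None))    # None  (UNKNOWN)
print(sql_not(None))          # None  (UNKNOWN)
```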
Key consequences
-- NULL compared to anything is UNKNOWN, not TRUE or FALSE
SELECT NULL = NULL; -- UNKNOWN (NULL, not TRUE)
SELECT NULL <> NULL; -- UNKNOWN
SELECT NULL = 1; -- UNKNOWN
-- WHERE filters only rows where condition is TRUE
-- Rows where the condition is UNKNOWN are excluded
SELECT * FROM users WHERE age = NULL; -- always returns 0 rows!
SELECT * FROM users WHERE age IS NULL; -- correct NULL check
-- UNKNOWN in AND
SELECT * FROM orders WHERE total > 100 AND NULL; -- 0 rows (UNKNOWN is filtered)
-- UNKNOWN in OR
SELECT * FROM orders WHERE total > 100 OR NULL; -- rows where total > 100
IS NULL / IS NOT NULL
These predicates are the correct way to check for NULL. They always return TRUE or FALSE, never UNKNOWN.
-- Find unshipped orders
SELECT * FROM orders WHERE shipped_at IS NULL;
-- Find orders that have been shipped
SELECT * FROM orders WHERE shipped_at IS NOT NULL;
-- Combine with other conditions
SELECT * FROM users WHERE deleted_at IS NULL AND age > 18;
BETWEEN
BETWEEN low AND high is inclusive on both ends. Equivalent to >= low AND <= high.
-- Products priced between $10 and $50 inclusive
SELECT * FROM products WHERE price BETWEEN 10 AND 50;
-- Orders placed in Q1 2026
SELECT * FROM orders
WHERE placed_at BETWEEN '2026-01-01 00:00:00' AND '2026-03-31 23:59:59';
-- NOT BETWEEN
SELECT * FROM products WHERE price NOT BETWEEN 10 AND 50;
LIKE — Pattern Matching
LIKE matches strings against a pattern.
| Wildcard | Meaning |
|---|---|
% | Any sequence of zero or more characters |
_ | Exactly one character |
Pattern matching is case-sensitive by default. Use CITEXT columns or ILIKE
for case-insensitive matching.
-- Emails from example.com
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Names starting with 'Al'
SELECT * FROM users WHERE name LIKE 'Al%';
-- Exactly 5-character codes
SELECT * FROM products WHERE sku LIKE '_____';
-- NOT LIKE
SELECT * FROM users WHERE email NOT LIKE '%@test.%';
-- Escape a literal %
SELECT * FROM products WHERE description LIKE '50\% off' ESCAPE '\';
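The wildcard and ESCAPE rules above map cleanly onto a regular expression. This is an illustrative translation of LIKE semantics into Python — nothing in this document says AxiomDB's matcher is regex-based, and the function name is invented:

```python
# Translate a LIKE pattern into an anchored regex:
#   %  -> .*   (any sequence of zero or more characters)
#   _  -> .    (exactly one character)
#   <escape>x -> literal x (so '50\% off' matches a literal percent sign)
import re

def like_to_regex(pattern: str, escape: str = "\\"):
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))   # escaped literal
            i += 2
            continue
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("^" + "".join(out) + "$", re.DOTALL)

print(bool(like_to_regex("Al%").match("Alice")))         # True
print(bool(like_to_regex("_____").match("AB123")))       # True
print(bool(like_to_regex("50\\% off").match("50% off"))) # True
print(bool(like_to_regex("Al%").match("alice")))         # False — case-sensitive
```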
IN — Membership Test
IN checks whether a value matches any element in a list.
-- Multiple status values
SELECT * FROM orders WHERE status IN ('pending', 'paid', 'shipped');
-- Numeric list
SELECT * FROM products WHERE category_id IN (1, 3, 7);
-- NOT IN
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
NOT IN (list) returns UNKNOWN (and therefore filters the row) whenever the list contains a NULL and the value matches no non-NULL element. Use NOT EXISTS or explicit NULL checks when the list may contain NULLs.
-- Safe: explicit list with no NULLs
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
-- Dangerous if user_id can be NULL:
SELECT * FROM orders WHERE user_id NOT IN (SELECT id FROM banned_users);
-- If banned_users contains even one NULL user, this returns 0 rows!
-- Safe alternative:
SELECT * FROM orders o
WHERE NOT EXISTS (
SELECT 1 FROM banned_users b WHERE b.id = o.user_id AND b.id IS NOT NULL
);
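The NOT IN pitfall follows directly from three-valued logic: x NOT IN (a, b, NULL) expands to x <> a AND x <> b AND x <> NULL, and the last conjunct is always UNKNOWN, so the whole expression can be FALSE or UNKNOWN but never TRUE. A sketch with None as UNKNOWN (illustrative only):

```python
# NOT IN under three-valued logic: a NULL in the list "taints" every
# non-matching row with UNKNOWN, so WHERE filters it out.
def sql_eq(a, b):
    return None if a is None or b is None else a == b

def sql_not_in(x, values):
    result = True
    for v in values:
        eq = sql_eq(x, v)
        if eq is True:
            return False          # a match makes NOT IN definitively FALSE
        if eq is None:
            result = None         # NULL comparison taints the result
    return result

print(sql_not_in(5, [1, 2, 3]))      # True  — no NULLs, no match
print(sql_not_in(5, [1, None, 3]))   # None  — UNKNOWN: row is filtered out
print(sql_not_in(2, [1, 2, None]))   # False — a match wins regardless of NULL
```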
Scalar Functions
Numeric Functions
| Function | Description | Example |
|---|---|---|
ABS(x) | Absolute value | ABS(-5) → 5 |
CEIL(x) | Ceiling (round up) | CEIL(1.2) → 2 |
FLOOR(x) | Floor (round down) | FLOOR(1.9) → 1 |
ROUND(x, d) | Round to d decimal places | ROUND(3.14159, 2) → 3.14 |
MOD(x, y) | Modulo | MOD(10, 3) → 1 |
POWER(x, y) | x raised to the power y | POWER(2, 8) → 256 |
SQRT(x) | Square root | SQRT(16) → 4 |
String Functions
| Function | Description | Example |
|---|---|---|
LENGTH(s) | Number of bytes | LENGTH('hello') → 5 |
CHAR_LENGTH(s) | Number of UTF-8 characters | CHAR_LENGTH('café') → 4 |
UPPER(s) | Convert to uppercase | UPPER('hello') → 'HELLO' |
LOWER(s) | Convert to lowercase | LOWER('HELLO') → 'hello' |
TRIM(s) | Remove leading and trailing spaces | TRIM(' hi ') → 'hi' |
LTRIM(s) | Remove leading spaces | — |
RTRIM(s) | Remove trailing spaces | — |
SUBSTR(s, pos, len) | Substring from position (1-indexed) | SUBSTR('hello', 2, 3) → 'ell' |
CONCAT(a, b, ...) | Concatenate strings | CONCAT('foo', 'bar') → 'foobar' |
REPLACE(s, from, to) | Replace all occurrences | REPLACE('aabbcc', 'bb', 'X') → 'aaXcc' |
LPAD(s, n, pad) | Pad on the left to length n | LPAD('42', 5, '0') → '00042' |
RPAD(s, n, pad) | Pad on the right to length n | — |
String Concatenation — ||
The || operator concatenates two string values. It is the SQL-standard alternative
to CONCAT() and works in any expression context.
-- Build a full name from two columns
SELECT first_name || ' ' || last_name AS full_name FROM users;
-- Append a suffix
SELECT sku || '-v2' AS new_sku FROM products;
-- NULL propagates: if either operand is NULL the result is NULL
SELECT 'hello' || NULL; -- NULL
Use COALESCE to guard against NULL operands:
SELECT COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') AS full_name
FROM users;
CAST — Explicit Type Conversion
CAST(expr AS type) converts a value to the specified type. Use it when an implicit
coercion would be rejected in strict mode (the default).
-- Text-to-number: always works when the text is a valid number
SELECT CAST('42' AS INT); -- 42
SELECT CAST('3.14' AS REAL); -- 3.14
SELECT CAST('100' AS BIGINT); -- 100
-- Use CAST to store a text literal in a numeric column
INSERT INTO users (age) VALUES (CAST('30' AS INT));
CAST(numeric AS TEXT) — converting an integer or real value to text — is not
supported in the current release and raises 22018 invalid_character_value_for_cast.
Use application-side formatting or wait for Phase 5 (full coercion matrix). The supported
direction is text → number, not number → text.
Supported CAST pairs (Phase 4.16):
| From | To | Notes |
|---|---|---|
TEXT | INT, BIGINT | Entire string must be a valid integer |
TEXT | REAL | Entire string must be a valid float |
TEXT | DECIMAL | Entire string must be a valid decimal |
INT | BIGINT, REAL, DECIMAL | Widening — always succeeds |
BIGINT | REAL, DECIMAL | Widening — always succeeds |
NULL | any | Always returns NULL |
Conditional Functions
| Function | Description |
|---|---|
COALESCE(a, b, ...) | Return first non-NULL argument |
NULLIF(a, b) | Return NULL if a = b, otherwise return a |
IIF(cond, then, else) | Inline if-then-else |
CASE WHEN ... THEN ... END | General conditional expression |
-- COALESCE: display a fallback when the column is NULL
SELECT name, COALESCE(phone, 'N/A') AS contact FROM users;
-- NULLIF: convert 'unknown' to NULL (for aggregate functions to ignore)
SELECT AVG(NULLIF(rating, 0)) AS avg_rating FROM products;
-- CASE: categorize order size
SELECT
id,
total,
CASE
WHEN total < 50 THEN 'small'
WHEN total < 200 THEN 'medium'
WHEN total < 1000 THEN 'large'
ELSE 'enterprise'
END AS order_size
FROM orders;
CASE WHEN — Conditional Expressions
CASE WHEN is a general-purpose conditional expression that can appear anywhere an
expression is valid: SELECT projections, WHERE clauses, ORDER BY, GROUP BY, HAVING,
and as arguments to aggregate functions.
AxiomDB supports two forms: searched CASE (any boolean condition per branch) and simple CASE (equality comparison against a single value).
Searched CASE
Evaluates each WHEN condition left to right and returns the THEN value of the
first condition that is TRUE. If no condition matches and an ELSE is present, the
ELSE value is returned. If no condition matches and there is no ELSE, the result
is NULL.
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
[ELSE default_result]
END
-- Categorize orders by total amount
SELECT
id,
total,
CASE
WHEN total < 50 THEN 'small'
WHEN total < 200 THEN 'medium'
WHEN total < 1000 THEN 'large'
ELSE 'enterprise'
END AS order_size
FROM orders;
-- Compute a human-readable status label, including NULL handling
SELECT
id,
CASE
WHEN shipped_at IS NULL AND status = 'paid' THEN 'awaiting shipment'
WHEN shipped_at IS NOT NULL THEN 'shipped'
WHEN status = 'cancelled' THEN 'cancelled'
ELSE 'unknown'
END AS display_status
FROM orders;
Simple CASE
Compares a single expression against a list of values. Equivalent to a searched CASE
using = for each WHEN comparison.
CASE expression
WHEN value1 THEN result1
WHEN value2 THEN result2
...
[ELSE default_result]
END
-- Map status codes to display labels
SELECT
id,
CASE status
WHEN 'pending' THEN 'Pending Payment'
WHEN 'paid' THEN 'Paid'
WHEN 'shipped' THEN 'Shipped'
WHEN 'delivered' THEN 'Delivered'
WHEN 'cancelled' THEN 'Cancelled'
ELSE 'Unknown'
END AS status_label
FROM orders;
NULL Semantics in CASE
In a searched CASE, a WHEN condition that evaluates to UNKNOWN (NULL in boolean
context) is treated the same as FALSE — it does not match, and evaluation continues
to the next branch. This means a NULL condition never triggers a THEN clause.
In a simple CASE, the comparison expression = value uses standard SQL equality,
which returns UNKNOWN when either side is NULL. As a result, WHEN NULL never matches.
Use a searched CASE with IS NULL to handle NULL values explicitly.
-- Simple CASE: WHEN NULL never matches (NULL = NULL yields UNKNOWN, not TRUE)
SELECT CASE NULL WHEN NULL THEN 'matched' ELSE 'no match' END;
-- Result: 'no match'
-- Correct way to handle NULL in a simple CASE: use searched form
SELECT
CASE
WHEN status IS NULL THEN 'no status'
ELSE status
END AS safe_status
FROM orders;
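The UNKNOWN-as-no-match rule can be modeled with a tiny evaluator in which Python's None stands in for SQL NULL/UNKNOWN. An illustrative sketch, not engine code:

```python
def searched_case(branches, else_value=None):
    """Evaluate (condition, result) pairs left to right.
    Only a condition of True fires its THEN; False and None (UNKNOWN)
    both fall through to the next branch."""
    for condition, result in branches:
        if condition is True:       # UNKNOWN never triggers a THEN clause
            return result
    return else_value               # missing ELSE defaults to NULL (None)

status = None
# WHEN status = 'paid' is UNKNOWN when status is NULL, so it falls through;
# WHEN status IS NULL is TRUE, never UNKNOWN, so it matches.
print(searched_case([
    (None if status is None else status == 'paid', 'paid order'),
    (status is None, 'no status'),
], 'unknown'))                      # 'no status'
```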
CASE in ORDER BY — Controlled Sort Order
CASE can produce a sort key that cannot be expressed with a single column reference.
-- Sort orders: unshipped first (status='paid'), then by recency
SELECT id, status, placed_at
FROM orders
ORDER BY
CASE WHEN status = 'paid' AND shipped_at IS NULL THEN 0 ELSE 1 END,
placed_at DESC;
CASE in GROUP BY — Dynamic Grouping
-- Group products by price tier and count items per tier
SELECT
CASE
WHEN price < 25 THEN 'budget'
WHEN price < 100 THEN 'mid-range'
ELSE 'premium'
END AS tier,
COUNT(*) AS product_count,
AVG(price) AS avg_price
FROM products
WHERE deleted_at IS NULL
GROUP BY
CASE
WHEN price < 25 THEN 'budget'
WHEN price < 100 THEN 'mid-range'
ELSE 'premium'
END
ORDER BY avg_price;
Design note: AxiomDB evaluates CASE expressions during row processing in the executor's expression evaluator. Short-circuit evaluation guarantees that branches after the first matching WHEN are never evaluated, which prevents side effects (e.g., division by zero in an unreachable branch).
Date / Time Functions
Current date / time
| Function | Return type | Description |
|---|---|---|
NOW() | TIMESTAMP | Current timestamp (UTC) |
CURRENT_DATE | DATE | Current date (no time) |
CURRENT_TIME | TIMESTAMP | Current time (no date) |
CURRENT_TIMESTAMP | TIMESTAMP | Alias for NOW() |
UNIX_TIMESTAMP() | BIGINT | Current time as Unix seconds |
Date component extractors
| Function | Returns | Description |
|---|---|---|
year(val) | INT | Year (e.g. 2025) |
month(val) | INT | Month 1–12 |
day(val) | INT | Day of month 1–31 |
hour(val) | INT | Hour 0–23 |
minute(val) | INT | Minute 0–59 |
second(val) | INT | Second 0–59 |
DATEDIFF(a, b) | INT | Days between two dates (a - b) |
val accepts DATE, TIMESTAMP, or a text string coercible to a date.
Returns NULL if the input is NULL or not a valid date type.
SELECT year(NOW()), month(NOW()), day(NOW()); -- e.g. 2025, 3, 25
SELECT hour(NOW()), minute(NOW()), second(NOW()); -- e.g. 14, 30, 45
DATE_FORMAT — format a date as text
DATE_FORMAT(ts, format_string) → TEXT
Formats a DATE or TIMESTAMP value using MySQL-compatible format specifiers.
Returns NULL if either argument is NULL or the format string is empty.
| Specifier | Description | Example |
|---|---|---|
%Y | 4-digit year | 2025 |
%y | 2-digit year | 25 |
%m | Month 01–12 | 03 |
%c | Month 1–12 (no pad) | 3 |
%M | Full month name | March |
%b | Abbreviated month name | Mar |
%d | Day 01–31 | 05 |
%e | Day 1–31 (no pad) | 5 |
%H | Hour 00–23 | 14 |
%h | Hour 01–12 (12-hour) | 02 |
%i | Minute 00–59 | 30 |
%s/%S | Second 00–59 | 45 |
%p | AM / PM | PM |
%W | Full weekday name | Tuesday |
%a | Abbreviated weekday | Tue |
%j | Day of year 001–366 | 084 |
%w | Weekday 0=Sun…6=Sat | 2 |
%T | Time HH:MM:SS (24h) | 14:30:45 |
%r | Time HH:MM:SS AM/PM | 02:30:45 PM |
%% | Literal % | % |
Unknown specifiers are passed through literally (%X → %X).
-- Format a stored timestamp as ISO date
SELECT DATE_FORMAT(created_at, '%Y-%m-%d') FROM orders;
-- '2025-03-25'
-- European date format
SELECT DATE_FORMAT(NOW(), '%d/%m/%Y');
-- '25/03/2025'
-- Full datetime
SELECT DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i:%s');
-- '2025-03-25 14:30:45'
-- NULL input → NULL output
SELECT DATE_FORMAT(NULL, '%Y-%m-%d'); -- NULL
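The specifier table above lends itself to a table-driven formatter. This hypothetical Python sketch covers only a handful of specifiers (%Y, %m, %d, %H, %i, %s) to show the shape of the mapping, including NULL propagation and literal pass-through of unknown specifiers; the real implementation is Rust:

```python
from datetime import datetime

# Map MySQL format specifiers to extractor functions (subset for illustration)
SPEC = {
    'Y': lambda t: '%04d' % t.year,
    'm': lambda t: '%02d' % t.month,
    'd': lambda t: '%02d' % t.day,
    'H': lambda t: '%02d' % t.hour,
    'i': lambda t: '%02d' % t.minute,   # MySQL: %i means minutes
    's': lambda t: '%02d' % t.second,
    '%': lambda t: '%',
}

def date_format(ts, fmt):
    if ts is None or not fmt:
        return None                      # NULL input or empty format -> NULL
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == '%' and i + 1 < len(fmt):
            spec = fmt[i + 1]
            # unknown specifiers pass through literally (%X -> %X)
            out.append(SPEC[spec](ts) if spec in SPEC else '%' + spec)
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return ''.join(out)

t = datetime(2025, 3, 25, 14, 30, 45)
print(date_format(t, '%Y-%m-%d %H:%i:%s'))   # 2025-03-25 14:30:45
```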
STR_TO_DATE — parse a date string
STR_TO_DATE(str, format_string) → DATE | TIMESTAMP | NULL
Parses a text string into a date or timestamp using MySQL-compatible format
specifiers (same table as DATE_FORMAT above).
- Returns DATE if the format contains only date components.
- Returns TIMESTAMP if the format contains any time components (%H, %i, %s).
- Returns NULL on any parse failure — never raises an error (MySQL behavior).
- Returns NULL if either argument is NULL.
2-digit year rule (%y): 00–69 → 2000–2069; 70–99 → 1970–1999.
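The pivot rule amounts to a one-line conversion (illustrative sketch):

```python
def expand_two_digit_year(yy):
    """MySQL %y rule: 00-69 -> 2000-2069, 70-99 -> 1970-1999."""
    return 2000 + yy if yy <= 69 else 1900 + yy

print(expand_two_digit_year(25))   # 2025
print(expand_two_digit_year(70))   # 1970
print(expand_two_digit_year(99))   # 1999
```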
-- Parse ISO date → Value::Date
SELECT STR_TO_DATE('2025-03-25', '%Y-%m-%d');
-- Parse European date → Value::Date
SELECT STR_TO_DATE('25/03/2025', '%d/%m/%Y');
-- Parse datetime → Value::Timestamp
SELECT STR_TO_DATE('2025-03-25 14:30:00', '%Y-%m-%d %H:%i:%s');
-- Extract components from a parsed date
SELECT year(STR_TO_DATE('2025-03-25', '%Y-%m-%d')); -- 2025
-- Round-trip: parse then format
SELECT DATE_FORMAT(STR_TO_DATE('2025-03-25', '%Y-%m-%d'), '%d/%m/%Y');
-- '25/03/2025'
-- Invalid date → NULL (Feb 30 does not exist)
SELECT STR_TO_DATE('2025-02-30', '%Y-%m-%d'); -- NULL
-- Bad format → NULL (never an error)
SELECT STR_TO_DATE('not-a-date', '%Y-%m-%d'); -- NULL
FIND_IN_SET — search a comma-separated list
FIND_IN_SET(needle, csv_list) → INT
Returns the 1-indexed position of needle in the comma-separated string
csv_list. Returns 0 if not found. Comparison is case-insensitive.
Returns NULL if either argument is NULL.
SELECT FIND_IN_SET('b', 'a,b,c'); -- 2
SELECT FIND_IN_SET('B', 'a,b,c'); -- 2 (case-insensitive)
SELECT FIND_IN_SET('z', 'a,b,c'); -- 0 (not found)
SELECT FIND_IN_SET('a', ''); -- 0 (empty list)
SELECT FIND_IN_SET(NULL, 'a,b,c'); -- NULL
Useful for querying rows where a column holds a comma-separated tag list:
SELECT * FROM articles WHERE FIND_IN_SET('rust', tags) > 0;
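The documented semantics (1-indexed position, case-insensitive comparison, 0 when absent, NULL propagation) fit in a short reference implementation. This Python sketch is for illustration, not the engine's code:

```python
def find_in_set(needle, csv_list):
    """1-indexed position of needle in a comma-separated list;
    0 if absent; case-insensitive; NULL (None) propagates."""
    if needle is None or csv_list is None:
        return None
    if csv_list == '':
        return 0                          # empty list never matches
    for pos, item in enumerate(csv_list.split(','), start=1):
        if item.lower() == needle.lower():
            return pos
    return 0

print(find_in_set('B', 'a,b,c'))          # 2 (case-insensitive)
print(find_in_set('z', 'a,b,c'))          # 0 (not found)
print(find_in_set('a', ''))               # 0 (empty list)
```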
Design note: DATE_FORMAT and STR_TO_DATE map specifiers manually instead of delegating to chrono's format strings, whose grammar does not match MySQL's one-for-one (for example, MySQL's %m is the zero-padded month, but chrono assigns %m different semantics). Manual mapping guarantees exact MySQL semantics for all 18 specifiers, including %T, %r, and the 2-digit year rules, without risking divergence from the underlying library's format grammar.
-- DATE_TRUNC and DATE_PART (PostgreSQL-compatible aliases)
SELECT DATE_TRUNC('month', placed_at) AS month, COUNT(*) FROM orders GROUP BY 1;
SELECT DATE_PART('year', created_at) AS signup_year FROM users;
Session Functions
Session functions return state that is specific to the current connection and is not visible to other sessions.
| Function | Return type | Description |
|---|---|---|
LAST_INSERT_ID() | BIGINT | ID generated by the most recent AUTO_INCREMENT INSERT in this session |
lastval() | BIGINT | PostgreSQL-compatible alias for LAST_INSERT_ID() |
version() | TEXT | Server version string, e.g. '8.0.36-AxiomDB-0.1.0' |
current_user() | TEXT | Authenticated username of the current connection |
session_user() | TEXT | Alias for current_user() |
current_database() | TEXT | Name of the current database ('axiomdb') |
database() | TEXT | MySQL-compatible alias for current_database() |
-- Commonly called by ORMs on connect to verify server identity
SELECT version(); -- '8.0.36-AxiomDB-0.1.0'
SELECT current_user(); -- 'root'
SELECT current_database(); -- 'axiomdb'
Semantics:
- Returns 0 if no AUTO_INCREMENT INSERT has occurred in the current session.
- For a single-row INSERT, returns the generated ID.
- For a multi-row INSERT (INSERT INTO t VALUES (...), (...), ...), returns the ID generated for the first row of the batch (MySQL semantics). Subsequent rows receive consecutive IDs.
- Inserting an explicit non-NULL value into an AUTO_INCREMENT column does not advance the sequence and does not update LAST_INSERT_ID().
- TRUNCATE TABLE resets the sequence to 1 but does not change the session's LAST_INSERT_ID() value.
CREATE TABLE items (id BIGINT PRIMARY KEY AUTO_INCREMENT, name TEXT);
-- Single-row INSERT
INSERT INTO items (name) VALUES ('Widget');
SELECT LAST_INSERT_ID(); -- 1
SELECT lastval(); -- 1
-- Multi-row INSERT
INSERT INTO items (name) VALUES ('Gadget'), ('Gizmo'), ('Doohickey');
SELECT LAST_INSERT_ID(); -- 2 (first generated ID in the batch)
-- Explicit value — does not change LAST_INSERT_ID()
INSERT INTO items (id, name) VALUES (99, 'Special');
SELECT LAST_INSERT_ID(); -- still 2
-- Use inside the same statement (e.g., insert a child row)
INSERT INTO orders (user_id, item_id) VALUES (42, LAST_INSERT_ID());
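The batch and explicit-value rules above can be modeled per session. This sketch follows the documented semantics; the Session class and its method names are hypothetical, not part of any real API:

```python
class Session:
    def __init__(self):
        self.next_id = 1          # table's AUTO_INCREMENT counter
        self.last_insert_id = 0   # 0 until the first generating INSERT

    def insert_rows(self, rows):
        """Multi-row INSERT: LAST_INSERT_ID() becomes the ID generated
        for the FIRST row of the batch (MySQL semantics); later rows
        receive consecutive IDs."""
        first = self.next_id
        self.next_id += len(rows)
        self.last_insert_id = first
        return first

    def insert_explicit(self, explicit_id):
        """Explicit non-NULL id: per the docs, neither the sequence
        nor LAST_INSERT_ID() is touched."""
        return explicit_id

s = Session()
s.insert_rows(['Widget'])
print(s.last_insert_id)               # 1
s.insert_rows(['Gadget', 'Gizmo', 'Doohickey'])
print(s.last_insert_id)               # 2 (first ID of the batch)
s.insert_explicit(99)
print(s.last_insert_id)               # still 2
```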
Aggregate Functions
| Function | Description | NULL behavior |
|---|---|---|
COUNT(*) | Count all rows in the group | Includes NULL rows |
COUNT(col) | Count non-NULL values in col | Excludes NULL values |
SUM(col) | Sum of non-NULL values | Returns NULL if all NULL |
AVG(col) | Arithmetic mean of non-NULL values | Returns NULL if all NULL |
MIN(col) | Minimum non-NULL value | Returns NULL if all NULL |
MAX(col) | Maximum non-NULL value | Returns NULL if all NULL |
SELECT
COUNT(*) AS total_rows,
COUNT(email) AS rows_with_email, -- excludes NULL
SUM(total) AS gross_revenue,
AVG(total) AS avg_order_value,
MIN(placed_at) AS first_order,
MAX(placed_at) AS last_order
FROM orders
WHERE status != 'cancelled';
GROUP_CONCAT — String Aggregation
GROUP_CONCAT concatenates non-NULL values across the rows of a group into a single
string. It is MySQL’s most widely-used aggregate function for collecting tags, roles,
categories, and comma-separated lists without a client-side join.
string_agg(expr, separator) is the PostgreSQL-compatible alias.
Syntax
GROUP_CONCAT([DISTINCT] expr [ORDER BY col [ASC|DESC], ...] [SEPARATOR 'str'])
string_agg(expr, separator)
| Clause | Default | Description |
|---|---|---|
DISTINCT | off | Deduplicate values before concatenating |
ORDER BY | none | Sort values within the group before joining |
SEPARATOR | ',' | String inserted between values |
Behavior
- NULL values are skipped — they do not appear in the result and do not add a separator.
- An empty group (no rows) or a group where every value is NULL returns NULL.
- A single value returns that value with no separator added.
- The result is truncated to a 1 MB (1,048,576 bytes) maximum.
-- Basic: comma-separated tags per post
SELECT post_id, GROUP_CONCAT(tag ORDER BY tag ASC)
FROM post_tags
GROUP BY post_id;
-- post 1 → 'async,db,rust'
-- post 2 → 'rust,web'
-- post 3 (all NULL tags) → NULL
-- Custom separator
SELECT GROUP_CONCAT(tag ORDER BY tag ASC SEPARATOR ' | ')
FROM post_tags
WHERE post_id = 1;
-- → 'async | db | rust'
-- DISTINCT: deduplicate before joining
SELECT GROUP_CONCAT(DISTINCT tag ORDER BY tag ASC)
FROM tags;
-- Duplicate 'rust' rows → 'async,db,rust' (appears once)
-- string_agg PostgreSQL alias
SELECT string_agg(tag, ', ')
FROM post_tags
WHERE post_id = 2;
-- → 'rust, web' (or 'web, rust' — insertion order)
-- HAVING on a GROUP_CONCAT result
SELECT post_id, GROUP_CONCAT(tag ORDER BY tag ASC) AS tags
FROM post_tags
GROUP BY post_id
HAVING GROUP_CONCAT(tag ORDER BY tag ASC) LIKE '%rust%';
-- Only posts that have the 'rust' tag
-- Collect integers as text
SELECT GROUP_CONCAT(n ORDER BY n ASC) FROM nums;
-- 1, 2, 3 → '1,2,3'
AxiomDB supports the full MySQL GROUP_CONCAT syntax, including DISTINCT, multi-column ORDER BY, and the SEPARATOR keyword. MySQL codebases that use GROUP_CONCAT for tag or role lists migrate without modification.
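The behavior rules condense into a short reference implementation. This Python sketch mirrors the documented defaults and is not the engine's code; the byte-exact 1 MB cap is approximated at character granularity:

```python
def group_concat(values, distinct=False, order=False, separator=','):
    """Concatenate non-NULL values; NULL-only or empty input -> NULL (None).
    Defaults mirror the docs: separator ',', no dedup, no sort."""
    vals = [str(v) for v in values if v is not None]   # NULLs are skipped
    if not vals:
        return None                                     # empty group -> NULL
    if distinct:
        vals = list(dict.fromkeys(vals))                # dedup, keep first seen
    if order:
        vals = sorted(vals)
    out = separator.join(vals)                          # single value: no separator
    return out[:1048576]                                # approximate 1 MB cap

print(group_concat(['db', 'rust', 'async'], order=True))      # async,db,rust
print(group_concat([None, None]))                             # None
print(group_concat(['rust', 'rust', 'web'], distinct=True))   # rust,web
```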
BLOB / Binary Functions
AxiomDB stores binary data as the BLOB / BYTES type and provides functions for
encoding, decoding, and measuring binary values.
| Function | Returns | Description |
|---|---|---|
FROM_BASE64(text) | BLOB | Decode standard base64 → raw bytes. Returns NULL on invalid input. |
TO_BASE64(blob) | TEXT | Encode raw bytes → base64 string. Also accepts TEXT and UUID. |
OCTET_LENGTH(value) | INT | Byte length of a BLOB, TEXT (UTF-8 bytes), or UUID (always 16). |
ENCODE(blob, fmt) | TEXT | Encode bytes as 'base64' or 'hex'. |
DECODE(text, fmt) | BLOB | Decode 'base64' or 'hex' text → raw bytes. |
Usage examples
-- Store binary data encoded as base64
INSERT INTO files (name, data)
VALUES ('logo.png', FROM_BASE64('iVBORw0KGgoAAAANSUhEUgAA...'));
-- Retrieve as base64 for transport
SELECT name, TO_BASE64(data) AS data_b64 FROM files;
-- Check byte size of a blob
SELECT name, OCTET_LENGTH(data) AS size_bytes FROM files;
-- Hex encoding (PostgreSQL / MySQL ENCODE style)
SELECT ENCODE(data, 'hex') FROM files; -- → 'deadbeef...'
SELECT DECODE('deadbeef', 'hex'); -- → binary bytes
-- OCTET_LENGTH vs LENGTH for text
SELECT LENGTH('héllo'); -- 5 (characters)
SELECT OCTET_LENGTH('héllo'); -- 6 (UTF-8 bytes: é = 2 bytes)
A common pattern: the application SELECTs TO_BASE64(data) to get a transport-safe string, and reverses it with FROM_BASE64() on INSERT. This avoids binary-encoding issues in the MySQL wire protocol's text mode.
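ENCODE/DECODE and the OCTET_LENGTH distinction can be reproduced with Python's standard library. A sketch of the documented behavior, not AxiomDB internals:

```python
import base64

def encode(data: bytes, fmt: str) -> str:
    """ENCODE(blob, fmt): bytes -> 'base64' or 'hex' text."""
    if fmt == 'base64':
        return base64.b64encode(data).decode('ascii')
    if fmt == 'hex':
        return data.hex()
    raise ValueError('unknown format')

def decode(text: str, fmt: str) -> bytes:
    """DECODE(text, fmt): 'base64' or 'hex' text -> raw bytes."""
    if fmt == 'base64':
        return base64.b64decode(text)
    if fmt == 'hex':
        return bytes.fromhex(text)
    raise ValueError('unknown format')

print(encode(b'\xde\xad\xbe\xef', 'hex'))           # deadbeef
print(decode('deadbeef', 'hex'))                    # b'\xde\xad\xbe\xef'
# OCTET_LENGTH counts UTF-8 bytes, not characters
print(len('héllo'), len('héllo'.encode('utf-8')))   # 5 6
```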
UUID Functions
AxiomDB generates and validates UUIDs server-side. No application-level library needed — the DB handles UUID primary keys directly.
| Function | Returns | Description |
|---|---|---|
gen_random_uuid() | UUID | UUID v4 — 122 random bits. Aliases: uuid_generate_v4(), random_uuid(), newid() |
uuid_generate_v7() | UUID | UUID v7 — 48-bit unix timestamp + random bits. Alias: uuid7() |
is_valid_uuid(text) | BOOL | TRUE if text is a valid UUID string (hyphenated or compact). Alias: is_uuid(). Returns NULL if arg is NULL. |
Usage
-- Auto-generate a UUID primary key at insert time
CREATE TABLE events (
id UUID NOT NULL,
name TEXT NOT NULL
);
INSERT INTO events (id, name)
VALUES (gen_random_uuid(), 'page_view');
-- Use UUID v7 for tables that benefit from time-ordered inserts
INSERT INTO events (id, name)
VALUES (uuid_generate_v7(), 'checkout');
-- Validate an incoming UUID string before inserting
SELECT is_valid_uuid('550e8400-e29b-41d4-a716-446655440000'); -- TRUE
SELECT is_valid_uuid('not-a-uuid'); -- FALSE
SELECT is_valid_uuid(NULL); -- NULL
UUID v4 vs UUID v7 — which to use?
-- UUID v4: fully random, best for security-sensitive IDs
-- Format: xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx (122 random bits)
SELECT gen_random_uuid();
-- → 'f47ac10b-58cc-4372-a567-0e02b2c3d479'
-- UUID v7: time-ordered prefix, best for primary keys on B+ Tree indexes
-- Format: [48-bit ms timestamp]-[12-bit rand]-[62-bit rand]
SELECT uuid_generate_v7();
-- → '018e2e3a-1234-7abc-8def-0123456789ab'
-- ^^^^^^^^^^^ always increasing
Because v7 values carry a time-ordered prefix, new keys land near the rightmost B+ Tree leaf, much like AUTO_INCREMENT. For tables receiving hundreds of inserts per second, UUID v7 can be 2-5× faster than v4 for write throughput.
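The v7 layout described above (48-bit millisecond timestamp, version nibble, random tail) can be sketched as follows. This is an illustrative generator, not the engine's implementation:

```python
import os, time

def uuid7():
    """Sketch of a UUID v7: 48-bit unix-ms timestamp, version nibble 7,
    variant bits 10, remaining 74 bits random (per the layout above)."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), 'big') & 0x0FFF   # 12 bits
    rand_b = int.from_bytes(os.urandom(8), 'big') & ((1 << 62) - 1)
    value = (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    h = '%032x' % value
    return '%s-%s-%s-%s-%s' % (h[:8], h[8:12], h[12:16], h[16:20], h[20:])

u = uuid7()
print(u)
print(u[14])   # 7 (version nibble); the timestamp prefix sorts by creation time
```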
Features
Advanced AxiomDB capabilities beyond basic SQL.
- Transactions — BEGIN, COMMIT, ROLLBACK, SAVEPOINT, MVCC, isolation levels
- Catalog & Schema — system tables, SHOW TABLES, DESCRIBE, introspection queries
- Indexes — B+ Tree indexes, composite indexes, partial indexes, query planning
Transactions
A transaction is a sequence of SQL operations that execute as a single atomic unit: either all succeed (COMMIT) or none of them take effect (ROLLBACK). AxiomDB implements full ACID transactions backed by a Write-Ahead Log and Multi-Version Concurrency Control.
Basic Transaction Control
BEGIN;
-- ... SQL statements ...
COMMIT; -- make all changes permanent
BEGIN;
-- ... SQL statements ...
ROLLBACK; -- undo all changes since BEGIN
Simple Example — Money Transfer
BEGIN;
-- Debit the sender
UPDATE accounts SET balance = balance - 250.00 WHERE id = 1;
-- Credit the receiver
UPDATE accounts SET balance = balance + 250.00 WHERE id = 2;
-- Both succeed together, or neither succeeds
COMMIT;
If the connection drops after the first UPDATE but before COMMIT, the WAL records both the transaction start and the mutation. During crash recovery, AxiomDB sees no COMMIT record for this transaction and discards the partial change. Account 1 keeps its original balance.
Phases 39.11 and 39.12 extend that internal durability model to the clustered-index storage rewrite: clustered rows now have WAL-backed rollback/savepoint support and crash recovery by primary key plus exact row image. Phase 39.13 makes the first SQL-visible clustered cut: CREATE TABLE with an explicit PRIMARY KEY now creates clustered metadata and a clustered table root. Phase 39.14 extends that cut to clustered INSERT, recording clustered WAL/undo directly against the clustered PK tree. Phase 39.15 opens clustered SELECT over the same storage, Phase 39.16 extends the transaction contract to clustered UPDATE, and Phase 39.17 extends it to clustered DELETE, implemented as a delete-mark plus exact row-image undo. Phase 39.18 adds clustered VACUUM: once a clustered delete-mark is old enough to be physically safe, VACUUM table_name purges the dead row, frees any overflow chain it owned, and cleans dead bookmark entries from clustered secondary indexes. Phase 39.22 adds zero-allocation in-place UPDATE: when all SET columns are fixed-size (INT, BIGINT, REAL, BOOL, DATE, TIMESTAMP), field bytes are patched directly in the page buffer without decoding the row, and ROLLBACK reverses only the changed bytes via UndoClusteredFieldPatch — no full row image is stored in the undo log.
CREATE TABLE users (id INT PRIMARY KEY, email TEXT UNIQUE);
BEGIN;
INSERT INTO users VALUES (1, 'alice@example.com');
ROLLBACK;
That rollback now restores clustered INSERT, clustered UPDATE, and clustered DELETE state: the clustered base row goes back to its exact previous row image, and any bookmark-bearing secondary entries are deleted or reinserted to match when the statement rewrote them.
Autocommit
When no explicit BEGIN is issued, each statement executes in its own implicit
transaction and is committed automatically on success. This is the default mode.
-- Each of these is its own transaction
INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');
INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com');
To group multiple statements atomically, always use explicit BEGIN ... COMMIT.
SAVEPOINT — Partial Rollback
Savepoints mark a point within a transaction to which you can roll back without aborting the entire transaction. ORMs (Django, Rails, Sequelize) use savepoints internally for partial error recovery.
BEGIN;
INSERT INTO orders (user_id, total) VALUES (1, 99.99);
SAVEPOINT after_order;
INSERT INTO order_items (order_id, product_id, quantity) VALUES (1, 42, 1);
-- Suppose this fails a CHECK constraint
ROLLBACK TO SAVEPOINT after_order;
-- The order row still exists; only the order_item is rolled back
-- Try again with corrected data
INSERT INTO order_items (order_id, product_id, quantity) VALUES (1, 42, 0);
-- Still fails — give up entirely
ROLLBACK;
You can have multiple savepoints with different names:
BEGIN;
SAVEPOINT sp1;
-- ... work ...
SAVEPOINT sp2;
-- ... more work ...
ROLLBACK TO SAVEPOINT sp1; -- undo everything since sp1
RELEASE SAVEPOINT sp1; -- destroy the savepoint (optional cleanup)
COMMIT;
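Savepoint semantics reduce to a named stack of undo-log positions. A minimal sketch, assuming an undo log that records one entry per change:

```python
class Txn:
    def __init__(self):
        self.undo = []            # undo log: one entry per change
        self.savepoints = {}      # name -> undo-log length at SAVEPOINT time

    def change(self, desc):
        self.undo.append(desc)

    def savepoint(self, name):
        self.savepoints[name] = len(self.undo)

    def rollback_to(self, name):
        """Undo every change made after the savepoint; the savepoint itself
        survives, while savepoints created after it are destroyed."""
        keep = self.savepoints[name]
        undone = self.undo[keep:]
        del self.undo[keep:]
        self.savepoints = {n: p for n, p in self.savepoints.items() if p <= keep}
        return undone

t = Txn()
t.change('INSERT orders #1')
t.savepoint('after_order')
t.change('INSERT order_items #1')
print(t.rollback_to('after_order'))   # ['INSERT order_items #1']
print(t.undo)                         # ['INSERT orders #1']
```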
MVCC — Multi-Version Concurrency Control
AxiomDB uses MVCC plus a server-side Arc<RwLock<Database>>.
Today that means:
- read-only statements (SELECT, SHOW, metadata queries) run concurrently
- mutating statements (INSERT, UPDATE, DELETE, DDL, BEGIN/COMMIT/ROLLBACK) are serialized at whole-database granularity
- a read that is already running keeps its snapshot while another session commits
- row-level locking, deadlock detection, and SELECT ... FOR UPDATE are planned for Phases 13.7, 13.8, and 13.8b
This is good for read-heavy workloads, but it is still below MySQL/InnoDB and PostgreSQL for write concurrency because they already lock at row granularity.
How It Works
When a transaction starts, it receives a snapshot — a consistent view of the database as it existed at that moment. Other transactions may commit new changes while your transaction runs, but your snapshot does not change.
Time →
Txn A (snapshot at T=100): BEGIN → reads → reads → COMMIT
| | |
Txn B: | INSERT | COMMIT |
| | |
Txn A sees the world as it was at T=100.
Txn B's inserts are not visible to Txn A.
This is implemented via the Copy-on-Write B+ Tree: when Txn B writes a page, it creates a new copy rather than overwriting the original. Txn A holds a pointer to the old root and continues reading the old version. When Txn A commits, the old pages become eligible for reclamation.
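The snapshot mechanics can be sketched with an immutable map standing in for the CoW tree root: writers publish a new root, and readers keep whatever root they captured at BEGIN. A simplified model, not the storage engine's code:

```python
from types import MappingProxyType

# The "root" is an immutable view; commits replace it rather than mutate it.
current_root = MappingProxyType({'accounts:1': 1000})

def begin_snapshot():
    return current_root               # capture the root pointer, O(1)

def commit_write(key, value):
    global current_root
    new_pages = dict(current_root)    # copy-on-write: clone, never overwrite
    new_pages[key] = value
    current_root = MappingProxyType(new_pages)

snap_a = begin_snapshot()             # Txn A: snapshot at T=100
commit_write('accounts:1', 900)       # Txn B commits an UPDATE
print(snap_a['accounts:1'])           # 1000: A still sees its snapshot
print(begin_snapshot()['accounts:1']) # 900: new snapshots see B's commit
```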
No Per-Page Read Latches
Readers access immutable snapshots and owned page copies, so they do not take
per-page latches in the storage layer. The current server runtime still uses a
database-wide RwLock, so the real guarantee today is:
- many reads can run together
- writes do not run in parallel with other writes
Current Write Behavior
Two sessions do not currently mutate different rows in parallel. Instead,
the server queues mutating statements behind the database-wide write guard.
lock_timeout applies to that wait today.
This means you should not yet build on assumptions such as:
- row-level deadlock detection
- 40001 serialization_failure retries for ordinary write-write conflicts
- SELECT ... FOR UPDATE / SKIP LOCKED job-queue patterns
Those behaviors are planned, but not implemented yet.
Isolation Levels
AxiomDB currently accepts three wire-visible isolation names:
- READ COMMITTED
- REPEATABLE READ (session default)
- SERIALIZABLE
READ COMMITTED and REPEATABLE READ have distinct snapshot behavior today.
SERIALIZABLE is accepted and stored, but currently uses the same frozen-snapshot
policy as REPEATABLE READ; true SSI is still planned.
READ COMMITTED
Each statement within the transaction sees data committed before that statement began. A second SELECT within the same transaction may see different data if another transaction committed between the two SELECTs.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- sees T=100: balance = 1000
-- Txn B commits: UPDATE accounts SET balance = 900 WHERE id = 1
SELECT balance FROM accounts WHERE id = 1; -- sees T=110: balance = 900 (changed!)
COMMIT;
Use READ COMMITTED when:
- You need maximum concurrency
- It is acceptable for each statement to see the freshest committed data
- You are running analytics that can tolerate non-repeatable reads
REPEATABLE READ (default)
The entire transaction sees the snapshot from the moment BEGIN was executed.
No matter how many other transactions commit, your reads return the same data.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- snapshot at T=100: balance = 1000
-- Txn B commits: UPDATE accounts SET balance = 900 WHERE id = 1
SELECT balance FROM accounts WHERE id = 1; -- still sees T=100: balance = 1000
COMMIT;
Use REPEATABLE READ when:
- You need consistent data across multiple reads in one transaction
- Running reports or multi-step calculations where consistency matters
- Implementing optimistic locking patterns
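The two policies differ only in when the snapshot is taken: per statement versus per transaction. A minimal sketch in which a snapshot is a copy of the committed state:

```python
committed = {'balance': 1000}            # latest committed state

class ReadCommittedTxn:
    def select(self, key):
        return dict(committed)[key]      # fresh snapshot per STATEMENT

class RepeatableReadTxn:
    def __init__(self):
        self.snapshot = dict(committed)  # one snapshot, taken at BEGIN
    def select(self, key):
        return self.snapshot[key]

rc, rr = ReadCommittedTxn(), RepeatableReadTxn()
print(rc.select('balance'), rr.select('balance'))   # 1000 1000
committed['balance'] = 900               # another session commits an UPDATE
print(rc.select('balance'))              # 900: READ COMMITTED sees it
print(rr.select('balance'))              # 1000: REPEATABLE READ does not
```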
Isolation Level Comparison
| Phenomenon | READ COMMITTED | REPEATABLE READ |
|---|---|---|
| Dirty reads | Never | Never |
| Non-repeatable reads | Possible | Never |
| Phantom reads | Possible | Prevented by current single-writer runtime |
| Concurrent writes | Serialized globally | Serialized globally |
SERIALIZABLE
SERIALIZABLE is accepted for MySQL/PostgreSQL compatibility, but today it
uses the same frozen snapshot as REPEATABLE READ. The engine does not yet
run Serializable Snapshot Isolation conflict tracking.
E-commerce Checkout — Current Safe Pattern
Until row-level locking lands, the supported stock-reservation pattern is a
guarded UPDATE ... WHERE stock >= ? plus affected-row checks.
BEGIN;
-- Reserve stock atomically; application checks that each UPDATE affects 1 row.
UPDATE products SET stock = stock - 2 WHERE id = 1 AND stock >= 2;
UPDATE products SET stock = stock - 1 WHERE id = 3 AND stock >= 1;
-- Create the order header
INSERT INTO orders (user_id, total, status)
VALUES (99, 149.97, 'paid');
-- Create order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES
(LAST_INSERT_ID(), 1, 2, 49.99),
(LAST_INSERT_ID(), 3, 1, 49.99);
COMMIT;
If any step fails (constraint violation, connection drop, server crash), the WAL ensures the entire transaction is rolled back on recovery.
Transaction Performance Tips
- Keep transactions short. Long-running transactions hold MVCC versions in memory longer, increasing memory pressure.
- Avoid user interaction within a transaction. Never open a transaction and wait for a user to click a button.
- For bulk inserts into clustered tables, wrap all rows in a single BEGIN ... COMMIT block. Phase 40.1 introduces ClusteredInsertBatch: rows are staged in memory, sorted by primary key, and flushed at COMMIT using the rightmost-leaf batch append path. This reduces O(N) CoW page-clone operations to O(N / leaf_capacity) page writes — delivering 55.9K rows/s for 50K sequential-PK rows vs MySQL 8.0 InnoDB's ~35K rows/s (+59%).
- For bulk loads, consider committing every 50,000–100,000 rows to limit WAL growth while keeping the batch-insert speedup.
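The page-write arithmetic behind ClusteredInsertBatch can be checked directly: N per-row CoW clones versus ceil(N / leaf_capacity) batched leaf flushes. This is an illustrative model; the leaf capacity used here is an assumed number, not a real config value:

```python
import math

def page_writes_per_row(n_rows):
    return n_rows                      # unbatched: one CoW page clone per row

def page_writes_batched(n_rows, leaf_capacity):
    # rows staged in memory, sorted by PK, flushed leaf-by-leaf at COMMIT
    return math.ceil(n_rows / leaf_capacity)

n, cap = 50_000, 200                   # cap = assumed rows per leaf
print(page_writes_per_row(n))          # 50000
print(page_writes_batched(n, cap))     # 250
```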
WAL Fsync Pipeline — Current Server Commit Path
Every durable DML commit still needs WAL fsync, but AxiomDB no longer relies on the old timer-based group-commit window for batching. The server now uses an always-on leader-based fsync pipeline:
- one connection becomes the fsync leader
- later commits queue behind that leader if their WAL entry is already buffered
- if the leader’s fsync covers a later commit’s LSN, that later commit returns without paying another fsync
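The leader rule reduces to an LSN comparison: one fsync covers every commit whose WAL entry was already buffered at or below the LSN the leader flushed. A single-threaded sketch of that batching rule; the real pipeline is concurrent:

```python
class FsyncPipeline:
    def __init__(self):
        self.buffered_lsn = 0   # highest WAL LSN written to the OS buffer
        self.durable_lsn = 0    # highest LSN known to be on disk
        self.fsync_calls = 0

    def append(self, lsn):
        self.buffered_lsn = max(self.buffered_lsn, lsn)

    def commit(self, lsn):
        """Return True if this commit had to lead an fsync itself."""
        if lsn <= self.durable_lsn:
            return False        # a leader's earlier fsync already covered us
        self.fsync_calls += 1   # become the leader: one fsync for the batch
        self.durable_lsn = self.buffered_lsn
        return True

p = FsyncPipeline()
for lsn in (1, 2, 3):
    p.append(lsn)               # three commits buffer their WAL entries
led = [p.commit(lsn) for lsn in (1, 2, 3)]
print(led)                      # [True, False, False]: one leader, two free riders
print(p.fsync_calls)            # 1
```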
Catalog and Schema Introspection
AxiomDB maintains an internal catalog that records logical databases, tables, columns, and indexes. The catalog is persisted in system heaps rooted from the meta page and is exposed through convenience commands plus catalog-backed SQL resolution.
Databases
Fresh databases always bootstrap a default logical database named axiomdb.
Existing databases created before multi-database support are upgraded lazily on
open and their legacy tables remain owned by axiomdb.
SHOW DATABASES;
Example output:
| Database |
|---|
| axiomdb |
| analytics |
CREATE DATABASE analytics;
USE analytics;
SELECT DATABASE();
Expected result:
| DATABASE() |
|---|
| analytics |
Tables created before CREATE DATABASE existed remain visible under the default database axiomdb. You do not need to rewrite old table names just to adopt SHOW DATABASES and USE.
System Tables
The catalog exposes six system tables in the axiom schema. They are always readable
without any special privileges.
| Table | Purpose |
|---|---|
axiom_tables | One row per user table |
axiom_columns | One row per column |
axiom_indexes | One row per index (logical metadata; clustered PK rows may reuse the table root) |
axiom_constraints | Named CHECK constraints |
axiom_foreign_keys | FK constraint definitions |
axiom_stats | Per-column NDV and row_count for the query planner |
axiom_tables
Contains one row per user-visible table.
Phase 39.13 adds physical-layout metadata to these rows even though the
introspection surface is still being expanded. The important rule today is:
- explicit PRIMARY KEY table → clustered table root
- no explicit PRIMARY KEY table → heap table root
The catalog now keeps that distinction even before clustered DML is exposed.
| Column | Type | Description |
|---|---|---|
id | BIGINT | Internal table identifier (table_id) |
schema_name | TEXT | Schema name (public by default) |
table_name | TEXT | Name of the table |
column_count | INT | Number of columns |
created_at | BIGINT | LSN at which the table was created |
-- List all user tables
SELECT schema_name, table_name, column_count
FROM axiom_tables
ORDER BY schema_name, table_name;
axiom_columns
Contains one row per column, in declaration order.
| Column | Type | Description |
|---|---|---|
table_id | BIGINT | Foreign key → axiom_tables.id |
table_name | TEXT | Denormalized table name for convenience |
col_index | INT | Zero-based position within the table |
col_name | TEXT | Column name |
data_type | TEXT | SQL type name (e.g., TEXT, BIGINT, DECIMAL) |
not_null | BOOL | TRUE if declared NOT NULL |
default_value | TEXT | DEFAULT expression as a string, or NULL if none |
-- All columns of the orders table
SELECT col_index, col_name, data_type, not_null, default_value
FROM axiom_columns
WHERE table_name = 'orders'
ORDER BY col_index;
axiom_indexes
Contains one row per index (including automatically generated PK and UNIQUE indexes).
| Column | Type | Description |
|---|---|---|
id | BIGINT | Internal index identifier |
table_id | BIGINT | Foreign key → axiom_tables.id |
table_name | TEXT | Denormalized table name |
index_name | TEXT | Index name |
is_unique | BOOL | TRUE for UNIQUE and PRIMARY KEY indexes |
is_primary | BOOL | TRUE for the PRIMARY KEY index |
columns | TEXT | Comma-separated list of indexed column names |
root_page_id | BIGINT | Page ID of the index root; clustered PRIMARY KEY metadata reuses the table root |
-- All indexes on the products table
SELECT index_name, is_unique, is_primary, columns
FROM axiom_indexes
WHERE table_name = 'products'
ORDER BY is_primary DESC, index_name;
Convenience Commands
SHOW DATABASES
Lists all logical databases persisted in the catalog.
SHOW DATABASES;
USE
Changes the selected database for the current connection. Unqualified table names are resolved inside that database.
USE analytics;
SHOW TABLES;
If the database does not exist, AxiomDB returns MySQL error 1049:
USE missing_db;
-- ERROR 1049 (42000): Unknown database 'missing_db'
SHOW TABLES
Lists all tables in the current schema.
SHOW TABLES;
Example output:
| Table name |
|---|
| accounts |
| order_items |
| orders |
| products |
| users |
SHOW TABLES LIKE
Filters by a LIKE pattern.
SHOW TABLES LIKE 'order%';
| Table name |
|---|
| order_items |
| orders |
DESCRIBE (or DESC)
Shows the column structure of a table.
DESCRIBE users;
-- or:
DESC products;
Example output:
| Column | Type | Null | Key | Default |
|---|---|---|---|---|
| id | BIGINT | NO | PRI | AUTO_INCREMENT |
| email | TEXT | NO | UNI | |
| name | TEXT | NO | | |
| age | INT | YES | | |
| created_at | TIMESTAMP | NO | CURRENT_TIMESTAMP |
Introspection Queries
Because the catalog is exposed as regular tables, you can write arbitrary SQL against it.
Find all NOT NULL columns across all tables
SELECT table_name, col_name, data_type
FROM axiom_columns
WHERE not_null = TRUE
ORDER BY table_name, col_index;
Find tables with no indexes
SELECT t.table_name
FROM axiom_tables t
LEFT JOIN axiom_indexes i ON i.table_id = t.id
WHERE i.id IS NULL
ORDER BY t.table_name;
Find foreign key columns that lack an index
-- Assumes FK columns follow the naming convention: <table>_id
SELECT c.table_name, c.col_name
FROM axiom_columns c
LEFT JOIN axiom_indexes i
ON i.table_id = c.table_id
AND i.columns LIKE c.col_name || '%'
WHERE c.col_name LIKE '%_id'
AND c.col_name <> 'id'
AND i.id IS NULL
ORDER BY c.table_name, c.col_name;
Column count per table
SELECT table_name, column_count
FROM axiom_tables
ORDER BY column_count DESC;
Catalog Bootstrap
The catalog is bootstrapped on the very first open() call. AxiomDB allocates the
catalog roots, inserts the default database axiomdb, and makes the catalog durable
before the database accepts traffic. Subsequent opens detect the initialized roots
and skip the bootstrap path.
The bootstrap is idempotent: if AxiomDB crashes during bootstrap, the incomplete
transaction has no COMMIT record in the WAL, so crash recovery discards it and
the next open() re-runs the bootstrap from scratch.
Schema Visibility Rules
The default schema is public. All tables created without an explicit schema prefix
belong to public. System tables live in the axiom schema and are always visible.
-- These are equivalent if the default schema is 'public'
CREATE TABLE users (...);
CREATE TABLE public.users (...);
-- System tables are accessible with or without the axiom. prefix
SELECT * FROM axiom_tables; -- works
SELECT * FROM axiom.axiom_tables; -- also works
Indexes
Indexes are B+ Tree data structures that allow AxiomDB to find rows matching a
condition without scanning the entire table. Every index is a Copy-on-Write B+ Tree
stored in the same .db file as the table data.
Current Storage Model
Today AxiomDB exposes two SQL-visible table layouts:
- Tables without an explicit PRIMARY KEY still use the classic heap + index path.
- Tables with an explicit PRIMARY KEY now bootstrap clustered storage at CREATE TABLE time.
Through Phase 39.18, that clustered SQL boundary has widened:
- the table root is clustered from day one
- PRIMARY KEY catalog metadata points at that clustered root
- INSERT on clustered tables works through the clustered PK tree
- SELECT on clustered tables works through the clustered PK tree and clustered secondary bookmarks
- UPDATE on clustered tables rewrites rows directly in the clustered PK tree
- DELETE on clustered tables applies a delete-mark through the clustered PK tree
- VACUUM table_name on clustered tables physically purges safe dead rows, frees overflow chains, and cleans dead secondary bookmarks
- ALTER TABLE legacy_table REBUILD migrates legacy heap + PRIMARY KEY tables into the clustered layout and rebuilds secondary indexes as PK-bookmark indexes
Internally, the storage rewrite already has clustered insert, point lookup,
range scan, same-leaf update, delete-mark, structural rebalance / relocate-update,
secondary PK bookmarks, and overflow-backed clustered rows for large payloads,
and explicit-PK CREATE TABLE now records that layout in SQL metadata.
Phase 39.14 made the first executor-visible clustered write cut, 39.15
opened the read side, 39.16 brought UPDATE onto that same clustered path,
and 39.17 now does the same for logical clustered DELETE: PK lookups/ranges,
clustered secondary bookmark probes, in-place delete-mark, and rollback-safe
WAL all stay on clustered storage. 39.18 closes the first clustered
maintenance slice too: VACUUM now purges physically dead clustered cells and
their overflow/secondary debris instead of leaving clustered cleanup as a
future-only promise.
That internal rewrite is still honest about its current boundary:
- relocate-update rewrites only the current inline version
- clustered delete is still delete-mark first, with physical removal deferred to a later VACUUM
- large clustered rows can already spill to overflow pages internally, but on the SQL side only explicit-PK tables expose the clustered layout at DDL time
- clustered covering reads still degrade to fetching the clustered row body; a true clustered index-only optimization is still future work
- clustered child-table foreign-key enforcement remains future work
SQLite's WITHOUT ROWID tables and InnoDB both treat the clustered key as the row-storage identity. AxiomDB now does the same for SQL-visible clustered INSERT: no heap fallback row is created, and non-primary indexes store PK bookmarks instead of heap-era RecordId payloads.
Index Statistics and Query Planner
AxiomDB maintains per-column statistics to help the query planner choose between an index scan and a full table scan.
How it works
When you create an index, AxiomDB automatically computes:
- row_count — total visible rows in the table
- ndv (number of distinct values) — exact count of distinct non-NULL values
The planner uses selectivity = 1 / ndv for equality predicates. If more than
20% of the table's rows would be returned, a full table scan is cheaper than an
index scan, so the planner chooses the table scan.
ndv = 3, rows = 10,000 → selectivity = 33% > 20% → Scan
ndv = 100, rows = 10,000 → selectivity = 1% < 20% → Index
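The rule above can be sketched as a small pure function. This is a simplified model of the planner's decision, not the engine's actual code; `choose_access_path` and `AccessPath` are illustrative names:

```rust
// Simplified model of the equality-predicate access-path choice.
#[derive(Debug, PartialEq)]
enum AccessPath {
    IndexLookup,
    FullScan,
}

fn choose_access_path(row_count: u64, ndv: u64) -> AccessPath {
    if ndv == 0 {
        return AccessPath::FullScan; // no statistics: be conservative
    }
    // An equality predicate returns roughly row_count / ndv rows.
    let estimated_rows = row_count as f64 / ndv as f64;
    if estimated_rows > 0.20 * row_count as f64 {
        AccessPath::FullScan // more than 20% of the table: scan is cheaper
    } else {
        AccessPath::IndexLookup
    }
}

fn main() {
    // ndv = 3:   selectivity 33% > 20% -> Scan
    assert_eq!(choose_access_path(10_000, 3), AccessPath::FullScan);
    // ndv = 100: selectivity 1% < 20% -> Index
    assert_eq!(choose_access_path(10_000, 100), AccessPath::IndexLookup);
}
```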
ANALYZE command
Run ANALYZE to refresh statistics after bulk inserts or deletes:
-- Analyze a specific table (all indexed columns)
ANALYZE TABLE users;
-- Analyze a specific column only
ANALYZE TABLE orders (status);
Statistics are automatically computed at CREATE INDEX time. Run ANALYZE when:
- Significant data was added after the index was created
- Query plans seem wrong (e.g., full scan when index would be faster)
Automatic staleness detection
After enough row changes (>20% of the analyzed row count), the planner
automatically uses conservative defaults (ndv = 200) until the next ANALYZE.
This prevents stale statistics from causing poor query plans.
Composite Indexes
A composite index covers two or more columns. The query planner uses it when the WHERE clause contains equality conditions on the leading columns.
CREATE INDEX idx_user_status ON orders(user_id, status);
-- Uses composite index: both leading columns matched
SELECT * FROM orders WHERE user_id = 42 AND status = 'active';
-- Also uses index via prefix scan: leading column only
SELECT * FROM orders WHERE user_id = 42;
-- Does NOT use index: leading column absent from WHERE
SELECT * FROM orders WHERE status = 'active';
Fill Factor
Fill factor controls how full a B-Tree leaf page is allowed to get before it splits. A lower fill factor leaves intentional free space on each page, reducing split frequency for workloads that add rows after index creation.
-- Append-heavy time-series table: pages fill to 70% before splitting.
CREATE INDEX idx_ts ON events(created_at) WITH (fillfactor = 70);
-- Compact read-only index: fill pages completely.
CREATE UNIQUE INDEX uq_email ON users(email) WITH (fillfactor = 100);
-- Default (90%) — equivalent to omitting WITH:
CREATE INDEX idx_x ON t(x);
Range and default
Valid range: 10–100. Default: 90 (matches PostgreSQL’s
BTREE_DEFAULT_FILLFACTOR). fillfactor = 100 reproduces the current behavior
exactly — pages fill completely before splitting.
Effect on splits
With fillfactor = F:
- A leaf page splits when it reaches ⌈F × ORDER_LEAF / 100⌉ entries (instead of at full capacity).
- After a split, each new page holds roughly F/2 % of capacity, leaving room for future inserts without triggering another split.
- Internal pages always fill to capacity (not user-configurable).
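The split threshold is a simple ceiling division. A minimal sketch, using the ORDER_LEAF = 217 constant documented later in this page (the helper function is illustrative, not engine code):

```rust
// Split threshold from the formula ⌈F × ORDER_LEAF / 100⌉.
const ORDER_LEAF: u64 = 217; // (key, RecordId) pairs per leaf node

fn leaf_split_threshold(fillfactor: u64) -> u64 {
    assert!((10..=100).contains(&fillfactor), "valid range is 10..=100");
    // Integer ceiling division of fillfactor * ORDER_LEAF by 100.
    (fillfactor * ORDER_LEAF + 99) / 100
}

fn main() {
    assert_eq!(leaf_split_threshold(100), 217); // fill completely before split
    assert_eq!(leaf_split_threshold(90), 196);  // default: ceil(195.3)
    assert_eq!(leaf_split_threshold(70), 152);  // ceil(151.9)
}
```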
Automatic Indexes
AxiomDB automatically creates a unique B+ Tree index for:
- Every PRIMARY KEY declaration
- Every UNIQUE column constraint or UNIQUE table constraint
For clustered tables, the automatically created PRIMARY KEY metadata row reuses
the clustered table root instead of allocating a second heap-era PK tree.
UNIQUE secondary indexes still allocate ordinary B+ Tree roots, but 39.14
now maintains their entries as secondary_key ++ pk_suffix bookmarks during
SQL-visible clustered INSERT.
Multi-row INSERT on Indexed Tables
Multi-row INSERT ... VALUES (...), (...) statements now stay on a grouped
heap/index path even when the target table already has a PRIMARY KEY or
secondary indexes.
INSERT INTO users VALUES
(1, 'a@example.com'),
(2, 'b@example.com'),
(3, 'c@example.com');
This matters because indexed tables used to fall back to per-row maintenance on this workload. The grouped path keeps the same SQL-visible behavior:
- duplicate PRIMARY KEY / UNIQUE values inside the same statement still fail
- a failed multi-row statement does not leak partially committed rows
- partial indexes still include only rows whose predicate matches
Startup Integrity Verification
When a database opens, AxiomDB verifies every catalog-visible index against the heap-visible rows reconstructed after WAL recovery.
- If the tree is readable but its contents diverge from the heap, AxiomDB rebuilds the index automatically from table contents before serving traffic.
- If the tree cannot be traversed safely, open fails with IndexIntegrityFailure instead of guessing.
This check applies to both embedded mode and server mode because both call the same startup verifier.
This combines amcheck's “never trust an unreadable B-Tree” rule
with SQLite's REINDEX-style rebuild-from-table approach. Readable divergence is
healed automatically from heap data; unreadable trees still block open.
Creating Indexes Manually
CREATE [UNIQUE] INDEX index_name ON table_name (col1 [ASC|DESC], col2 ...);
CREATE INDEX idx_users_name ON users (name);
CREATE INDEX idx_orders_user ON orders (user_id, placed_at DESC);
CREATE UNIQUE INDEX uq_sku ON products (sku);
See DDL — CREATE INDEX for the full syntax.
When Indexes Help
The query planner considers an index when:
- The leading column(s) of the index appear in a WHERE equality or range condition.
- The index columns match the ORDER BY direction and order (avoids a sort step).
- The index is selective enough that scanning it is cheaper than a full table scan.
-- Good: leading column (user_id) used in WHERE
CREATE INDEX idx_orders_user ON orders (user_id, placed_at DESC);
SELECT * FROM orders WHERE user_id = 42 ORDER BY placed_at DESC;
-- Bad: leading column not in WHERE — index not used
SELECT * FROM orders WHERE placed_at > '2026-01-01';
-- Solution: create a separate index on placed_at
CREATE INDEX idx_orders_date ON orders (placed_at);
Composite Index Column Order
The order of columns in a composite index determines which query patterns it
accelerates. The B+ Tree is sorted by the concatenated key (col1, col2, ...).
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
This index accelerates:
- WHERE user_id = 42
- WHERE user_id = 42 AND status = 'paid'
This index does NOT accelerate:
- WHERE status = 'paid' (leading column not constrained)
Rule of thumb: put the highest-selectivity, most frequently filtered column first.
Partial Indexes
A partial index covers only the rows matching a WHERE predicate. This reduces index
size and maintenance cost.
-- Index only pending orders (the common access pattern)
CREATE INDEX idx_pending_orders ON orders (user_id)
WHERE status = 'pending';
-- Index only non-deleted users
CREATE INDEX idx_active_users ON users (email)
WHERE deleted_at IS NULL;
The query planner uses a partial index only when the query’s WHERE clause implies the index’s predicate.
Index Key Size Limit
The B+ Tree stores encoded keys up to 768 bytes. For most column types this is never an issue:
- INT, BIGINT, UUID, TIMESTAMP — fixed-size, always well under the limit.
- TEXT, VARCHAR — a 760-character value will just fit. If you index a column with very long strings (> 750 characters), rows exceeding the limit are silently skipped at CREATE INDEX time, and INSERT returns IndexKeyTooLong.
Query Planner — Phase 6.3
The planner rewrites the execution plan before running the scan. Currently recognized patterns:
Equality lookup — exact match on the leading indexed column:
-- Uses B-Tree point lookup (O(log n) instead of O(n))
SELECT * FROM users WHERE email = 'alice@example.com';
SELECT * FROM orders WHERE id = 42;
This includes the PRIMARY KEY. A query like WHERE id = 42 does not need a
redundant secondary index on id.
Range scan — upper and lower bound on the leading indexed column:
-- Uses B-Tree range scan
SELECT * FROM orders WHERE created_at > '2024-01-01' AND created_at < '2025-01-01';
SELECT * FROM products WHERE price >= 10.0 AND price <= 50.0;
Full scan fallback — any pattern not recognized above:
-- Falls back to full table scan (no index for OR, function, or non-leading column)
SELECT * FROM users WHERE email LIKE '%gmail.com';
SELECT * FROM orders WHERE status = 'paid' OR total > 1000;
Partial Indexes
A partial index covers only the rows matching a WHERE predicate. This reduces index size, speeds up maintenance, and — for UNIQUE indexes — restricts uniqueness enforcement to the matching subset.
-- Only active users need unique emails.
CREATE UNIQUE INDEX uq_active_email ON users(email) WHERE deleted_at IS NULL;
-- Index only pending orders for fast user lookups.
CREATE INDEX idx_pending ON orders(user_id) WHERE status = 'pending';
Partial UNIQUE indexes
The uniqueness constraint applies only among rows satisfying the predicate. Rows that do not satisfy the predicate are never inserted into the index.
-- alice deleted, then re-created: no conflict.
INSERT INTO users VALUES (1, 'alice@x.com', '2025-01-01'); -- deleted
INSERT INTO users VALUES (2, 'alice@x.com', NULL); -- active ✅
INSERT INTO users VALUES (3, 'alice@x.com', NULL); -- ❌ UniqueViolation
INSERT INTO users VALUES (4, 'alice@x.com', '2025-06-01'); -- deleted ✅
Planner support
The planner uses a partial index only when the query’s WHERE clause implies the index predicate. If the implication cannot be verified, the planner falls back to a full scan or a full index — always correct.
-- Uses partial index (WHERE contains `deleted_at IS NULL`):
SELECT * FROM users WHERE email = 'alice@x.com' AND deleted_at IS NULL;
-- Falls back to full scan (predicate not in WHERE):
SELECT * FROM users WHERE email = 'alice@x.com';
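A toy version of that implication test: here the planner may use the partial index only when the query's WHERE conjuncts literally contain the index predicate. The real check is richer; the function and its names are illustrative, not the engine's API:

```rust
// Conservative implication check: exact-conjunct match only.
// Anything it cannot prove falls back to a full scan, which is always correct.
fn where_implies_predicate(query_conjuncts: &[&str], index_predicate: &str) -> bool {
    query_conjuncts.contains(&index_predicate)
}

fn main() {
    let predicate = "deleted_at IS NULL";
    // Query repeats the predicate: partial index is usable.
    assert!(where_implies_predicate(
        &["email = 'alice@x.com'", "deleted_at IS NULL"],
        predicate
    ));
    // Predicate absent from WHERE: fall back to a full scan.
    assert!(!where_implies_predicate(&["email = 'alice@x.com'"], predicate));
}
```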
A partial unique index on the active subset (WHERE deleted_at IS NULL) is
typically 10–100× smaller than a full unique index, since most rows in
high-churn tables are in the deleted state. This reduces build time,
per-INSERT maintenance cost, and Bloom filter memory. MySQL InnoDB does not
support partial indexes, so this optimization is not available there.
Foreign Key Constraints
Foreign key constraints ensure referential integrity between tables. Every non-NULL value in the FK column of the child table must reference an existing row in the parent table.
-- Inline REFERENCES syntax
CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT REFERENCES users(id) ON DELETE CASCADE
);
-- Table-level FOREIGN KEY syntax
CREATE TABLE order_items (
id INT PRIMARY KEY,
order_id INT,
product_id INT,
CONSTRAINT fk_order FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
CONSTRAINT fk_product FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE RESTRICT
);
-- Add FK after the fact
ALTER TABLE orders
ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id);
-- Remove a FK constraint
ALTER TABLE orders DROP CONSTRAINT fk_user;
ON DELETE Actions
| Action | Behavior |
|---|---|
| RESTRICT / NO ACTION (default) | Error if child rows reference the deleted parent row |
| CASCADE | Automatically delete all child rows (recursive, max depth 10) |
| SET NULL | Set child FK column to NULL (column must be nullable) |
Enforcement Examples
CREATE TABLE users (id INT PRIMARY KEY, email TEXT);
CREATE TABLE orders (id INT PRIMARY KEY, user_id INT REFERENCES users(id) ON DELETE CASCADE);
INSERT INTO users VALUES (1, 'alice@x.com');
INSERT INTO orders VALUES (10, 1); -- ✅ user 1 exists
-- INSERT with missing parent → error
INSERT INTO orders VALUES (20, 999);
-- ERROR 23503: Foreign key constraint fails: 'orders.user_id' = '999'
-- DELETE parent with CASCADE → child rows automatically deleted
DELETE FROM users WHERE id = 1;
SELECT COUNT(*) FROM orders; -- → 0 (orders were cascaded)
-- DELETE parent with RESTRICT (default) → blocked if children exist
CREATE TABLE invoices (id INT PRIMARY KEY, order_id INT REFERENCES orders(id));
INSERT INTO users VALUES (2, 'bob@x.com');
INSERT INTO orders VALUES (30, 2);
INSERT INTO invoices VALUES (1, 30);
DELETE FROM orders WHERE id = 30;
-- ERROR 23503: foreign key constraint "fk_invoices_order_id": invoices.order_id references this row
NULL FK Values
A NULL value in a FK column is always allowed — it does not reference any parent row. This follows SQL standard MATCH SIMPLE semantics.
INSERT INTO orders VALUES (99, NULL); -- ✅ NULL user_id is always allowed
ON UPDATE
Only ON UPDATE RESTRICT (the default) is enforced. Updating a parent key while
child rows reference it is rejected. ON UPDATE CASCADE and ON UPDATE SET NULL
are planned for Phase 6.10.
Current Limitations
- Only single-column FKs are supported. Composite FKs — FOREIGN KEY (a, b) REFERENCES t(x, y) — are planned for Phase 6.10.
- ON UPDATE CASCADE / ON UPDATE SET NULL are planned for Phase 6.10.
- FK validation uses B-Tree range scans via the FK auto-index (Phase 6.9); pre-6.9 FKs fall back to a full table scan.
Bloom Filter Optimization
AxiomDB maintains an in-memory Bloom filter for each secondary index. The filter allows the query executor to skip B-Tree page reads entirely when a lookup key is definitively absent from the index.
How It Works
When the planner chooses an index lookup for a WHERE col = value condition,
the executor checks the Bloom filter before touching the B-Tree:
- Filter says no → key is 100% absent. Zero B-Tree pages read. Empty result returned immediately.
- Filter says maybe → normal B-Tree lookup proceeds.
The filter is a probabilistic data structure: it never produces false negatives (a key that exists will always get a “maybe”), but can produce false positives (a key that does not exist may occasionally get a “maybe” instead of “no”). The false positive rate is tuned to 1% — at most 1 in 100 absent-key lookups will still read the B-Tree.
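The no-false-negatives guarantee can be sketched with a minimal filter. The bit count and two-probe scheme below are illustrative choices, not AxiomDB's actual parameters:

```rust
// Minimal Bloom filter: a key that was inserted always answers "maybe".
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
}

impl BloomFilter {
    fn new(nbits: usize) -> Self {
        BloomFilter { bits: vec![false; nbits] }
    }

    // Derive two probe positions from one 64-bit hash.
    fn probes(&self, key: &str) -> [usize; 2] {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let v = h.finish();
        [(v as usize) % self.bits.len(), ((v >> 32) as usize) % self.bits.len()]
    }

    fn insert(&mut self, key: &str) {
        for p in self.probes(key) {
            self.bits[p] = true;
        }
    }

    /// false => definitely absent; true => maybe present.
    fn might_exist(&self, key: &str) -> bool {
        self.probes(key).iter().all(|&p| self.bits[p])
    }
}

fn main() {
    let mut f = BloomFilter::new(1024);
    f.insert("alice@example.com");
    // Never a false negative: an inserted key always gets "maybe".
    assert!(f.might_exist("alice@example.com"));
}
```

A "no" answer skips the B-Tree entirely; a "maybe" proceeds to the normal lookup, which is why occasional false positives cost only a wasted read, never a wrong result.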
Lifecycle
| Event | Effect on Bloom filter |
|---|---|
| CREATE INDEX | Filter created and populated with all existing keys |
| INSERT | New key added to filter |
| UPDATE | Old key marks filter dirty; new key added |
| DELETE | Filter marked dirty (deleted keys cannot be removed from a standard Bloom filter) |
| DROP INDEX | Filter removed from memory |
| Server restart | Filters start empty; might_exist returns true (conservative) until CREATE INDEX is run again |
Dirty Filters
After a DELETE or UPDATE, the filter is marked dirty: it may still
return “maybe” for keys that were deleted. This does not affect correctness —
the B-Tree lookup simply finds no matching row. It only means that some absent
keys may not benefit from the zero-I/O shortcut until the filter is rebuilt via
ANALYZE TABLE (available since Phase 6.12).
Run ANALYZE TABLE t periodically to rebuild the filter and restore
optimal miss performance.
Dropping an Index
-- MySQL syntax (required when the server is in MySQL wire protocol mode)
DROP INDEX index_name ON table_name;
DROP INDEX IF EXISTS idx_old ON table_name;
Dropping an index frees all B-Tree pages, reclaiming disk space immediately.
Dropping an index that backs a PRIMARY KEY or UNIQUE constraint requires dropping the
constraint first (via ALTER TABLE DROP CONSTRAINT).
Index Introspection
-- All indexes on a table
SELECT index_name, is_unique, is_primary, columns
FROM axiom_indexes
WHERE table_name = 'orders'
ORDER BY is_primary DESC, index_name;
-- Root page of each index (useful for storage analysis)
SELECT index_name, root_page_id
FROM axiom_indexes;
Index-Only Scans (Covering Indexes)
When every column referenced by a SELECT is already stored as a key column of
the chosen index, AxiomDB can satisfy the query entirely from the B-Tree — no
heap page read is needed. This is called an index-only scan.
Example
CREATE INDEX idx_age ON users (age);
-- Index-only scan: only column needed (age) is the index key.
SELECT age FROM users WHERE age = 25;
The executor reads the matching B-Tree leaf entries, extracts the age value
from the encoded key bytes, and returns the rows without ever touching the heap.
INCLUDE syntax — declaring covering intent
You can declare additional columns as part of a covering index using the
INCLUDE clause:
CREATE INDEX idx_name_dept ON employees (name) INCLUDE (department, salary);
INCLUDE columns are recorded in the catalog metadata so the planner knows
the index covers those columns. Note: physical storage of INCLUDE column
values in B-Tree leaf nodes is deferred to a future covering-index phase. Until then, the planner
uses INCLUDE to correctly identify IndexOnlyScan opportunities, but the
values are read from the key portion of the B-Tree entry.
MVCC and the 24-byte header read
Index-only scans still perform a lightweight visibility check per row. For each
B-Tree entry, the executor reads only the 24-byte RowHeader (the slot header
containing txn_id_created, txn_id_deleted, and sequence number) to determine
whether the row is visible to the current transaction snapshot. The full row
payload is never decoded.
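An illustrative layout for that 24-byte header, using the field names from the text (the exact on-disk layout and the visibility rule shown are assumptions for illustration):

```rust
// Three u64 fields = 24 bytes, matching the lightweight per-row read.
#[repr(C)]
struct RowHeader {
    txn_id_created: u64, // transaction that created this row version
    txn_id_deleted: u64, // 0 here stands for "not deleted"
    sequence: u64,       // row sequence number
}

/// Sketch of a snapshot check: visible if created at or before the
/// snapshot and not deleted before it.
fn is_visible(h: &RowHeader, snapshot_txn: u64) -> bool {
    h.txn_id_created <= snapshot_txn
        && (h.txn_id_deleted == 0 || h.txn_id_deleted > snapshot_txn)
}

fn main() {
    assert_eq!(std::mem::size_of::<RowHeader>(), 24); // the 24-byte read
    let live = RowHeader { txn_id_created: 5, txn_id_deleted: 0, sequence: 1 };
    assert!(is_visible(&live, 10));
    let gone = RowHeader { txn_id_created: 5, txn_id_deleted: 8, sequence: 2 };
    assert!(!is_visible(&gone, 10)); // deleted before the snapshot
}
```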
Non-Unique Secondary Index Key Format
Non-unique secondary indexes store the indexed column values together with the
row’s RecordId as the B-Tree key:
key = encode_index_key(col_vals) || encode_rid(rid) // 10-byte RecordId suffix
This ensures every B-Tree entry is globally unique even when multiple rows share
the same indexed value — making INSERT safe without a DuplicateKey error.
When looking up all rows with a given indexed value, the executor performs a range scan with synthetic bounds:
lo = encode_index_key(val) || [0x00; 10] // smallest possible RecordId
hi = encode_index_key(val) || [0xFF; 10] // largest possible RecordId
The suffix is the physical RecordId (page_id + slot_id + sequence number)
rather than a separate primary key column, keeping it at a fixed 10 bytes
regardless of the table's key type.
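The key format and the synthetic range bounds above can be sketched with byte vectors. The encodings are illustrative stand-ins for encode_index_key / encode_rid, not the real wire format:

```rust
// Secondary entry = indexed value bytes || fixed 10-byte RecordId suffix.
fn encode_entry(index_key: &[u8], rid: &[u8; 10]) -> Vec<u8> {
    let mut k = index_key.to_vec();
    k.extend_from_slice(rid); // suffix keeps every entry globally unique
    k
}

fn main() {
    let val = b"active"; // stand-in for encode_index_key(col_vals)

    // Synthetic bounds for "all rows with this value":
    let lo = encode_entry(val, &[0x00; 10]); // smallest possible RecordId
    let hi = encode_entry(val, &[0xFF; 10]); // largest possible RecordId

    // Any real entry for this value sorts inside [lo, hi].
    let entry = encode_entry(val, &[0, 0, 0, 42, 0, 1, 0, 0, 0, 7]);
    assert!(lo <= entry && entry <= hi);
}
```

Because `Vec<u8>` compares lexicographically, a B-Tree range scan from `lo` to `hi` visits exactly the entries sharing the indexed value.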
Phase 39.9 adds a second, internal-only secondary-key path for the clustered
rewrite: there the physical entry is secondary_key ++ missing_primary_key_columns
so a future clustered executor can jump from a secondary entry to the owning
PRIMARY KEY row without depending on a heap slot. Phases 39.11 and 39.12
already add internal WAL/rollback and crash recovery for clustered rows by
primary key and exact row image, but that path is still not SQL-visible yet.
B+ Tree Implementation Details
AxiomDB’s B+ Tree is a Copy-on-Write structure backed by the StorageEngine trait.
Key properties:
- ORDER_INTERNAL = 223: up to 223 separator keys and 224 child pointers per internal node
- ORDER_LEAF = 217: up to 217 (key, RecordId) pairs per leaf node
- 16 KB pages: both internal and leaf nodes fit exactly in one page
- AtomicU64 root: root page swapped atomically — readers are lock-free
- CoW semantics: writes copy the path from root to the modified leaf; old versions are visible to concurrent readers until they finish
See B+ Tree Internals for the on-disk format and the derivation of the ORDER constants.
Embedded Mode
AxiomDB can run in-process — inside your application, with no TCP server, no daemon, no network round-trips. This is the SQLite model: the database is a library you link against, not a process you connect to.
The embedded crate ships two APIs:
| API | Language | Use case |
|---|---|---|
| Db | Rust | Native Rust apps, desktop, CLI tools |
| axiomdb_open / axiomdb_query / … | C | C, C++, Python (ctypes), Swift, Kotlin JNI, Unity |
| AsyncDb | Rust + Tokio | Async Rust services |
Since Phase 5.15, Db::open_dsn and axiomdb_open_dsn accept filesystem DSNs and reject remote wire endpoints explicitly.
Build profiles
# Cargo.toml
[dependencies]
axiomdb-embedded = { path = "...", features = ["desktop"] } # default
# axiomdb-embedded = { path = "...", features = ["async-api"] } # + tokio
| Feature | Includes | Binary output |
|---|---|---|
| desktop (default) | Rust sync API + C FFI | .dylib / .so / .dll + .a |
| async-api | + tokio async wrapper | same + async |
| wasm | sync, in-memory (future) | .wasm |
The desktop build produces a ~1.1 MB dynamic library. The server binary (with full wire protocol) is ~2.1 MB. You get a leaner binary by only linking what you need.
Rust API
Opening a database
use axiomdb_embedded::Db;

// Creates ./myapp.db and ./myapp.wal if they don't exist.
// Runs crash recovery automatically if the WAL has uncommitted entries.
// Also verifies every catalog-visible index before returning the handle.
let mut db = Db::open("./myapp.db")?;
let mut db2 = Db::open_dsn("file:/tmp/myapp.db")?;
let mut db3 = Db::open_dsn("axiomdb:///tmp/myapp")?;
Remote DSNs such as postgres://user@127.0.0.1:5432/app are not supported by
embedded mode in Phase 5.15 and return DbError::InvalidDsn.
If startup index verification fails, open returns DbError::IndexIntegrityFailure and the handle is never created.
DDL and DML
db.execute("CREATE TABLE users (id INT NOT NULL, name TEXT, score REAL)")?;

let affected = db.execute("INSERT INTO users VALUES (1, 'Alice', 9.5)")?;
assert_eq!(affected, 1);

let affected = db.execute("UPDATE users SET score = 10.0 WHERE id = 1")?;
assert_eq!(affected, 1);

let affected = db.execute("DELETE FROM users WHERE score < 5.0")?;
SELECT — rows only
let rows = db.query("SELECT * FROM users WHERE score > 8.0")?;
for row in &rows {
    // row is Vec<Value> — one Value per column
    println!("{:?}", row);
}
SELECT — rows + column names
Use query_with_columns when you need the column names at runtime (building a
table display, serializing to JSON, passing headers to a UI component, etc.).
let (columns, rows) = db.query_with_columns("SELECT id, name FROM users")?;
println!("columns: {:?}", columns); // ["id", "name"]
for row in &rows {
    for (col, val) in columns.iter().zip(row.iter()) {
        println!("{col} = {val}");
    }
}
Full QueryResult (metadata + last_insert_id)
use axiomdb_sql::result::QueryResult;

match db.run("INSERT INTO users VALUES (2, 'Bob', 7.2)")? {
    QueryResult::Affected { count, last_insert_id } => {
        println!("inserted {count} row, id = {:?}", last_insert_id);
    }
    QueryResult::Rows { columns, rows } => { /* SELECT */ }
    QueryResult::Empty => { /* DDL */ }
}
Explicit transactions
db.begin()?;
db.execute("INSERT INTO orders VALUES (1, 100.0)")?;
db.execute("UPDATE inventory SET qty = qty - 1 WHERE id = 42")?;
db.commit()?;

// Or:
db.begin()?;
// ... something goes wrong ...
db.rollback()?;
Error handling
match db.query("SELECT * FROM nonexistent") {
    Ok(rows) => { /* ... */ }
    Err(e) => {
        eprintln!("query failed: {e}");
        // Also accessible as a string for display/logging:
        if let Some(msg) = db.last_error() {
            eprintln!("last error: {msg}");
        }
    }
}
Async (Tokio)
use axiomdb_embedded::async_db::AsyncDb;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = AsyncDb::open("./myapp.db").await?;
    let db2 = AsyncDb::open_dsn("file:/tmp/myapp.db").await?;
    db.execute("CREATE TABLE t (id INT)").await?;
    let (columns, rows) = db.query_with_columns("SELECT * FROM t").await?;
    Ok(())
}
AsyncDb wraps Db in an Arc<Mutex<Db>> and runs each operation in
tokio::task::spawn_blocking, keeping the async executor unblocked.
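That hand-off pattern can be sketched with plain threads; tokio's spawn_blocking does essentially the same thing against a managed worker pool. BlockingDb below is a stand-in type for illustration, not the real Db:

```rust
// Share a blocking handle via Arc<Mutex<_>> and run the blocking call
// off the calling thread, returning the result through join().
use std::sync::{Arc, Mutex};
use std::thread;

struct BlockingDb; // stand-in for the synchronous Db handle

impl BlockingDb {
    fn execute(&mut self, _sql: &str) -> usize {
        1 // pretend one row was affected
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(BlockingDb));

    // Clone the Arc into the worker and lock inside the closure,
    // so the spawning thread stays free while the call runs.
    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || shared.lock().unwrap().execute("INSERT ..."))
    };

    assert_eq!(worker.join().unwrap(), 1);
}
```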
Persist and reopen
The database persists on disk. Close it (drop the Db) and reopen it from
another process or session:
{
    let mut db = Db::open("./data.db")?;
    db.execute("CREATE TABLE log (ts BIGINT, msg TEXT)")?;
    db.execute("INSERT INTO log VALUES (1700000000, 'started')")?;
} // db is dropped here — WAL is flushed, file lock released

// Later — in the same process or a different one:
let mut db = Db::open("./data.db")?;
let rows = db.query("SELECT * FROM log")?;
assert_eq!(rows.len(), 1);
C API
Link against libaxiomdb_embedded.{so,dylib,dll} or the static libaxiomdb_embedded.a.
Header
#include "axiomdb.h"
A minimal axiomdb.h to copy into your project:
#pragma once
#include <stdint.h>
#include <stddef.h>
typedef struct AxiomDb AxiomDb;
typedef struct AxiomRows AxiomRows;
/* Type codes — same as SQLite for easy porting */
#define AXIOMDB_NULL 0
#define AXIOMDB_INTEGER 1 /* Bool, Int, BigInt, Date (days), Timestamp (µs) */
#define AXIOMDB_REAL 2 /* Real, Decimal */
#define AXIOMDB_TEXT 3 /* Text, UUID */
#define AXIOMDB_BLOB 4 /* Bytes */
/* Open / close */
AxiomDb* axiomdb_open (const char* path);
AxiomDb* axiomdb_open_dsn (const char* dsn);
void axiomdb_close (AxiomDb* db);
/* Execute DML/DDL — returns rows affected, or -1 on error */
int64_t axiomdb_execute (AxiomDb* db, const char* sql);
/* Query — returns result set, or NULL on error */
AxiomRows* axiomdb_query (AxiomDb* db, const char* sql);
/* Result set accessors */
int64_t axiomdb_rows_count (const AxiomRows* rows);
int32_t axiomdb_rows_columns (const AxiomRows* rows);
const char* axiomdb_rows_column_name (const AxiomRows* rows, int32_t col);
int32_t axiomdb_rows_type (const AxiomRows* rows, int64_t row, int32_t col);
int64_t axiomdb_rows_get_int (const AxiomRows* rows, int64_t row, int32_t col);
double axiomdb_rows_get_double (const AxiomRows* rows, int64_t row, int32_t col);
const char* axiomdb_rows_get_text (const AxiomRows* rows, int64_t row, int32_t col);
const uint8_t* axiomdb_rows_get_blob (const AxiomRows* rows, int64_t row, int32_t col, size_t* len);
void axiomdb_rows_free (AxiomRows* rows);
/* Last error message for this db handle — NULL if last operation succeeded */
const char* axiomdb_last_error (const AxiomDb* db);
Complete example
#include <stdio.h>
#include <stdint.h>
#include "axiomdb.h"
int main(void) {
AxiomDb* db = axiomdb_open("./app.db");
AxiomDb* db2 = axiomdb_open_dsn("file:/tmp/app.db");
if (!db) { fprintf(stderr, "failed to open db\n"); return 1; }
axiomdb_execute(db,
"CREATE TABLE IF NOT EXISTS products ("
" id INT NOT NULL, name TEXT, price REAL, active INTEGER"
")");
axiomdb_execute(db, "INSERT INTO products VALUES (1, 'Widget', 9.99, 1)");
axiomdb_execute(db, "INSERT INTO products VALUES (2, 'Gadget', 24.50, 1)");
axiomdb_execute(db, "INSERT INTO products VALUES (3, 'Donut', 1.25, 0)");
AxiomRows* rows = axiomdb_query(db,
"SELECT id, name, price FROM products WHERE active = 1");
if (!rows) {
fprintf(stderr, "query error: %s\n", axiomdb_last_error(db));
axiomdb_close(db);
return 1;
}
/* Print header */
int32_t ncols = axiomdb_rows_columns(rows);
for (int32_t c = 0; c < ncols; c++) {
printf("%-12s", axiomdb_rows_column_name(rows, c));
}
printf("\n");
/* Print rows */
int64_t nrows = axiomdb_rows_count(rows);
for (int64_t r = 0; r < nrows; r++) {
for (int32_t c = 0; c < ncols; c++) {
switch (axiomdb_rows_type(rows, r, c)) {
case AXIOMDB_INTEGER:
printf("%-12lld", (long long)axiomdb_rows_get_int(rows, r, c));
break;
case AXIOMDB_REAL:
printf("%-12.2f", axiomdb_rows_get_double(rows, r, c));
break;
case AXIOMDB_TEXT:
printf("%-12s", axiomdb_rows_get_text(rows, r, c));
break;
case AXIOMDB_NULL:
printf("%-12s", "NULL");
break;
default:
printf("%-12s", "?");
}
}
printf("\n");
}
axiomdb_rows_free(rows);
axiomdb_close(db);
axiomdb_close(db2);
return 0;
}
Output:
id name price
1 Widget 9.99
2 Gadget 24.50
Type mapping
| SQL type | C accessor | C type |
|---|---|---|
| BOOL | axiomdb_rows_get_int | 0 or 1 |
| INT | axiomdb_rows_get_int | int64_t |
| BIGINT | axiomdb_rows_get_int | int64_t |
| REAL / DOUBLE | axiomdb_rows_get_double | double |
| DECIMAL | axiomdb_rows_get_double | double (may lose precision for >15 digits) |
| TEXT / VARCHAR | axiomdb_rows_get_text | const char* (UTF-8) |
| UUID | axiomdb_rows_get_text | const char* (xxxxxxxx-xxxx-…) |
| DATE | axiomdb_rows_get_int | days since 1970-01-01 |
| TIMESTAMP | axiomdb_rows_get_int | microseconds since 1970-01-01 UTC |
| BLOB / BYTEA | axiomdb_rows_get_blob | const uint8_t* + size_t len |
| NULL | type code = AXIOMDB_NULL | — |
Pointers returned by axiomdb_rows_get_text, axiomdb_rows_get_blob, and axiomdb_rows_column_name are valid until axiomdb_rows_free is called. Copy the data if you need it to outlive the result set.
Python (ctypes)
import ctypes, os
lib = ctypes.CDLL("./libaxiomdb_embedded.dylib") # or .so on Linux
lib.axiomdb_open.restype = ctypes.c_void_p
lib.axiomdb_open.argtypes = [ctypes.c_char_p]
lib.axiomdb_execute.restype = ctypes.c_int64
lib.axiomdb_execute.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
lib.axiomdb_query.restype = ctypes.c_void_p
lib.axiomdb_query.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
lib.axiomdb_rows_count.restype = ctypes.c_int64
lib.axiomdb_rows_count.argtypes = [ctypes.c_void_p]
lib.axiomdb_rows_get_int.restype = ctypes.c_int64
lib.axiomdb_rows_get_int.argtypes = [ctypes.c_void_p, ctypes.c_int64, ctypes.c_int32]
lib.axiomdb_rows_get_text.restype = ctypes.c_char_p
lib.axiomdb_rows_get_text.argtypes = [ctypes.c_void_p, ctypes.c_int64, ctypes.c_int32]
lib.axiomdb_rows_free.argtypes = [ctypes.c_void_p]
lib.axiomdb_close.argtypes = [ctypes.c_void_p]

db = lib.axiomdb_open(b"./app.db")
lib.axiomdb_execute(db, b"CREATE TABLE IF NOT EXISTS t (id INT, name TEXT)")
lib.axiomdb_execute(db, b"INSERT INTO t VALUES (1, 'hello')")

rows = lib.axiomdb_query(db, b"SELECT id, name FROM t")
for r in range(lib.axiomdb_rows_count(rows)):
    id_ = lib.axiomdb_rows_get_int(rows, r, 0)    # INT column -> get_int
    name = lib.axiomdb_rows_get_text(rows, r, 1)  # TEXT column -> get_text
    print(f"id={id_}, name={name.decode()}")
lib.axiomdb_rows_free(rows)
lib.axiomdb_close(db)
Build the shared library
# Dynamic library (.dylib / .so / .dll)
cargo build --release -p axiomdb-embedded
# Static library (.a) — for iOS, embedded systems, Unity AOT
cargo build --release -p axiomdb-embedded
# → target/release/libaxiomdb_embedded.a
# With async support
cargo build --release -p axiomdb-embedded --features async-api
Output files are in target/release/:
- macOS: libaxiomdb_embedded.dylib
- Linux: libaxiomdb_embedded.so
- Windows: axiomdb_embedded.dll
- All platforms: libaxiomdb_embedded.a (static)
Error Reference
AxiomDB returns structured errors with a SQLSTATE code, a human-readable message, and optional detail fields. Understanding these codes allows applications to handle specific failure scenarios correctly (for example: catching a uniqueness violation to show an “email already taken” message rather than a generic error page).
Error Format
Every error from AxiomDB is represented as an ErrorResponse struct with these fields:
| Field | Type | Always present? | Description |
|---|---|---|---|
| sqlstate | string (5 chars) | Yes | SQLSTATE code for programmatic handling (e.g. "23505") |
| severity | string | Yes | "ERROR", "WARNING", or "NOTICE" |
| message | string | Yes | Short human-readable description. Do not parse this — use sqlstate |
| detail | string | Sometimes | Extended context about the failure (offending value, referenced row) |
| hint | string | Sometimes | Actionable suggestion for how to fix the error |
| position | integer | Sometimes | 0-based byte offset of the unexpected token in the SQL string (parse errors only) |
{
"sqlstate": "23505",
"severity": "ERROR",
"message": "unique key violation on index 'users_email_idx'",
"detail": "Key (value)=(alice@example.com) is already present in index users_email_idx.",
"hint": "A row with the same value already exists in index users_email_idx. Use INSERT ... ON CONFLICT to handle duplicates."
}
{
"sqlstate": "42601",
"severity": "ERROR",
"message": "SQL syntax error: unexpected token 'FORM'",
"position": 9
}
Always use sqlstate for programmatic handling. The message text may change between versions; SQLSTATE codes are stable.
When using the MySQL wire protocol, the error is delivered as a MySQL error packet
with the SQLSTATE code in the sql_state field (5 bytes following the # marker).
JSON Error Format
For clients that need structured errors without screen-scraping message strings, AxiomDB
supports a JSON error format that carries all ErrorResponse fields in the MySQL ERR packet:
SET error_format = 'json';
After this, every ERR packet carries a JSON string instead of plain text:
{"code":1064,"sqlstate":"42601","severity":"ERROR","message":"SQL syntax error: unexpected token 'FORM'","position":9}
{"code":1062,"sqlstate":"23505","severity":"ERROR","message":"unique key violation on index 'users_email_idx'","detail":"Key (value)=(alice@example.com) is already present in index users_email_idx."}
Reset to plain text with SET error_format = 'text'. This setting is per-connection and
does not persist after disconnect.
When error_format = 'json' is set, the server serializes the full
ErrorResponse as a JSON string in the ERR packet's message field. This mirrors
how PostgreSQL's ErrorResponse packet carries detail, hint, and position in
separate fields — achieving the same semantics over MySQL's more limited protocol.
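A client can consume both formats by trying JSON first and falling back to plain text. A sketch (the AxiomError dataclass and parse_err_payload are illustrative names, not a shipped client API):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class AxiomError:
    sqlstate: str
    severity: str
    message: str
    detail: Optional[str] = None
    hint: Optional[str] = None
    position: Optional[int] = None

def parse_err_payload(text: str) -> AxiomError:
    """Handle both error_format settings: a JSON object or plain message text."""
    try:
        obj = json.loads(text)
    except ValueError:
        obj = None
    if not isinstance(obj, dict):
        # plain-text ERR packet: no structured fields available
        return AxiomError(sqlstate="HY000", severity="ERROR", message=text)
    return AxiomError(
        sqlstate=obj.get("sqlstate", "HY000"),
        severity=obj.get("severity", "ERROR"),
        message=obj.get("message", ""),
        detail=obj.get("detail"),
        hint=obj.get("hint"),
        position=obj.get("position"),
    )

err = parse_err_payload(
    '{"code":1064,"sqlstate":"42601","severity":"ERROR",'
    '"message":"SQL syntax error: unexpected token \'FORM\'","position":9}'
)
print(err.sqlstate, err.position)  # 42601 9
```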
Integrity Constraint Violations (Class 23)
These errors indicate that an INSERT, UPDATE, or DELETE violated a declared constraint. The application should handle them and return a user-facing message.
Note: the following column-level constraint declarations are not yet enforced at write time:
- NOT NULL — declared columns accept NULL without error
- UNIQUE — duplicate values are allowed
- CHECK — expressions are not evaluated at write time
Accordingly, 23502, 23505, and 23514 are not raised by DML in the current release; enforcement will be added in a future phase. The exception is PRIMARY KEY uniqueness, which is enforced via the B+ tree index.
23505 — unique_violation
A row with the same value already exists in a column or set of columns declared UNIQUE or PRIMARY KEY.
CREATE TABLE users (email TEXT NOT NULL UNIQUE);
INSERT INTO users VALUES ('alice@example.com');
INSERT INTO users VALUES ('alice@example.com'); -- ERROR 23505
The error message identifies both the index and the offending value:
Duplicate entry 'alice@example.com' for key 'users_email_uq'
The detail field (available in JSON format) provides a PostgreSQL-style message:
Key (value)=(alice@example.com) is already present in index users_email_uq.
Typical application response: Show “An account with this email already exists.”
try:
db.execute("INSERT INTO users (email) VALUES (?)", [email])
except AxiomDbError as e:
if e.sqlstate == '23505':
return {"error": "Email already taken"}
raise
23503 — foreign_key_violation
Child insert / update — parent key does not exist
An INSERT or UPDATE references a value in the FK column that has no matching row in the parent table.
INSERT INTO orders (user_id, total) VALUES (99999, 100);
-- ERROR 23503: Foreign key constraint fails: 'orders.user_id' = '99999'
Typical response: Validate that the referenced entity exists before inserting, or surface “Referenced record not found.”
Parent delete — children still reference it (RESTRICT / NO ACTION)
A DELETE on the parent table was blocked because child rows reference the row
being deleted and the FK action is RESTRICT or NO ACTION (the default).
-- orders.user_id REFERENCES users(id) ON DELETE RESTRICT
DELETE FROM users WHERE id = 1;
-- ERROR 23503: foreign key constraint "fk_orders_user": orders.user_id references this row
Typical response: Either delete child rows first, use ON DELETE CASCADE, or
prevent parent deletion in the application layer.
Cascade depth exceeded
A chain of ON DELETE CASCADE constraints exceeded the maximum depth of 10 levels.
-- If table chain A→B→C→...→K (11 levels all with CASCADE) and you delete from A:
DELETE FROM a WHERE id = 1;
-- ERROR 23503: foreign key cascade depth exceeded limit of 10
Typical response: Restructure the schema to reduce cascade depth, or perform the deletes manually level-by-level.
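The depth check amounts to a bounded walk over the cascade graph. A sketch (the schema shape, names, and exact counting of "levels" are my assumptions, not the engine's internal representation):

```python
MAX_CASCADE_DEPTH = 10  # documented limit

def cascade_delete(cascade_children: dict, table: str, level: int = 1) -> list:
    """Collect the tables a delete would touch via ON DELETE CASCADE edges,
    raising once the chain grows past MAX_CASCADE_DEPTH levels."""
    if level > MAX_CASCADE_DEPTH:
        raise RuntimeError("23503: foreign key cascade depth exceeded limit of 10")
    touched = [table]
    for child in cascade_children.get(table, []):
        touched += cascade_delete(cascade_children, child, level + 1)
    return touched

# a -> b -> c: three levels, well within the limit
chain = {"a": ["b"], "b": ["c"]}
print(cascade_delete(chain, "a"))  # ['a', 'b', 'c']
```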
SET NULL on a NOT NULL column
ON DELETE SET NULL is defined on a foreign key column that was declared NOT NULL.
-- orders.user_id is NOT NULL, but ON DELETE SET NULL is declared
DELETE FROM users WHERE id = 1;
-- ERROR 23503: cannot set FK column orders.user_id to NULL: column is NOT NULL
Typical response: Either remove the NOT NULL constraint from the FK column,
or change the action to ON DELETE RESTRICT or ON DELETE CASCADE.
23502 — not_null_violation
An INSERT or UPDATE attempted to store NULL in a NOT NULL column.
INSERT INTO users (name, email) VALUES (NULL, 'bob@example.com');
-- ERROR 23502: null value in column "name" violates not-null constraint
Typical application response: Validate required fields on the client before submitting.
23514 — check_violation
A row failed a CHECK constraint.
INSERT INTO products (name, price) VALUES ('Widget', -5.00);
-- ERROR 23514: new row for relation "products" violates check constraint "chk_price_positive"
Startup / Open Errors
These errors happen before a SQL statement runs. They are returned by
Db::open(...), Db::open_dsn(...), AsyncDb::open(...), or server startup,
so there is no SQLSTATE-bearing result set yet.
IndexIntegrityFailure — open refused because an index is not trustworthy
On every open, AxiomDB now verifies each catalog-visible index against the heap-visible rows reconstructed after WAL recovery.
- If an index is readable but missing entries or contains extra entries, AxiomDB rebuilds it automatically before accepting traffic.
- If the index tree cannot be traversed safely, open fails with
DbError::IndexIntegrityFailure.
Example Rust handling:
#![allow(unused)]
fn main() -> Result<(), axiomdb_core::DbError> {
    match axiomdb_embedded::Db::open("./data.db") {
        Ok(db) => { /* ready */ }
        Err(axiomdb_core::DbError::IndexIntegrityFailure { table, index, reason }) => {
            eprintln!("database refused to open: {table}.{index}: {reason}");
        }
        Err(other) => return Err(other),
    }
    Ok(())
}
This policy follows PostgreSQL amcheck's “fail if the tree is not safely readable”
discipline, but borrows SQLite's “rebuild from table contents” recovery idea for readable
divergence. A readable-but-wrong index is rebuilt automatically; an unreadable tree blocks open.
Cardinality Errors (Class 21)
21000 — cardinality_violation
A scalar subquery returned more than one row. Scalar subqueries (a SELECT used
where a single value is expected) must return exactly one row. Zero rows yield
NULL; more than one row is an error.
-- Suppose users contains Alice and Bob
SELECT (SELECT name FROM users) AS single_name FROM orders;
-- ERROR 21000: subquery must return exactly one row, but returned 2 rows
Fix: add a WHERE condition that makes the result unique, or use LIMIT 1
if you intentionally want only the first row:
-- Safe: guaranteed single row via primary key
SELECT (SELECT name FROM users WHERE id = o.user_id) AS customer_name
FROM orders o;
-- Safe: explicit LIMIT 1 when you want "any one" result
SELECT (SELECT name FROM users ORDER BY created_at LIMIT 1) AS oldest_user
FROM config;
try:
db.execute("SELECT (SELECT name FROM users) FROM orders")
except AxiomDbError as e:
if e.sqlstate == '21000':
# The subquery returned multiple rows — add a WHERE clause
...
Undefined Object Errors (Class 42)
These errors indicate a reference to an object (table, column, index) that does not exist in the catalog. They are typically programming errors caught in development.
42P01 — undefined_table
A statement referenced a table or view that does not exist.
SELECT * FROM nonexistent_table;
-- ERROR 42P01: relation "nonexistent_table" does not exist
42703 — undefined_column
A statement referenced a column that does not exist in the specified table.
SELECT typo_column FROM users;
-- ERROR 42703: column "typo_column" does not exist in table "users"
42P07 — duplicate_table
CREATE TABLE was called for a table that already exists (without IF NOT EXISTS).
CREATE TABLE users (...);
CREATE TABLE users (...);
-- ERROR 42P07: relation "users" already exists
42701 — duplicate_column
ALTER TABLE ... ADD COLUMN was called for a column that already exists in
the table.
CREATE TABLE users (id BIGINT PRIMARY KEY, email TEXT NOT NULL);
ALTER TABLE users ADD COLUMN email TEXT;
-- ERROR 42701: column "email" already exists in table "users"
Fix: Use a different column name, or check the current schema with
DESCRIBE users before adding the column.
42702 — ambiguous_column
An unqualified column name appears in multiple tables in the FROM clause.
-- Both users and orders have a column named "id"
SELECT id FROM users JOIN orders ON orders.user_id = users.id;
-- ERROR 42702: column reference "id" is ambiguous
-- Fix: qualify the column
SELECT users.id FROM users JOIN orders ON orders.user_id = users.id;
Database Catalog Errors
These errors are surfaced primarily through the MySQL wire protocol when a
client uses CREATE DATABASE, DROP DATABASE, USE, the handshake database,
or COM_INIT_DB.
1049 (42000) — Unknown database
The requested database does not exist in the persisted catalog.
USE missing_db;
-- ERROR 1049 (42000): Unknown database 'missing_db'
This same error is returned if a client connects with database=missing_db in
the initial MySQL handshake.
Fix: create the database first with CREATE DATABASE missing_db, or switch
to an existing one from SHOW DATABASES.
1007 (HY000) — Database already exists
CREATE DATABASE was called for a name already present in the catalog.
CREATE DATABASE analytics;
CREATE DATABASE analytics;
-- ERROR 1007 (HY000): Can't create database 'analytics'; database exists
Fix: choose a different name, or treat the existing database as reusable.
1105 (HY000) — Active database cannot be dropped
The current connection attempted to drop the database it has selected.
USE analytics;
DROP DATABASE analytics;
-- ERROR 1105 (HY000): Can't drop database 'analytics'; database is currently selected
Fix: switch to another database such as axiomdb, then run DROP DATABASE.
Transaction Errors (Class 40)
40001 — serialization_failure
A concurrent write conflict was detected. The transaction must be retried.
-- Two transactions try to update the same row simultaneously.
-- The second one receives:
-- ERROR 40001: could not serialize access due to concurrent update
The application must catch this and retry the transaction. This is normal and expected behavior under high concurrency, not a bug.
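A retry wrapper for 40001 might look like the following sketch (the exception class and driver interface are assumptions; adapt them to your client library):

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for a driver exception carrying sqlstate == '40001'."""
    sqlstate = "40001"

def run_with_retry(txn_fn, max_attempts=5, base_delay=0.01):
    """Run txn_fn, retrying on serialization failure with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return txn_fn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise  # give up after max_attempts
            time.sleep(base_delay * (2 ** attempt) * random.random())

calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SerializationFailure()  # simulate two concurrent-update conflicts
    return "committed"

print(run_with_retry(flaky_txn))  # committed
```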
40P01 — deadlock_detected
Two transactions are each waiting for a lock held by the other.
-- Txn A holds lock on row 1, waiting for row 2
-- Txn B holds lock on row 2, waiting for row 1
-- → AxiomDB detects the cycle and aborts one transaction with 40P01
-- ERROR 40P01: deadlock detected
Prevention: Access rows in a consistent order across all transactions. If you always acquire locks on (accounts with lower id) before (accounts with higher id), deadlocks cannot form between two such transactions.
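The ordering rule is mechanical. A sketch, with account ids standing in for whatever rows a transaction locks:

```python
def lock_order(*row_ids):
    """Return ids in the canonical (ascending) order in which locks must be
    acquired, so that any two transactions agree on the acquisition order."""
    return sorted(row_ids)

# Both transfers acquire locks in the same order, so neither can end up
# holding one lock while waiting on the other:
print(lock_order(42, 7))  # [7, 42]  (transfer 42 -> 7)
print(lock_order(7, 42))  # [7, 42]  (transfer 7 -> 42)
```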
I/O and System Errors (Class 58)
58030 — io_error
The storage engine encountered an operating system I/O error.
ERROR 58030: could not write to file "axiomdb.db": No space left on device
Possible causes:
- Disk full — free space or expand the volume
- File permissions — ensure the AxiomDB process can write to the data directory
- Hardware error — check dmesg / system logs for disk errors
Syntax and Parse Errors (Class 42)
42601 — syntax_error
The SQL statement is not syntactically valid.
SELECT FORM users; -- 'FORM' is not a keyword
-- ERROR 42601: syntax error at or near "FORM"
-- Position: 8
42883 — undefined_function
A function name was called that does not exist.
SELECT unknown_function(1);
-- ERROR 42883: function "unknown_function" does not exist
Data Errors (Class 22)
22001 — string_data_right_truncation
A TEXT or VARCHAR value exceeds the column’s declared length.
CREATE TABLE codes (code CHAR(3));
INSERT INTO codes VALUES ('TOOLONG');
-- ERROR 22001: value too long for type CHAR(3)
22003 — numeric_value_out_of_range
A numeric value exceeds the range of its declared type.
INSERT INTO users (age) VALUES (99999); -- age is SMALLINT
-- ERROR 22003: integer out of range for type SMALLINT
22012 — division_by_zero
Division by zero in an arithmetic expression.
SELECT 10 / 0;
-- ERROR 22012: division by zero
22018 — invalid_character_value_for_cast
A value cannot be implicitly coerced to the target type. This error is raised when AxiomDB is in strict mode (the default) and a conversion is attempted that would discard data or is not defined.
-- Text with non-numeric characters inserted into an INT column (strict mode):
INSERT INTO users (age) VALUES ('42abc');
-- ERROR 22018: cannot coerce '42abc' (Text) to INT: '42abc' is not a valid integer
-- A type pair with no implicit conversion:
SELECT 3.14 + DATE '2026-01-01';
-- ERROR 22018: cannot coerce 3.14 (Real) to Date: no implicit numeric promotion between these types
Hint: Use explicit CAST for conversions that AxiomDB does not apply
automatically:
INSERT INTO users (age) VALUES (CAST('42' AS INT)); -- explicit — always works
SELECT CAST(3 AS REAL) + 1.5; -- explicit widening
Permissive mode: if your application requires MySQL-style lenient coercion
('42abc' silently converted to 42), disable strict mode for the session:
SET strict_mode = OFF; -- or: SET sql_mode = ''
In permissive mode, failed coercions fall back to a best-effort conversion and
emit warning 1265 instead of returning 22018. Use SHOW WARNINGS after
bulk loads to audit any truncated values. See
Strict Mode for full details.
Implicit coercions that always succeed (no error)
The following conversions happen automatically without raising 22018:
| From | To | Example |
|---|---|---|
| INT | BIGINT | 1 + 9999999999 → BIGINT |
| INT | REAL | 5 + 1.5 → Real(6.5) |
| INT | DECIMAL | 2 + 3.14 → Decimal(5.14) |
| BIGINT | REAL | 100 + 1.5 → Real(101.5) |
| BIGINT | DECIMAL | 100 + 3.14 → Decimal(103.14) |
| BIGINT | INT | only if value fits in INT range |
| TEXT | INT / BIGINT | '42' → 42 (strict: entire string must be a number) |
| TEXT | REAL | '3.14' → 3.14 |
| TEXT | DECIMAL | '3.14' → Decimal(314, 2) |
| DATE | TIMESTAMP | midnight UTC of the given date |
| NULL | any | always passes through as NULL |
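The strict-vs-permissive difference for TEXT → INT can be sketched in a few lines (function names are illustrative, not engine internals):

```python
def text_to_int_strict(s: str) -> int:
    """Strict mode: the entire string must parse as an integer, else 22018."""
    try:
        return int(s.strip())
    except ValueError:
        raise ValueError(f"22018: cannot coerce '{s}' (Text) to INT")

def text_to_int_permissive(s: str) -> int:
    """Permissive mode: best-effort prefix conversion (MySQL-style), never errors."""
    s = s.strip()
    i = 1 if s[:1] in ("+", "-") else 0
    while i < len(s) and s[i].isdigit():
        i += 1
    prefix = s[:i]
    return int(prefix) if prefix.lstrip("+-") else 0

print(text_to_int_strict("42"))         # 42
print(text_to_int_permissive("42abc"))  # 42
print(text_to_int_permissive("abc"))    # 0
```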
Connection Protocol Errors (Class 08)
MySQL 1153 / 08S01 — ER_NET_PACKET_TOO_LARGE
Returned when an incoming MySQL logical command payload exceeds the connection’s
current max_allowed_packet limit.
ERROR 1153 (08S01): Got a packet bigger than 'max_allowed_packet' bytes
What triggers it:
- A COM_QUERY whose SQL text exceeds @@max_allowed_packet bytes.
- A COM_STMT_PREPARE or COM_STMT_EXECUTE packet above the limit.
- A HandshakeResponse41 above the default 64 MiB limit (rare in practice).
- A multi-packet logical command whose total reassembled payload exceeds the limit, even if each individual physical fragment is below the limit.
What happens after the error: The server closes the connection immediately. The stream cannot be safely reused because the framing layer cannot determine where the next command begins.
Fix: Raise max_allowed_packet before sending the large command:
SET max_allowed_packet = 134217728; -- 128 MiB
Or reconnect after the error — the new connection starts with the server default.
SET max_allowed_packet affects only the current connection. Use it before
any statement whose payload may be large (e.g., bulk INSERT with many values, or
a BLOB upload via COM_STMT_EXECUTE).
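Client-side, you can pre-check the payload size before sending a bulk statement. A sketch (the 64 MiB default here is an assumption; query the server's actual @@max_allowed_packet in real code):

```python
DEFAULT_MAX_ALLOWED_PACKET = 64 * 1024 * 1024  # assumed default; verify via @@max_allowed_packet

def com_query_payload_len(sql: str) -> int:
    """A COM_QUERY logical payload is 1 command byte plus the UTF-8 SQL text."""
    return 1 + len(sql.encode("utf-8"))

def fits(sql: str, limit: int = DEFAULT_MAX_ALLOWED_PACKET) -> bool:
    return com_query_payload_len(sql) <= limit

print(fits("SELECT 1"))           # True
print(fits("x" * 100, limit=50))  # False
```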
Disk-Full Errors (Class 53)
53100 — disk_full
Returned when the OS reports that the volume is full (ENOSPC) or over quota
(EDQUOT) during a durable write — a WAL append, WAL fsync, storage grow, or
mmap flush.
ERROR 53100: disk full during 'wal commit fsync': no space left on device
HINT: The database volume is full or over quota. Free disk space and restart
the server to restore write access. The database is now in read-only
degraded mode.
What happens after the error:
AxiomDB enters read-only degraded mode immediately. In this mode:
| Statement type | Allowed? |
|---|---|
| SELECT, SHOW, EXPLAIN | ✅ Yes |
| SET (session variables) | ✅ Yes |
| INSERT, UPDATE, DELETE, TRUNCATE | ❌ No — returns 53100 |
| CREATE TABLE, DROP TABLE, DDL | ❌ No — returns 53100 |
| BEGIN, COMMIT, ROLLBACK | ❌ No — returns 53100 |
The mode persists until the server process is restarted. There is no way to return to read-write mode without restarting.
Fix:
- Free disk space or remove the quota restriction.
- Restart the server — AxiomDB will reopen in read-write mode if space is available.
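An application can mirror the allowed/blocked split to fail fast once it sees 53100. A sketch (classification by first keyword, deliberately simplistic):

```python
DEGRADED_MODE_ALLOWED = {"SELECT", "SHOW", "EXPLAIN", "SET"}

def allowed_in_degraded_mode(sql: str) -> bool:
    """True if the statement can still run once the server is read-only (53100)."""
    first = sql.lstrip().split(None, 1)[0].upper() if sql.strip() else ""
    return first in DEGRADED_MODE_ALLOWED

print(allowed_in_degraded_mode("SELECT * FROM t"))          # True
print(allowed_in_degraded_mode("INSERT INTO t VALUES (1)")) # False
```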
Complete SQLSTATE Reference
| SQLSTATE | Name | Common Cause |
|---|---|---|
| 21000 | cardinality_violation | Scalar subquery returned more than 1 row |
| 23505 | unique_violation | Duplicate value in UNIQUE / PK column |
| 23503 | foreign_key_violation | Referencing non-existent FK target |
| 23502 | not_null_violation | NULL inserted into NOT NULL column |
| 23514 | check_violation | Row failed a CHECK constraint |
| 40001 | serialization_failure | Write-write conflict; retry the txn |
| 40P01 | deadlock_detected | Circular lock dependency |
| 42P01 | undefined_table | Table does not exist |
| 42703 | undefined_column | Column does not exist |
| 42702 | ambiguous_column | Unqualified column name matches in 2+ tables |
| 42P07 | duplicate_table | Table already exists |
| 42701 | duplicate_column | Column already exists in table |
| 42601 | syntax_error | Malformed SQL |
| 42883 | undefined_function | Unknown function name |
| 22001 | string_data_right_truncation | Value too long for column type |
| 22003 | numeric_value_out_of_range | Number exceeds type bounds |
| 22012 | division_by_zero | Division by zero in expression |
| 22018 | invalid_character_value_for_cast | Implicit type coercion failed |
| 22P02 | invalid_text_representation | Invalid literal value |
| 42501 | insufficient_privilege | Permission denied on object |
| 42804 | datatype_mismatch | Type mismatch in expression |
| 25001 | active_sql_transaction | BEGIN inside an active transaction |
| 25P01 | no_active_sql_transaction | COMMIT/ROLLBACK with no active transaction |
| 25006 | read_only_sql_transaction | Transaction expired |
| 0A000 | feature_not_supported | SQL feature not yet implemented |
| 08S01 | connection_failure (MySQL ext) | Incoming packet exceeds max_allowed_packet |
| 53100 | disk_full | Storage volume is full |
| 58030 | io_error | OS-level I/O failure (disk, permissions) |
Performance
AxiomDB is designed to outperform MySQL on specific workloads by eliminating several layers of redundant work: double-buffering, the double-write buffer, row-by-row query evaluation, and thread-per-connection overhead. This page presents current benchmark numbers and guidance on how to write queries and schemas that stay fast.
Benchmark Results
All benchmarks run on Apple M2 Pro (12 cores), 32 GB RAM, NVMe SSD, single-threaded, warm data (all pages in OS page cache unless noted).
SQL Parser Throughput
| Query type | AxiomDB (logos lexer) | MySQL (approx.) | PostgreSQL (approx.) | Ratio vs MySQL |
|---|---|---|---|---|
| Simple SELECT (1 tbl) | 492 ns | ~500 ns | ~450 ns | 1.0× (parity) |
| Complex SELECT (JOINs) | 2.7 µs | ~4.0 µs | ~3.5 µs | 1.5× faster |
| DDL (CREATE TABLE) | 1.1 µs | ~2.5 µs | ~2.0 µs | 2.3× faster |
| Batch (100 stmts) | 47 µs | ~90 µs | ~75 µs | 1.9× faster |
Compared to sqlparser-rs (the common Rust SQL parser library):
| Query type | AxiomDB | sqlparser-rs | Ratio |
|---|---|---|---|
| Simple SELECT | 492 ns | 4.8 µs | 9.8× faster |
| Complex SELECT | 2.7 µs | 46 µs | 17× faster |
The speed advantage comes from two decisions:
- logos DFA lexer — compiles the token patterns to a Deterministic Finite Automaton at compile time. Token scanning is O(n) with a very small constant.
- Zero-copy tokens — Ident and QuotedIdent tokens are &'src str slices into the original input. No heap allocation occurs during lexing.
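The effect of a single-pass, non-allocating scan can be shown in miniature. This toy uses Python's regex engine rather than a compile-time DFA, and records (kind, start, end) spans into the source instead of copying token text, analogous to the &'src str slices:

```python
import re

TOKEN = re.compile(r"\s*(?:(?P<num>\d+)|(?P<ident>[A-Za-z_][A-Za-z_0-9]*)|(?P<op>[^\s\w]))")

def lex(sql: str):
    """One left-to-right pass; each token is a (kind, start, end) span into sql."""
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN.match(sql, pos)
        if m is None:
            break  # unrecognized input or trailing whitespace
        kind = m.lastgroup
        tokens.append((kind, m.start(kind), m.end(kind)))
        pos = m.end()
    return tokens

src = "SELECT id FROM t"
spans = lex(src)
print([src[s:e] for _, s, e in spans])  # ['SELECT', 'id', 'FROM', 't']
```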
Storage Engine Throughput
| Operation | AxiomDB | Target | Max acceptable | Status |
|---|---|---|---|---|
| B+ Tree point lookup (1M) | 1.2M ops/s | 800K ops/s | 600K ops/s | ✅ |
| Range scan 10K rows | 0.61 ms | 45 ms | 60 ms | ✅ |
| B+ Tree INSERT (storage only) | 195K ops/s | 180K ops/s | 150K ops/s | ✅ |
| Sequential scan 1M rows | 0.72 s | 0.8 s | 1.2 s | ✅ |
| Concurrent reads ×16 | linear | linear | <2× degradation | ✅ |
Wire Protocol Throughput (Phase 5.14)
End-to-end throughput measured via the MySQL wire protocol (pymysql client, autocommit mode, 1 connection, localhost). Includes: network round-trip, protocol encode/decode, parse, analyze, execute, WAL, MmapStorage.
| Operation | Throughput | Notes |
|---|---|---|
| COM_PING | 24,865 pings/s | Pure protocol overhead baseline |
| SET NAMES (intercepted) | 46,672 q/s | Handled in protocol layer, no SQL engine |
| SELECT 1 (autocommit) | 185 q/s | Full SQL pipeline, read-only |
| INSERT (autocommit, 1 fsync/stmt) | 58 q/s | Full SQL pipeline + fsync for durability |
The 185 q/s SELECT result reflects a 3.3× improvement in Phase 5.14 over the prior 56 q/s baseline. Read-only transactions (SELECT, SHOW, etc.) no longer fsync the WAL — see Benchmarks → Phase 5.14 for the technical explanation.
Remaining bottlenecks:
- INSERT (single connection): one fdatasync per autocommit statement; enable Group Commit for concurrent workloads (see below)
Primary-Key Lookups After 6.16
Phase 6.16 removes the planner blind spot that still treated WHERE id = ...
as a scan on PK-only tables. The PRIMARY KEY B+Tree is now used for single-table
equality and range lookups.
Measured with python3 benches/comparison/local_bench.py --scenario select_pk --rows 5000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| SELECT * FROM bench_users WHERE id = literal | 12.7K lookups/s | 13.4K lookups/s | 11.1K lookups/s |
The old debt was “planner never reaches the PK B+Tree”. That is now closed. The remaining gap is smaller and sits after planning: row materialization and MySQL packet serialization still cost more than MariaDB/MySQL on this path.
DELETE WHERE / UPDATE After 5.20
Phase 5.19 removed the old-key delete bottleneck for DELETE ... WHERE and the
old-key half of UPDATE. Phase 5.20 finishes the real UPDATE fix for the
benchmark schema by preserving the heap RecordId when the new row fits in the
same slot, which makes selective index skipping correct.
Measured with python3 benches/comparison/local_bench.py --scenario all --rows 50000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 |
|---|---|---|---|---|
| DELETE WHERE id > 25000 | 652K rows/s | 662K rows/s | 1.13M rows/s | 3.76M rows/s |
| UPDATE ... WHERE active = TRUE | 662K rows/s | 404K rows/s | 648K rows/s | 270K rows/s |
Compared to the 4.6K rows/s pre-5.19 DELETE-WHERE baseline that originally
triggered this work, AxiomDB now stays in the same order of magnitude as MySQL
and MariaDB on the same local benchmark. More importantly, compared to the
52.9K rows/s post-5.19 / pre-5.20 UPDATE baseline, the stable-RID path
raises AxiomDB UPDATE throughput to 648K rows/s on the same 50K-row benchmark.
The main remaining write-path bottleneck is now INSERT, not UPDATE.
Indexed UPDATE ... WHERE After 6.20
Phase 6.17 removed the old full-scan candidate discovery path for indexed
UPDATE predicates. Phase 6.20 then removed the dominant apply-side costs on
the default PK-range benchmark: candidate heap reads are batched by page,
no-op rows skip physical mutation, stable-RID rewrites batch their WAL append,
and index maintenance only runs when a key, predicate membership, or RID really
changes.
Measured with python3 benches/comparison/local_bench.py --scenario update_range --rows 5000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| UPDATE bench_users SET score = score + 1 WHERE id BETWEEN ... | 618K rows/s | 291K rows/s | 369.9K rows/s |
Compared to the 6.17 result (85.2K rows/s), the 6.20 apply fast path is a
4.3x improvement on the same benchmark and now exceeds the documented local
MySQL result. The remaining gap is specifically MariaDB’s tighter clustered-row
update path, not AxiomDB’s old discovery-side O(n) scan.
INSERT in Explicit Transactions After 5.21
Phase 5.21 adds transactional INSERT staging for consecutive
INSERT ... VALUES statements inside one explicit transaction. Instead of
writing heap + WAL + index roots per statement, AxiomDB now buffers eligible
rows and flushes them together on COMMIT or the next barrier statement.
Measured with python3 benches/comparison/local_bench.py --scenario insert --rows 50000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| 50K single-row INSERTs in 1 explicit txn | 28.0K rows/s | 26.7K rows/s | 23.9K rows/s |
PostgreSQL's heap_multi_insert() and DuckDB's appender both separate row
production from physical write. AxiomDB adapts that idea to SQL-visible transactions:
the connection keeps staged INSERT rows in memory, then flushes them in one grouped
heap/index pass when SQL semantics require visibility.
This path targets one specific workload: many separate INSERT statements inside
BEGIN ... COMMIT. Autocommit throughput remains a different problem and
depends on the server-side fsync path.
Multi-row INSERT on Indexed Tables After 6.18
Phase 6.18 fixes the immediate multi-row VALUES path for indexed tables. A
statement such as:
INSERT INTO bench_users VALUES
(1, 'u1', 18, TRUE, 100.0, 'u1@b.local'),
(2, 'u2', 19, FALSE, 100.1, 'u2@b.local'),
(3, 'u3', 20, TRUE, 100.2, 'u3@b.local');
now uses grouped heap/index apply even when the target table has a PRIMARY KEY
or secondary indexes. Before 6.18, that path still fell back to per-row
maintenance on indexed tables.
Measured with python3 benches/comparison/local_bench.py --scenario insert_multi_values --rows 5000 --table
on the benchmark schema with PRIMARY KEY (id):
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| insert_multi_values on PK table | 160,581 rows/s | 259,854 rows/s | 321,002 rows/s |
Prefer a single multi-row INSERT ... VALUES (...), (...) statement over many one-row INSERTs. This now benefits indexed tables too, while still rejecting duplicate PRIMARY KEY / UNIQUE values inside the same statement.
Prepared Statement Plan Cache (Phase 5.13)
COM_STMT_PREPARE compiles the SQL once (parse + analyze). Every subsequent
COM_STMT_EXECUTE reuses the compiled plan — no re-parsing, no catalog scan:
| Path | Per-execute cost |
|---|---|
| COM_QUERY (plain string) | parse + analyze + execute (~5 ms) |
| COM_STMT_EXECUTE — plan valid | substitute params + execute (~0.1 ms) — 50× faster |
| COM_STMT_EXECUTE — after DDL | re-analyze once, then fast path resumes |
Schema invalidation (correctness guarantee): after ALTER TABLE, DROP TABLE,
CREATE INDEX, etc., the cached plan is re-analyzed automatically on the next execute.
The schema_version counter in Database increments on every successful DDL; each
connection polls it lock-free (Arc<AtomicU64>) before each execute.
LRU eviction: each connection caches up to max_prepared_stmts_per_connection
(default 1024) compiled plans. The least-recently-used plan is evicted silently when
the limit is reached. Configurable in axiomdb.toml.
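The described behavior, an LRU bound plus a schema-version check, fits in a few lines. A sketch with assumed names:

```python
from collections import OrderedDict

class PlanCache:
    """Per-connection plan cache sketch: LRU eviction + schema_version invalidation."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._plans = OrderedDict()  # sql -> (schema_version, compiled_plan)

    def get(self, sql, schema_version, compile_fn):
        entry = self._plans.get(sql)
        if entry is not None and entry[0] == schema_version:
            self._plans.move_to_end(sql)       # fast path: reuse the compiled plan
            return entry[1]
        plan = compile_fn(sql)                 # first use, or stale after DDL
        self._plans[sql] = (schema_version, plan)
        self._plans.move_to_end(sql)
        if len(self._plans) > self.capacity:
            self._plans.popitem(last=False)    # evict the least-recently-used plan
        return plan

compiles = []
cache = PlanCache(capacity=2)
plan = lambda sql: compiles.append(sql) or f"plan({sql})"
cache.get("SELECT 1", 7, plan)  # compiled
cache.get("SELECT 1", 7, plan)  # cached, no recompile
cache.get("SELECT 1", 8, plan)  # schema changed, recompiled once
print(len(compiles))  # 2
```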
WAL Fsync Pipeline (6.19, closed with a documented gap)
Phase 6.19 replaced the old timer-based CommitCoordinator with an always-on
leader-based WAL fsync pipeline. The runtime behavior changed, but the key
single-connection autocommit benchmark remains a documented gap.
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_autocommit --rows 1000 --table --engines axiomdb
Current result:
| Benchmark | AxiomDB | Target | Status |
|---|---|---|---|
| insert_autocommit | 224 ops/s | >= 5,000 ops/s | ❌ |
The group_commit_lock design inspired the leader-based pipeline, and it does remove the old timer window. But under a strict MySQL request/response client, the server still waits for durability before sending OK, so the next statement cannot arrive while the fsync is in flight. The batching primitive is therefore correct, but it does not solve the sequential single-client benchmark by itself.
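The batching idea, minus threads, looks like this deliberately single-threaded sketch of the leader's role (names are mine):

```python
class FsyncPipeline:
    """Leader-based group commit sketch: committers queue; whoever leads the
    next flush pays one fsync for every transaction queued behind it."""

    def __init__(self):
        self.queue = []
        self.fsync_count = 0

    def submit(self, txn_id):
        self.queue.append(txn_id)

    def leader_flush(self):
        """One durable fsync acknowledges the whole batch."""
        if not self.queue:
            return []
        self.fsync_count += 1  # a single fdatasync covers the batch
        batch, self.queue = self.queue, []
        return batch

p = FsyncPipeline()
for t in ("t1", "t2", "t3"):
    p.submit(t)
print(p.leader_flush(), p.fsync_count)  # ['t1', 't2', 't3'] 1
```

A single synchronous client never queues more than one transaction at a time, which is why the sequential autocommit benchmark sees no benefit from batching.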
End-to-End INSERT Throughput
Full pipeline: parse → analyze → execute → WAL → MmapStorage. Measured with
executor_e2e benchmark (MmapStorage + real WAL, release build, Apple M2 Pro NVMe).
| Configuration | AxiomDB | MariaDB ~ | Status |
|---|---|---|---|
| INSERT 10K rows / N separate SQL strings / 1 txn | 35K rows/s | 140K rows/s | ⚠️ |
| INSERT 10K rows / 1 multi-row SQL string | 211K rows/s | 140K rows/s | ✅ 1.5× faster |
| INSERT autocommit (1 visible commit/stmt, wire protocol) | 224 q/s | — | ⚠️ (closed subphase, open perf gap) |
With a single multi-row statement, INSERT INTO t VALUES (r1),(r2),...,(rN), AxiomDB reaches 211K rows/s
vs MariaDB's ~140K rows/s — 1.5× faster on bulk inserts. The gap comes
from three combined optimizations: O(P) heap writes via HeapChain::insert_batch,
O(1) WAL writes via record_insert_batch (Phase 3.17), and a single
parse+analyze pass for all N rows (Phase 4.16c). MariaDB pays a clustered B-Tree insert
per row plus UNDO log write before each page modification.
How to achieve this throughput in your application:
-- Fast: one SQL string with N value rows (211K rows/s)
INSERT INTO orders (user_id, amount) VALUES
(1, 49.99), (2, 12.50), (3, 99.00), -- ... up to thousands of rows
(1000, 7.99);
-- Slower: N separate INSERT strings (35K rows/s — parse+analyze per row)
INSERT INTO orders VALUES (1, 49.99);
INSERT INTO orders VALUES (2, 12.50);
-- ...
The difference between the two approaches is 6× in throughput. The bottleneck in the per-string case is parse + analyze overhead per SQL string (~20 µs/string), not the storage write.
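A loader can build the fast form programmatically. A sketch (real code should use parameter binding or proper SQL escaping; repr-quoting here is only for illustration):

```python
def multi_row_inserts(table, columns, rows, rows_per_stmt=1000):
    """Yield multi-row INSERT statements, chunked to bound statement size."""
    col_list = ", ".join(columns)
    for i in range(0, len(rows), rows_per_stmt):
        chunk = rows[i:i + rows_per_stmt]
        # NOTE: repr() is not safe SQL quoting; use placeholders in production.
        values = ", ".join("(" + ", ".join(map(repr, row)) + ")" for row in chunk)
        yield f"INSERT INTO {table} ({col_list}) VALUES {values};"

stmts = list(multi_row_inserts(
    "orders", ["user_id", "amount"],
    [(1, 49.99), (2, 12.5), (3, 99.0)], rows_per_stmt=2,
))
print(len(stmts))  # 2
print(stmts[0])    # INSERT INTO orders (user_id, amount) VALUES (1, 49.99), (2, 12.5);
```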
Four-Engine Native Benchmark (2026-03-24)
All four engines measured locally on Apple M2 Pro, same machine, no Docker overhead,
10,000-row table (id BIGINT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100),
value INT). Each engine was given equivalent hardware resources.
Engines tested:
- MariaDB 12.1 — port 3306
- MySQL 8.0 — port 3310
- PostgreSQL 16 — port 5433
- AxiomDB — port 3309
| Operation | MariaDB 12.1 | MySQL 8.0 | PostgreSQL 16 | AxiomDB |
|---|---|---|---|---|
| INSERT batch (10K rows, 1 stmt) | 558 ms · 18K r/s | 628 ms · 16K r/s | 786 ms · 13K r/s | 275 ms · 36K r/s |
| SELECT * (10K rows, full scan) | 62 ms · 162K r/s | 53 ms · 189K r/s | 4 ms · 2.3M r/s | 47 ms · 212K r/s |
| DELETE (no WHERE, 10K rows) | 31 ms · 323K r/s | 407 ms · 25K r/s | 47 ms · 212K r/s | 9.6 ms · 1M r/s |
INSERT batch — 2× faster than MariaDB
AxiomDB reaches 36K r/s vs MariaDB’s 18K r/s (2× faster) and MySQL’s 16K r/s
(2.25× faster). The gap comes from the same three optimizations described above:
HeapChain::insert_batch() (O(P) page writes), record_insert_batch() (O(1) WAL
write), and a single parse+analyze pass for all N rows.
SELECT * — on par with MySQL, 11× behind PostgreSQL
AxiomDB SELECT (212K r/s) is marginally faster than MySQL 8.0 (189K r/s) and on par with the full-pipeline expectation. PostgreSQL’s 2.3M r/s reflects its shared buffer pool: after the first scan, all 10K rows fit in PostgreSQL’s hot in-memory buffer and subsequent queries never touch disk. AxiomDB’s mmap approach relies on the OS page cache for the same effect — the gap closes when pages are hot, but PostgreSQL’s buffer pool gives it an edge on repeated same-connection scans because it bypasses the OS cache layer entirely.
DELETE (no WHERE) — 3× faster than MariaDB, 40× faster than MySQL
AxiomDB deletes 10,000 rows in 9.6 ms (1M r/s). MariaDB takes 31 ms; MySQL 8.0 takes 407 ms. The AxiomDB advantage comes from two optimizations working together:
- `WalEntry::Truncate` — a single 51-byte WAL entry replaces 10,000 per-row `Delete` entries. MySQL InnoDB writes one undo log record per row before marking it deleted — for 10K rows this is 10K undo writes plus 10K page modifications.
- `HeapChain::delete_batch()` — groups deletions by page, reads each page once, marks all slots dead, writes back once. 10K rows across 50 pages = 100 page operations instead of 30,000.

Together, a full-table DELETE emits one `WalEntry::Truncate` and processes all deletions in O(P) page I/O, where P = number of pages ≈ 50 for 10K rows.
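The page-grouping step can be sketched as follows; the types and function are illustrative, not AxiomDB's actual API:

```rust
use std::collections::BTreeMap;

// Sketch of the page-grouping idea behind a batch delete: group record IDs
// by page so each page is read and written once, regardless of how many
// slots on it are being deleted.
type PageId = u64;
type SlotId = u16;

fn group_by_page(rids: &[(PageId, SlotId)]) -> BTreeMap<PageId, Vec<SlotId>> {
    let mut groups: BTreeMap<PageId, Vec<SlotId>> = BTreeMap::new();
    for &(page, slot) in rids {
        groups.entry(page).or_default().push(slot);
    }
    groups
}

fn main() {
    // 10,000 rows spread over 50 pages → 50 page groups,
    // i.e. 50 reads + 50 writes instead of 10,000 of each.
    let rids: Vec<(PageId, SlotId)> = (0..10_000)
        .map(|i| ((i / 200) as PageId, (i % 200) as SlotId))
        .collect();
    let groups = group_by_page(&rids);
    assert_eq!(groups.len(), 50);
    assert!(groups.values().all(|slots| slots.len() == 200));
}
```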
Row Codec Throughput
| Operation | Throughput | Notes |
|---|---|---|
| Encode row | 33M rows/s | 5-column row, mixed types |
| Decode row | 28M rows/s | Same row layout |
| encoded_len() | O(n) no alloc | Only computes the size, no buffer |
Row encoding is fast because:
- The codec iterates values once with a fixed dispatch per type.
- The null bitmap is written as bytes with bit shifts — no per-column branch on NULL.
- Variable-length types (Text, Bytes) use a 3-byte length prefix that avoids the 4-byte overhead of a full u32.
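The second and third points can be sketched in a few lines; the helper names are hypothetical:

```rust
// Sketch of the two encoding tricks described above: a null bitmap packed
// with bit shifts (no per-column branch), and a 3-byte (u24 little-endian)
// length prefix for variable-length values.
fn encode_null_bitmap(nulls: &[bool]) -> Vec<u8> {
    let mut bitmap = vec![0u8; (nulls.len() + 7) / 8];
    for (i, &is_null) in nulls.iter().enumerate() {
        // branch-free: the bool is cast and shifted into position
        bitmap[i / 8] |= (is_null as u8) << (i % 8);
    }
    bitmap
}

fn encode_u24_len(len: usize) -> [u8; 3] {
    assert!(len < 1 << 24);
    [len as u8, (len >> 8) as u8, (len >> 16) as u8] // little-endian u24
}

fn main() {
    // Columns 0 and 3 NULL out of 5 → one bitmap byte 0b0000_1001.
    assert_eq!(
        encode_null_bitmap(&[true, false, false, true, false]),
        vec![0b0000_1001]
    );
    // A 300-byte Text value: 3-byte prefix instead of a full 4-byte u32.
    assert_eq!(encode_u24_len(300), [0x2C, 0x01, 0x00]);
}
```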
Why AxiomDB Is Fast — Architecture Reasons
1. No Double-Buffering
MySQL InnoDB maintains its own Buffer Pool in addition to the OS page cache. The same data lives in RAM twice.
MySQL: Disk → OS page cache → InnoDB Buffer Pool → Query
(copy 1) (copy 2)
AxiomDB: Disk → OS page cache → Query
(mmap — single copy)
AxiomDB uses mmap to map the .db file directly. The OS page cache IS the
buffer. When a page is hot, it is served from L2/L3 cache with zero copies.
2. No Double-Write Buffer
MySQL writes each 16 KB page to a special “doublewrite buffer” area on disk before writing it to its actual location. This prevents torn-page corruption but costs two disk writes per page.
AxiomDB uses a WAL + per-page CRC32c checksum. The WAL record is small (tens of bytes for the changed key-value pair). On recovery, AxiomDB replays the WAL to reconstruct any page that has a checksum mismatch. No doublewrite buffer needed.
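The checksum side of this scheme can be illustrated with a minimal bitwise CRC-32C; this is a sketch of the torn-page detection idea, not AxiomDB's implementation:

```rust
// Minimal bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78) —
// the checksum family the per-page checksums use. Real implementations use
// table-driven or hardware (SSE4.2 crc32) variants for speed.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

fn main() {
    // Standard CRC-32C check value for the ASCII string "123456789".
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);

    // Recovery sketch: a page whose stored checksum no longer matches its
    // body is treated as torn and reconstructed by replaying the WAL.
    let stored = crc32c(&[0u8; 64]);
    let body_after_torn_write = [1u8; 64];
    assert_ne!(stored, crc32c(&body_after_torn_write)); // mismatch → replay WAL
}
```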
3. Lock-Free Concurrent Reads
The Copy-on-Write B+ Tree uses an AtomicU64 to store the root page ID. Readers
load the root pointer with Acquire semantics and traverse the tree without acquiring
any lock. Writers swap the root pointer with Release semantics after finishing the
copy chain.
A running SELECT does not stall any INSERT or UPDATE. Both proceed in parallel.
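The root-swap protocol can be sketched with plain std atomics (the type and method names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the lock-free root-pointer protocol described above: readers
// load the root page ID with Acquire and never take a lock; a writer
// publishes a new root with Release after finishing its copy chain.
struct CowTree {
    root: AtomicU64, // root page ID
}

impl CowTree {
    fn read_root(&self) -> u64 {
        // Acquire pairs with the writer's Release: everything the writer
        // wrote into the new copy chain is visible once we see the new root.
        self.root.load(Ordering::Acquire)
    }

    fn publish_root(&self, new_root: u64) {
        // Called only after all copied pages are fully written.
        self.root.store(new_root, Ordering::Release)
    }
}

fn main() {
    let tree = CowTree { root: AtomicU64::new(1) };
    assert_eq!(tree.read_root(), 1);
    tree.publish_root(2); // writer swaps in the new copy chain
    assert_eq!(tree.read_root(), 2);
}
```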
4. Async I/O with Tokio
The server mode uses Tokio async I/O. 1,000 concurrent connections run on approximately 8 OS threads. MySQL’s thread-per-connection model requires 1,000 OS threads for 1,000 connections, consuming ~8 GB in stack space alone.
Performance Budget
The following table defines the minimum acceptable performance for each critical operation. Benchmarks that fall below the “acceptable maximum” column are treated as blockers before any phase is closed.
| Operation | Target | Acceptable maximum |
|---|---|---|
| Point lookup (PK) | 800K ops/s | 600K ops/s |
| Range scan 10K rows | 45 ms | 60 ms |
| B+ Tree INSERT with WAL (storage only) | 180K ops/s | 150K ops/s |
| INSERT end-to-end 10K batch (Phase 8) | 180K ops/s | 150K ops/s |
| SELECT via wire protocol (autocommit) | — | — |
| INSERT via wire protocol (autocommit) | — | — |
| Sequential scan 1M rows | 0.8 s | 1.2 s |
| Concurrent reads ×16 | linear | <2× degradation |
| Parser (simple SELECT) | 600 ns | 1 µs |
| Parser (complex SELECT) | 3 µs | 6 µs |
Index Usage Guide
Rules of Thumb
- Every foreign key column needs an index — AxiomDB does not auto-index FK columns. Without an index, every FK check during DELETE/UPDATE scans the child table linearly.
- Put the most selective column first in composite indexes — a query filtering `WHERE user_id = 42 AND status = 'paid'` benefits most from `(user_id, status)` if `user_id` is more selective (fewer distinct values match).
- Covering indexes eliminate heap lookups — if all columns in a SELECT are in the index, AxiomDB returns results directly from the index without touching heap pages.
- Partial indexes reduce size — `CREATE INDEX ... WHERE deleted_at IS NULL` indexes only active rows. If 90% of rows are soft-deleted, the partial index is 10× smaller than a full index.
- BIGINT AUTO_INCREMENT beats UUID v4 for PK — UUID v4 inserts at random positions in the B+ Tree, causing ~40% more page splits than sequential integers. Use UUID v7 if you need UUIDs (time-sortable prefix).
Query Patterns to Avoid
Unindexed range scans on large tables
```sql
-- Slow: scans every row in orders (no index on placed_at)
SELECT * FROM orders WHERE placed_at > '2026-01-01';

-- Fix: create the index
CREATE INDEX idx_orders_date ON orders (placed_at);
```
Leading wildcard LIKE
```sql
-- Slow: cannot use index on 'name' (leading %)
SELECT * FROM users WHERE name LIKE '%smith%';

-- Better: full-text search index (planned Phase 8)
-- Acceptable workaround for small tables: use LOWER() + LIKE on indexed column
```
SELECT * with wide rows
```sql
-- Fetches all columns including large TEXT blobs for every row
SELECT * FROM documents WHERE category_id = 5;

-- Better: select only what the UI needs
SELECT id, title, created_at FROM documents WHERE category_id = 5;
```
NOT IN with nullable subquery
```sql
-- Returns 0 rows if the subquery contains a single NULL
SELECT * FROM orders WHERE user_id NOT IN (SELECT id FROM banned_users);

-- Fix: filter NULLs explicitly
SELECT * FROM orders WHERE user_id NOT IN (
    SELECT id FROM banned_users WHERE id IS NOT NULL
);
```
Measuring Performance
EXPLAIN (planned)
```sql
EXPLAIN SELECT * FROM orders WHERE user_id = 42 ORDER BY placed_at DESC;
```
Running the Built-in Benchmarks
```shell
# B+ Tree benchmarks
cargo bench --bench btree -p axiomdb-index

# Storage engine benchmarks
cargo bench --bench storage -p axiomdb-storage

# Compare before/after an optimization
cargo bench -- --save-baseline before
# ... make change ...
cargo bench -- --baseline before
```
Benchmarks use Criterion.rs and report mean, standard deviation, and throughput
in a format compatible with critcmp for historical comparison.
Optimization Results — All-Visible Flag + Prefetch (2026-03-24)
Two storage-level optimizations implemented on branch research/pg-internals-comparison,
inspired by PostgreSQL internals analysis:
All-Visible Page Flag (optim-A)
After the first sequential scan on a stable table (all rows committed, none deleted),
AxiomDB sets bit 0 of PageHeader.flags. Subsequent scans skip per-slot MVCC
visibility tracking for those pages — 1 flag check per page instead of N per-slot
comparisons.
Impact on DELETE: scan_rids_visible() (used before batch delete) goes faster
because most pages are all-visible after INSERT → COMMIT. Measured improvement on
10K-row DELETE: 10 ms → 7 ms (30% faster).
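The fast path can be sketched as follows (field and flag names are illustrative):

```rust
// Sketch of the all-visible fast path (optim-A): if bit 0 of the page's
// flags is set, every slot on the page is visible to every snapshot, so
// the scan can skip per-slot MVCC checks entirely.
const ALL_VISIBLE: u8 = 1 << 0;

struct PageHeader {
    flags: u8,
}

fn count_visible(header: &PageHeader, slot_xmins: &[u64], snapshot_xid: u64) -> usize {
    if header.flags & ALL_VISIBLE != 0 {
        // 1 flag check per page instead of N per-slot comparisons.
        return slot_xmins.len();
    }
    // Simplified per-slot visibility: row is visible if its creating
    // transaction committed before the snapshot.
    slot_xmins.iter().filter(|&&xmin| xmin <= snapshot_xid).count()
}

fn main() {
    let hot = PageHeader { flags: ALL_VISIBLE };
    let cold = PageHeader { flags: 0 };
    let xmins = [5, 10, 15, 20];
    assert_eq!(count_visible(&hot, &xmins, 12), 4); // flag short-circuits
    assert_eq!(count_visible(&cold, &xmins, 12), 2); // per-slot check: 5, 10
}
```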
Sequential Scan Prefetch Hint (optim-C)
MmapStorage now calls madvise(MADV_SEQUENTIAL) before every sequential heap
scan. The OS kernel begins async read-ahead for following pages, overlapping I/O
with processing of the current page.
Impact: Measurable on cold-cache workloads (pages not in OS page cache). No regression on warm cache.
Benchmark after both optimizations (wire protocol, Apple M2 Pro)
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 (warm) |
|---|---|---|---|---|
| INSERT batch 10K | 150ms · 67K r/s | 301ms · 33K r/s | 278ms · 36K r/s | 737ms · 14K r/s |
| SELECT * 10K | 53ms · 188K r/s | 48ms · 208K r/s | 49ms · 206K r/s | 5ms · 2.1M r/s |
| DELETE 10K (no WHERE) | 13ms · 779K r/s | 102ms · 98K r/s | 7ms · 1.4M r/s | 6ms · 1.6M r/s |
The combination of `WalEntry::Truncate` (1 WAL entry instead of N) and the
all-visible flag (skipping MVCC scan overhead) eliminates the two main costs
in full-table deletion.
Architecture Overview
AxiomDB is organized as a Cargo workspace of purpose-built crates. Each crate has a single responsibility and depends only on crates below it in the stack. The layering prevents circular dependencies and makes each component independently testable.
Layer Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ ENTRY POINTS │
│ │
│ axiomdb-server axiomdb-embedded │
│ (TCP daemon, (Rust API + C FFI, │
│ MySQL wire protocol) in-process library) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ NETWORK LAYER │
│ │
│ axiomdb-network │
│ └── mysql/ │
│ ├── codec.rs (MySqlCodec — 4-byte packet framing) │
│ ├── packets.rs (HandshakeV10, HandshakeResponse41, OK, ERR) │
│ ├── auth.rs (mysql_native_password SHA1 + caching_sha2_password)│
│ ├── charset.rs (charset/collation registry, encode_text/decode_text)│
│ ├── session.rs (ConnectionState — typed charset fields, │
│ │ prepared stmt cache, pending long data) │
│ ├── handler.rs (handle_connection — async task per TCP conn) │
│ ├── result.rs (QueryResult → result-set packets, charset-aware)│
│ ├── error.rs (DbError → MySQL error code + SQLSTATE) │
│ └── database.rs (Arc<RwLock<Database>> wrapper) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ axiomdb-sql │
│ ├── lexer (logos DFA, zero-copy tokens) │
│ ├── parser (recursive descent, LL(1)/LL(2)) │
│ ├── ast (Stmt, Expr, SelectStmt, InsertStmt, ...) │
│ ├── analyzer (BindContext, col_idx resolution, catalog lookup) │
│ ├── eval (expression evaluator, three-valued NULL logic, │
│ │ CASE WHEN searched + simple form, short-circuit) │
│ ├── result (QueryResult, ColumnMeta, Row — executor return type)│
│ ├── table (TableEngine — heap DML; clustered guard rails today)│
│ ├── index_integrity (startup index-vs-heap verifier; skips clustered)│
│ └── executor/ (mod.rs facade + select/insert/update/delete/ddl/ │
│ join/aggregate/shared modules; same execute() API; │
│ GROUP BY + HAVING + ORDER BY + LIMIT/OFFSET + │
│ INSERT … SELECT) │
│ │
│ [query planner, optimizer — Phase 6] │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ TRANSACTION LAYER │
│ │
│ axiomdb-mvcc (TxnManager, snapshot isolation, SSI) │
│ axiomdb-wal (WalWriter, WalReader, crash recovery) │
│ axiomdb-catalog (CatalogBootstrap, CatalogReader, schema) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ INDEX LAYER │
│ │
│ axiomdb-index (BTree CoW, RangeIter, prefix compression) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ STORAGE LAYER │
│ │
│ axiomdb-storage (StorageEngine trait, MmapStorage, │
│ MemoryStorage, FreeList, heap pages) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ TYPE FOUNDATION │
│ │
│ axiomdb-types (Value, DataType, row codec) │
│ axiomdb-core (DbError, RecordId, TransactionSnapshot, │
│ PageId, LsnId, common types) │
└─────────────────────────────────────────────────────────────────────┘
│
┌──────────▼────────┐
│ axiomdb.db │ ← mmap pages (16 KB each)
│ axiomdb.wal │ ← WAL append-only log
└───────────────────┘
Crate Responsibilities
axiomdb-core
The dependency-free foundation. Contains:
- `DbError` — the single error enum used by all other crates, built on `thiserror`
- `dsn` — shared DSN parser and typed normalized output: `ParsedDsn`, `WireEndpointDsn`, `LocalPathDsn`
- `RecordId` — physical location of a row: `(page_id: u64, slot_id: u16)`, 10 bytes
- `TransactionSnapshot` — snapshot ID and visibility predicate for MVCC
- `PageId`, `LsnId` — type aliases that document intent

axiomdb-core sits at the bottom of the stack and depends on no other crate in the workspace. Keeping the `dsn` module in axiomdb-core lets each consumer validate only the subset it actually supports. This avoids duplicating URI logic in both axiomdb-server and axiomdb-embedded.
axiomdb-types
SQL value representation and binary serialization:
- `Value` — the in-memory enum (`Null`, `Bool`, `Int`, `BigInt`, `Real`, `Decimal`, `Text`, `Bytes`, `Date`, `Timestamp`, `Uuid`)
- `DataType` — schema descriptor for a column's type (mirrors `axiomdb-core::DataType` but with the full type system, including parameterized types)
- `encode_row` / `decode_row` — binary codec from `&[Value]` to `&[u8]` and back
- `encoded_len` — O(n) size computation without allocation
axiomdb-storage
The raw page I/O layer:
- `StorageEngine` trait — `read_page`, `write_page`, `alloc_page`, `free_page`, `flush`
- `MmapStorage` — maps the `.db` file with `memmap2`; pages are directly accessible as `&Page` references into the mapped region
- `MemoryStorage` — `Vec<Page>` in RAM for tests and in-memory databases
- `FreeList` — bitmap tracking free pages; scans left-to-right for the first free bit
- `Page` — 16 KB struct with 64-byte header (magic, type, checksum, page_id, LSN, free_start, free_end) and 16,320-byte body
- Heap page format — slotted page with null bitmap and tuples growing from the end toward the beginning
- Same-slot tuple rewrite helpers — used by the stable-RID UPDATE path to overwrite a row in place when the new encoded row still fits inside the existing slot
axiomdb-index
The Copy-on-Write B+ Tree:
- `BTree` — the public tree type; wraps a `StorageEngine` and an `AtomicU64` root
- `RangeIter` — lazy iterator for range scans; traverses the tree to cross leaf boundaries
- `InternalNodePage` / `LeafNodePage` — `#[repr(C)]` structs with `bytemuck::Pod` for zero-copy serialization
- `prefix` module — `CompressedNode` for in-memory prefix compression of internal keys
axiomdb-wal
Append-only Write-Ahead Log:
- `WalWriter` — appends `WalEntry` records with CRC32c checksums; manages the file header
- `WalReader` — stateless; opens a file handle per scan; supports both forward and backward iteration (backward scan uses `entry_len_2` at the tail of each record)
- `WalEntry` — binary-serializable record with LSN, txn_id, entry type, table_id, key, old_value, new_value, and checksum
- `EntryType::UpdateInPlace` — stable-RID same-slot UPDATE record used by rollback and crash recovery to restore the old tuple image at the same `(page_id, slot_id)`
- Crash recovery state machine — `CRASHED → RECOVERING → REPLAYING_WAL → VERIFYING → READY`
axiomdb-catalog
Schema persistence and lookup:
- `CatalogBootstrap` — creates the three system tables (`axiom_tables`, `axiom_columns`, `axiom_indexes`) in the meta page on first open
- `CatalogReader` — reads schema from the system tables for use by the analyzer and executor; uses a `TransactionSnapshot` for MVCC-consistent reads
- Schema types: `TableDef`, `ColumnDef`, `IndexDef`
- `TableDef` now carries `root_page_id` plus `TableStorageLayout::{Heap, Clustered}`
- `CatalogWriter::create_table_with_layout(...)` allocates either a heap or clustered table root
axiomdb-mvcc
Transaction management and snapshot isolation:
- `TxnManager` — assigns transaction IDs, tracks active transactions, assigns snapshots on `BEGIN`
- `RowHeader` — embedded in each heap row: `(xmin, xmax, deleted)` for visibility
- MVCC visibility function — determines whether a row version is visible to a snapshot
axiomdb-sql
The SQL processing pipeline:
- `lexer` — logos-based DFA; ~85 tokens; zero-copy `&'src str` identifiers
- `ast` — all statement types: `SelectStmt`, `InsertStmt`, `UpdateStmt`, `DeleteStmt`, `CreateTableStmt`, `CreateIndexStmt`, `DropTableStmt`, `DropIndexStmt`, `AlterTableStmt`
- `expr` — `Expr` enum for the expression tree: `BinaryOp`, `UnaryOp`, `Column`, `Literal`, `IsNull`, `Between`, `Like`, `In`, `Case`, `Function`, `Param { idx: usize }` (positional `?` placeholder resolved at execute time)
- `parser` — recursive descent; expression sub-parser with full operator precedence; parses `GROUP BY`, `HAVING`, `ORDER BY` with `NULLS FIRST/LAST`, `LIMIT/OFFSET`, `SELECT DISTINCT`, `INSERT … SELECT`, and both forms of `CASE WHEN`
- `analyzer` — `BindContext` / `BoundTable`; resolves `col_idx` for JOINs
- `eval/` — directory module rooted at `eval/mod.rs`; exports the same evaluator API as before, but splits internals into `context.rs` (collation and subquery runners), `core.rs` (recursive `Expr` evaluation), `ops.rs` (comparisons, boolean logic, `IN`, `LIKE`), and `functions/` (scalar built-ins by family)
- `result` — `QueryResult` enum (`Rows`/`Affected`/`Empty`), `ColumnMeta` (name, data_type, nullable, table_name), `Row = Vec<Value>`; the contract between the executor and all callers (embedded API, wire protocol, CLI)
- `table` — `TableEngine` — heap DML; clustered guard rails today
- `index_integrity` — startup-time verification that compares every catalog-visible index against heap-visible rows after WAL recovery and rebuilds readable divergent indexes before open returns; clustered tables are currently skipped because their PRIMARY KEY metadata reuses the clustered root
- `executor/` — directory module rooted at `executor/mod.rs`; the facade still exports `execute`, `execute_with_ctx`, and `last_insert_id_value`, but the implementation is now split into `shared.rs`, `select.rs`, `joins.rs`, `aggregate.rs`, `insert.rs`, `update.rs`, `delete.rs`, `bulk_empty.rs`, `ddl.rs`, and `staging.rs`. Capabilities remain the same: `GROUP BY` with hash-based aggregation (`COUNT(*)`, `COUNT(col)`, `SUM`, `MIN`, `MAX`, `AVG` with proper NULL exclusion), `HAVING` post-filter, `ORDER BY` with multi-column sort keys and per-column `NULLS FIRST/LAST` control, `LIMIT n OFFSET m` for pagination, `SELECT DISTINCT` with NULL-equality dedup (two NULL values are considered equal for deduplication), and `INSERT … SELECT` for bulk copy and aggregate materialization
- Clustered tables now enter the catalog through `CREATE TABLE ... PRIMARY KEY ...`:
  - 39.14 adds a dedicated clustered `INSERT` branch in `executor/insert.rs`
  - 39.15 adds a dedicated clustered `SELECT` branch in `executor/select.rs`
  - 39.16 adds a dedicated clustered `UPDATE` branch in `executor/update.rs`
  - 39.17 adds a dedicated clustered `DELETE` branch in `executor/delete.rs`
  - 39.18 adds clustered `VACUUM` maintenance in `axiomdb-sql/src/vacuum.rs`
  - 39.19 adds legacy heap→clustered rebuild in `executor/ddl.rs`
- Stable-RID UPDATE fast path — same-slot heap rewrite that preserves `RecordId` when the new encoded row fits and makes untouched-index skipping safe
- UPDATE apply fast path — indexed UPDATE now batches candidate heap reads, filters no-op rows before heap mutation, batches `UpdateInPlace` WAL append, and groups per-index delete+insert/root persistence on the remaining rows
- Transactional INSERT staging — explicit transactions can buffer consecutive `INSERT ... VALUES` rows in `SessionContext`, then flush them through one grouped heap/index pass at the next barrier statement or `COMMIT`
- Indexed multi-row INSERT batch path — the immediate `INSERT ... VALUES (...), (...)` path now reuses the same grouped physical apply helpers as staged flushes even when the table has PRIMARY KEY or secondary indexes; the immediate path keeps strict same-statement UNIQUE checking and therefore does not reuse the staged `committed_empty` shortcut
- Clustered INSERT branch — explicit-PK tables now bypass heap staging entirely, derive PK bytes from clustered primary-index metadata, write directly through `clustered_tree`, maintain clustered secondary bookmarks, and make rollback delete undo keys from the current catalog root instead of trusting stale pre-split roots
- Clustered rebuild branch — legacy heap+PRIMARY KEY tables now rebuild into a fresh clustered root, rebuild secondaries as PK-bookmark indexes, flush those new roots, then swap catalog metadata and defer old-page free until commit

PostgreSQL's `heap_multi_insert()` and DuckDB's appender both inspired the shared grouped-write layer. AxiomDB adapts that physical apply pattern, but rejects reusing the staged bulk-load shortcut on immediate multi-row INSERT because duplicate keys inside one SQL statement must still fail atomically and before any partial batch becomes visible.
axiomdb-network
The MySQL wire protocol implementation. Lives in crates/axiomdb-network/src/mysql/:
| Module | Responsibility |
|---|---|
| codec.rs | MySqlCodec — tokio_util framing codec; reads/writes the 4-byte header (u24 LE payload length + u8 sequence ID) |
| packets.rs | Builders for HandshakeV10, HandshakeResponse41, OK, ERR, EOF; length-encoded integer/string helpers |
| auth.rs | gen_challenge (20-byte CSPRNG), verify_native_password (SHA1-XOR), is_allowed_user allowlist |
| charset.rs | Static charset/collation registry; decode_text/encode_text using encoding_rs; supports utf8mb4, utf8mb3, latin1 (cp1252), binary |
| session.rs | ConnectionState — typed client_charset, connection_collation, results_collation fields; SET NAMES; decode_client_text/encode_result_text |
| handler.rs | handle_connection — async task per TCP connection; explicit CONNECTED → AUTH → IDLE → EXECUTING → CLOSING lifecycle |
| result.rs | serialize_query_result — QueryResult → column_count + column_defs + EOF + rows + EOF packets; charset-aware row encoding |
| error.rs | dberror_to_mysql — maps every DbError variant to a MySQL error code + SQLSTATE |
| database.rs | Database wrapper — owns storage + txn, runs WAL recovery and startup index verification, exposes execute_query |
Connection lifecycle
TCP accept
│
▼ (seq 0)
Server → HandshakeV10
│ 20-byte random challenge, capabilities, server version
│ auth_plugin_name = "caching_sha2_password"
│
▼ (seq 1)
Client → HandshakeResponse41
│ username, auth_response (SHA1-XOR token or caching_sha2 token),
│ capabilities, auth_plugin_name
│
▼ (seq 2) — two paths depending on the plugin negotiated:
│
│ mysql_native_password path:
│ └── Server → OK (permissive mode: username in allowlist → accepted)
│
│ caching_sha2_password path (MySQL 8.0+ default):
│ ├── Server → AuthMoreData(0x03) ← fast_auth_success indicator
│ ├── Client → empty ack packet ← pymysql sends this automatically
│ └── Server → OK
│
▼ COMMAND LOOP
│
├── COM_QUERY (0x03) → parse SQL → intercept? → execute → result packets
├── COM_PING (0x0e) → OK
├── COM_INIT_DB (0x02) → updates current_database in ConnectionState + OK
├── COM_RESET_CONNECTION (0x1f) → resets ConnectionState, preserves transport lifecycle metadata + OK
├── COM_STMT_PREPARE (0x16) → parse SQL with ? placeholders → stmt_ok packet
├── COM_STMT_SEND_LONG_DATA (0x18) → append raw bytes to stmt-local buffers, no reply
├── COM_STMT_EXECUTE (0x17) → merge long data + decode params → substitute → execute → result packets
├── COM_STMT_RESET (0x1a) → clear stmt-local long-data state → OK
├── COM_STMT_CLOSE (0x19) → remove from cache, no response
└── COM_QUIT (0x01) → close
Explicit lifecycle state machine (5.11c)
5.11c moved transport/runtime concerns out of ConnectionState into
mysql/lifecycle.rs. ConnectionState still owns SQL session variables,
prepared statements, warnings, and session counters. ConnectionLifecycle
owns only:
- current transport phase
- client capability flags relevant to lifecycle policy
- timeout policy per phase
- socket-level configuration (`TCP_NODELAY`, `SO_KEEPALIVE`)
| Phase | Entered when | Timeout policy |
|---|---|---|
| CONNECTED | socket accepted, before first packet | no read yet; greeting write uses auth timeout |
| AUTH | handshake/auth exchange starts | fixed 10s auth timeout for reads/writes |
| IDLE | between commands | interactive_timeout if CLIENT_INTERACTIVE, otherwise wait_timeout |
| EXECUTING | after a command packet is accepted | packet writes use net_write_timeout; any future in-flight reads use net_read_timeout |
| CLOSING | COM_QUIT, EOF, timeout, or transport error | terminal state before handler return |
COM_RESET_CONNECTION recreates ConnectionState::new() and resets session timeout
variables to their defaults, but it does not recreate ConnectionLifecycle. That
means the connection remains interactive or non-interactive according to the
original handshake, even after reset.
Prepared statements (prepared.rs)
Prepared statements allow a client to send SQL once and execute it many times with different parameters, avoiding repeated parsing and enabling binary parameter encoding that is more efficient than string escaping.
Protocol flow:
Client → COM_STMT_PREPARE (SQL with ? placeholders)
│
Server reads the SQL, counts ? placeholders, assigns a stmt_id.
│
Server → Statement OK packet
│ stmt_id: u32
│ num_columns: u16 (columns in the result set, or 0 for DML)
│ num_params: u16 (number of ? placeholders)
│ followed by num_params parameter-definition packets + EOF
│ followed by num_columns column-definition packets + EOF
│
Client → COM_STMT_SEND_LONG_DATA (optional, repeatable)
│ stmt_id: u32
│ param_id: u16
│ raw chunk bytes
│
Server appends raw bytes to stmt-local state, sends no response.
│
Client → COM_STMT_EXECUTE
│ stmt_id: u32
│ flags: u8 (0 = CURSOR_TYPE_NO_CURSOR)
│ iteration_count: u32 (always 1)
│ null_bitmap: ceil(num_params / 8) bytes (one bit per param)
│ new_params_bound_flag: u8 (1 = type list follows)
│ param_types: [u8; num_params * 2] (type byte + unsigned flag)
│ param_values: binary-encoded values for non-NULL params
│
Server → result set packets (same text-protocol format as COM_QUERY)
│
Client → COM_STMT_CLOSE (stmt_id) — no response expected
Binary parameter decoding (decode_binary_value):
Each parameter is decoded according to its MySQL type byte:
| MySQL type byte | Type name | Decoded as |
|---|---|---|
| 0x01 | TINY | i8 → Value::Int |
| 0x02 | SHORT | i16 → Value::Int |
| 0x03 | LONG | i32 → Value::Int |
| 0x08 | LONGLONG | i64 → Value::BigInt |
| 0x04 | FLOAT | f32 → Value::Real |
| 0x05 | DOUBLE | f64 → Value::Real |
| 0x0a | DATE | 4-byte packed date → Value::Date |
| 0x07 / 0x0c | TIMESTAMP / DATETIME | 7-byte packed datetime → Value::Timestamp |
| 0xfd / 0xfe / 0x0f | VAR_STRING / STRING / VARCHAR | lenenc bytes → Value::Text |
| 0xf9 / 0xfa / 0xfb / 0xfc | TINY_BLOB / MEDIUM_BLOB / LONG_BLOB / BLOB | lenenc bytes → Value::Bytes |
NULL parameters are identified by the null-bitmap before the type list is read;
they produce Value::Null without consuming any bytes from the value region.
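A sketch of this decoding order, covering just two of the type bytes above (the enum and function are illustrative, not the real decoder):

```rust
// Sketch of COM_STMT_EXECUTE parameter decoding: the null bitmap is
// consulted first (bit i of byte i/8), and NULL params consume no bytes
// from the value region. Only LONG (0x03) and LONGLONG (0x08) are shown.
#[derive(Debug, PartialEq)]
enum Value {
    Null,
    Int(i64),
    BigInt(i64),
}

fn decode_params(null_bitmap: &[u8], types: &[u8], mut values: &[u8]) -> Vec<Value> {
    let mut out = Vec::new();
    for (i, &ty) in types.iter().enumerate() {
        if null_bitmap[i / 8] & (1 << (i % 8)) != 0 {
            out.push(Value::Null); // NULL: no value bytes consumed
            continue;
        }
        match ty {
            0x03 => { // LONG: i32 little-endian
                let (head, rest) = values.split_at(4);
                values = rest;
                out.push(Value::Int(i32::from_le_bytes(head.try_into().unwrap()) as i64));
            }
            0x08 => { // LONGLONG: i64 little-endian
                let (head, rest) = values.split_at(8);
                values = rest;
                out.push(Value::BigInt(i64::from_le_bytes(head.try_into().unwrap())));
            }
            _ => unimplemented!("sketch covers LONG and LONGLONG only"),
        }
    }
    out
}

fn main() {
    // Three params: LONG=7, NULL (bit 1 of the bitmap set), LONGLONG=9.
    let bitmap = [0b0000_0010u8];
    let types = [0x03, 0x08, 0x08];
    let mut values = 7i32.to_le_bytes().to_vec();
    values.extend_from_slice(&9i64.to_le_bytes());
    assert_eq!(
        decode_params(&bitmap, &types, &values),
        vec![Value::Int(7), Value::Null, Value::BigInt(9)]
    );
}
```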
Long-data buffering (COM_STMT_SEND_LONG_DATA):
PreparedStatement owns stmt-local pending buffers:
```rust
pub struct PreparedStatement {
    // ...
    pub pending_long_data: Vec<Option<Vec<u8>>>,
    pub pending_long_data_error: Option<String>,
}
```
Rules:
- chunks are appended as raw bytes in `handler.rs`
- `COM_STMT_SEND_LONG_DATA` never takes the `Database` mutex
- the next `COM_STMT_EXECUTE` consumes pending long data before inline values
- long data wins over both the inline execute payload and the null bitmap
- state is cleared immediately after every execute attempt
- `COM_STMT_RESET` clears only this long-data state, not the cached plan
AxiomDB follows MariaDB’s COM_STMT_SEND_LONG_DATA model here: accumulate raw
bytes per placeholder and decode them only at execute time. That keeps chunked
multibyte text correct without dragging the command through the engine path.
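The merge rule at execute time can be sketched as follows (types and names illustrative, not the real handler code):

```rust
// Sketch of the execute-time merge described above: a parameter that
// received COM_STMT_SEND_LONG_DATA chunks uses the accumulated bytes and
// wins over both the inline value and the null bitmap; the pending state
// is cleared after every execute attempt.
#[derive(Debug, PartialEq)]
enum Param {
    Null,
    Bytes(Vec<u8>),
}

fn merge_params(
    pending_long_data: &mut Vec<Option<Vec<u8>>>,
    inline: Vec<Param>,
) -> Vec<Param> {
    let merged = inline
        .into_iter()
        .enumerate()
        .map(|(i, inline_val)| match pending_long_data.get_mut(i).and_then(Option::take) {
            Some(bytes) => Param::Bytes(bytes), // long data wins
            None => inline_val,
        })
        .collect();
    // State is cleared after every execute attempt.
    pending_long_data.iter_mut().for_each(|slot| *slot = None);
    merged
}

fn main() {
    // Param 0 got chunks via SEND_LONG_DATA; param 1 arrived inline.
    let mut pending = vec![Some(b"hello world".to_vec()), None];
    let merged = merge_params(&mut pending, vec![Param::Null, Param::Bytes(b"42".to_vec())]);
    assert_eq!(merged[0], Param::Bytes(b"hello world".to_vec()));
    assert_eq!(merged[1], Param::Bytes(b"42".to_vec()));
    assert!(pending.iter().all(Option::is_none)); // cleared after execute
}
```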
Parameter substitution — AST-level plan cache (substitute_params_in_ast):
COM_STMT_PREPARE runs parse + analyze once and stores the resulting Stmt in
PreparedStatement.analyzed_stmt. On each COM_STMT_EXECUTE, substitute_params_in_ast
walks the cached AST and replaces every Expr::Param { idx } node with
Expr::Literal(params[idx]) in a single O(n) tree walk (~1 µs), then calls
execute_stmt() directly — bypassing parse and analyze entirely.
The ? token is recognized by the lexer as Token::Question and emitted by the parser
as Expr::Param { idx: N } (0-based position). The semantic analyzer passes Expr::Param
through unchanged because the type is not yet known; type resolution happens at execute
time once the binary-encoded parameter values are decoded from the COM_STMT_EXECUTE
packet.
value_to_sql_literal converts each decoded Value to the appropriate Expr::Literal
variant:
- `Value::Null` → `Expr::Literal(Value::Null)`
- `Value::Int` / `BigInt` / `Real` → numeric literal node
- `Value::Text` → text literal node (single-quote escaping preserved at the protocol boundary, not needed in the AST)
- `Value::Date` / `Timestamp` → date/timestamp literal node
An earlier implementation substituted the decoded values textually into the
? markers in the original SQL text and then ran the full parse + analyze
pipeline on each COM_STMT_EXECUTE call (~1.5 µs per execution). Phase 5.13
replaces this with an AST-level plan cache: parse + analyze run once at
COM_STMT_PREPARE time; each execute performs only a tree walk to splice in
the decoded parameter values (~1 µs). MySQL and PostgreSQL use the same strategy —
parsing and planning are separated from execution precisely so that repeated executions
avoid repeated parse overhead.
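The splice itself can be sketched against a toy AST; the real `Expr` enum is much richer:

```rust
// Sketch of the AST-level parameter splice described above: replace every
// Param { idx } node with the decoded literal in one recursive walk. A toy
// Expr enum stands in for axiomdb-sql's real AST.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Param { idx: usize },
    BinaryOp(Box<Expr>, Box<Expr>), // operator elided for brevity
}

fn substitute_params(expr: &mut Expr, params: &[i64]) {
    if let Expr::Param { idx } = expr {
        let i = *idx;
        *expr = Expr::Literal(params[i]);
        return;
    }
    if let Expr::BinaryOp(lhs, rhs) = expr {
        substitute_params(lhs, params);
        substitute_params(rhs, params);
    }
}

fn main() {
    // WHERE user_id = ?  with params = [42]
    let mut expr = Expr::BinaryOp(
        Box::new(Expr::Literal(0)), // stands in for Column("user_id")
        Box::new(Expr::Param { idx: 0 }),
    );
    substitute_params(&mut expr, &[42]);
    assert_eq!(
        expr,
        Expr::BinaryOp(Box::new(Expr::Literal(0)), Box::new(Expr::Literal(42)))
    );
}
```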
COM_STMT_EXECUTE responses use the same text-protocol result-set format as
COM_QUERY (column defs + EOF + text-encoded rows + EOF), not the MySQL
binary result-set format. The binary result-set format requires a separate
CLIENT_PS_MULTI_RESULTS serialization path for every column type and adds
substantial protocol complexity with marginal benefit for typical workloads. The
text-protocol response is fully accepted by PyMySQL, SQLAlchemy, and the mysql
CLI. Binary result-set serialization is deferred to subphase 5.5a when a concrete
performance need arises.
ConnectionState — per-connection session state:
```rust
pub struct ConnectionState {
    pub current_database: String,
    pub autocommit: bool,
    // Typed charset state — negotiated at handshake, updated by SET NAMES
    client_charset: &'static CharsetDef,
    connection_collation: &'static CollationDef,
    results_collation: &'static CollationDef,
    pub variables: HashMap<String, String>,
    pub prepared_statements: HashMap<u32, PreparedStatement>,
    pub next_stmt_id: u32,
}
```
The three charset fields are typed references into the static charset.rs registry.
from_handshake_collation_id(id: u8) initializes all three from the collation id the
client sends in the HandshakeResponse41 packet. Unsupported ids are rejected before
auth with ERR 1115 (ER_UNKNOWN_CHARACTER_SET). SET NAMES <charset> updates all three;
individual SET character_set_client = … updates only the relevant field.
decode_client_text(&[u8]) -> Result<String, DbError> decodes inbound SQL/identifiers.
encode_result_text(&str) -> Result<Vec<u8>, DbError> encodes outbound text columns.
Both are non-lossy — they return DbError::InvalidValue rather than replacement characters.
This mirrors PostgreSQL's client_encoding / server-encoding split, but without the
per-column collation complexity that PostgreSQL adds. All AxiomDB storage is UTF-8;
charset negotiation is purely a wire-layer concern.
```rust
pub struct PreparedStatement {
    pub stmt_id: u32,
    pub sql_template: String,        // original SQL with ? placeholders
    pub param_count: u16,
    pub analyzed_stmt: Option<Stmt>, // cached parse+analyze result (plan cache)
    pub compiled_at_version: u64,    // global schema_version at compile time
    pub deps: PlanDeps,              // per-table OID dependencies (Phase 40.2)
    pub generation: u32,             // incremented on each re-analysis
    pub last_used_seq: u64,
    pub pending_long_data: Vec<Option<Vec<u8>>>,
    pub pending_long_data_error: Option<String>,
}
```
analyzed_stmt is populated by COM_STMT_PREPARE after parse + analyze succeed. On
COM_STMT_EXECUTE, if analyzed_stmt is Some, the handler calls
substitute_params_in_ast on the cached Stmt and invokes execute_stmt() directly,
skipping the parse and analyze steps entirely. If analyzed_stmt is None (should not
occur in normal operation), the handler falls back to the full parse + analyze path.
OID-based staleness check (Phase 40.2):
COM_STMT_EXECUTE uses a two-level check:
- Fast (O(1) atomic compare) — if `compiled_at_version == current_global_schema_version`, no DDL has occurred since compile → skip the catalog scan entirely (zero I/O).
- Slow (O(t) catalog reads, t = tables in deps) — only when the global version has advanced. `PlanDeps::is_stale()` reads each table's current `schema_version` from the catalog heap and compares it to the cached snapshot. If all match → the DDL was on a different table → stamp the new global version and skip re-analysis.
This avoids re-analyzing prepared statements when CREATE INDEX ON other_table runs —
only statements that actually reference the DDL-modified table are re-compiled. PostgreSQL
uses the same approach via RelationOids in CachedPlanSource.
Each connection maintains its own HashMap<u32, PreparedStatement>. Statement IDs are
assigned by incrementing next_stmt_id (starting at 1) and are local to the connection
— the same ID on two connections refers to two different statements. COM_STMT_CLOSE
removes the entry; subsequent COM_STMT_EXECUTE calls for the closed ID return an
Unknown prepared statement error. COM_STMT_RESET leaves the entry in place and
clears only the stmt-local long-data buffers plus any deferred long-data error.
Packet framing and size enforcement (codec.rs — subphase 5.4a)
Every MySQL message in both directions — client to server and server to client — uses the same 4-byte envelope:
[payload_length: u24 LE] [sequence_id: u8] [payload: payload_length bytes]
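The envelope above can be sketched as a pair of helpers (names illustrative, not the codec's actual API):

```rust
// MySQL packet envelope: 3-byte little-endian payload length (u24)
// followed by a 1-byte sequence id.
fn encode_header(payload_len: u32, seq_id: u8) -> [u8; 4] {
    let b = payload_len.to_le_bytes(); // u24: only the low three bytes are used
    [b[0], b[1], b[2], seq_id]
}

fn decode_header(h: [u8; 4]) -> (u32, u8) {
    let payload_len = u32::from_le_bytes([h[0], h[1], h[2], 0]);
    (payload_len, h[3])
}

fn main() {
    // A 261-byte payload with sequence id 3.
    let h = encode_header(0x105, 3);
    assert_eq!(h, [0x05, 0x01, 0x00, 3]);
    assert_eq!(decode_header(h), (0x105, 3));
}
```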
MySqlCodec implements tokio_util::codec::{Decoder, Encoder}. It holds a
configurable max_payload_len (default 64 MiB) that matches the session variable
@@max_allowed_packet.
Two-phase decoder algorithm:
- Scan phase — walk physical packet headers without consuming bytes, accumulating `total_payload`. If `total_payload > max_payload_len`, return `MySqlCodecError::PacketTooLarge { actual, max }` before any buffer allocation. If any fragment is missing, return `Ok(None)` (backpressure).
- Consume phase — advance the buffer and return `(seq_id, Bytes)`. For a single physical fragment this is a zero-copy `split_to` into the existing `BytesMut`. For multi-fragment logical packets one contiguous `BytesMut` is allocated with `capacity = total_payload` to avoid per-fragment copies.
Multi-packet reassembly. MySQL splits commands larger than 16,777,215 bytes
(0xFF_FFFF) across multiple physical packets. A fragment with
payload_length = 0xFF_FFFF signals continuation; the final fragment has
payload_length < 0xFF_FFFF. The limit applies to the reassembled logical payload,
not to each individual fragment.
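The fragmentation rule can be sketched as follows. Note the protocol edge case: a payload that is an exact multiple of `0xFF_FFFF` must be terminated by an empty fragment, because a full-size fragment always signals continuation:

```rust
// MySQL splits a logical payload into physical fragments of at most
// 0xFF_FFFF bytes. A fragment of exactly 0xFF_FFFF signals "more follows",
// so the final fragment is always strictly smaller.
const MAX_FRAGMENT: usize = 0xFF_FFFF;

fn fragment_sizes(total: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut rest = total;
    loop {
        let take = rest.min(MAX_FRAGMENT);
        out.push(take);
        rest -= take;
        if take < MAX_FRAGMENT {
            break; // final fragment is < 0xFF_FFFF (possibly empty)
        }
    }
    out
}

fn main() {
    assert_eq!(fragment_sizes(100), vec![100]);
    // Exact multiple: a trailing empty fragment terminates the sequence.
    assert_eq!(fragment_sizes(MAX_FRAGMENT), vec![MAX_FRAGMENT, 0]);
    assert_eq!(fragment_sizes(MAX_FRAGMENT + 1), vec![MAX_FRAGMENT, 1]);
}
```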
Live per-connection limit. handle_connection calls
reader.decoder_mut().set_max_payload_len(n):
- After auth (from `conn_state.max_allowed_packet_bytes()`)
- After a valid `SET max_allowed_packet = N`
- After `COM_RESET_CONNECTION` (restores `DEFAULT_MAX_ALLOWED_PACKET`)
Oversize behavior. On PacketTooLarge, the handler sends MySQL ERR
1153 / SQLSTATE 08S01 (“Got a packet bigger than ‘max_allowed_packet’ bytes”) and
breaks the connection loop. The stream is never re-used — re-synchronisation after an
oversize packet is unsafe.
The size check runs inside `MySqlCodec::decode()`, before the payload reaches
UTF-8 decoding, SQL parsing, or binary-protocol decoding. MySQL 8 and MariaDB enforce
max_allowed_packet at the network I/O layer for the same reason: a SQL
parser that receives an oversized payload has already spent the memory allocating it.
Rejecting at the codec boundary means zero heap allocation for oversized inputs.
Result set serialization (result.rs — subphase 5.5a)
AxiomDB has two result serializers sharing the same column_count + column_defs + EOF
framing but differing in row encoding:
| Serializer | Used for | Row format |
|---|---|---|
| `serialize_query_result` | COM_QUERY | Text protocol — NULL = `0xfb`, values as lenenc ASCII strings |
| `serialize_query_result_binary` | COM_STMT_EXECUTE | Binary protocol — null bitmap + fixed-width/lenenc values |
Both paths produce the same packet sequence shape:
column_count (lenenc integer)
column_def_1 (lenenc strings: catalog, schema, table, org_table, name, org_name
+ 12-byte fixed section: charset, display_len, type_byte, flags, decimals)
…
column_def_N
EOF
row_1
…
row_M
EOF
Binary row packet layout:
0x00 row header (always)
null_bitmap[ceil((N+2)/8)] MySQL offset-2 null bitmap: column i → bit (i+2)
value_0 ... value_k non-null values in column order (no per-cell headers)
The null bitmap uses MySQL’s prepared-row offset of 2 — bits 0 and 1 are reserved. Column 0 → bit 2, column 1 → bit 3, and so on.
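The offset-2 bitmap described above can be sketched directly:

```rust
// Binary-protocol null bitmap with MySQL's result-row offset of 2:
// column i maps to bit (i + 2); bits 0 and 1 are reserved.
fn null_bitmap(nulls: &[bool]) -> Vec<u8> {
    let n = nulls.len();
    let mut bitmap = vec![0u8; (n + 2 + 7) / 8]; // ceil((N + 2) / 8) bytes
    for (i, &is_null) in nulls.iter().enumerate() {
        if is_null {
            let bit = i + 2;
            bitmap[bit / 8] |= 1 << (bit % 8);
        }
    }
    bitmap
}

fn main() {
    // 3 columns, column 1 is NULL → bit 3 set → 0b0000_1000.
    assert_eq!(null_bitmap(&[false, true, false]), vec![0b0000_1000]);
    // 7 columns need ceil(9 / 8) = 2 bytes.
    assert_eq!(null_bitmap(&[false; 7]).len(), 2);
}
```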
Binary cell encoding per type:
| AxiomDB type | Encoding |
|---|---|
| `Bool` | 1 byte: `0x00` or `0x01` |
| `Int` | 4-byte signed LE |
| `BigInt` | 8-byte signed LE |
| `Real` | 8-byte IEEE-754 LE (`f64`) |
| `Decimal` | lenenc ASCII decimal string (exact, no float rounding) |
| `Text` | lenenc UTF-8 bytes |
| `Bytes` | lenenc raw bytes (no UTF-8 conversion) |
| `Date` | `[4][year u16 LE][month u8][day u8]` |
| `Timestamp` | `[7][year u16 LE][month][day][h][m][s]` or `[11][...][micros u32 LE]` |
| `Uuid` | lenenc canonical UUID string |
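As a worked instance of the table above, the binary `Date` cell is a 1-byte length (4) followed by the fixed-width fields (helper name illustrative):

```rust
// Binary-protocol DATE cell: [4][year u16 LE][month u8][day u8].
fn encode_binary_date(year: u16, month: u8, day: u8) -> Vec<u8> {
    let y = year.to_le_bytes();
    vec![4, y[0], y[1], month, day]
}

fn main() {
    // 2024-03-15 → [4, 0xE8, 0x07, 3, 15], since 2024 = 0x07E8.
    assert_eq!(encode_binary_date(2024, 3, 15), vec![4, 0xE8, 0x07, 3, 15]);
}
```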
Column type codes (shared between both serializers):
| AxiomDB type | MySQL type byte | MySQL name |
|---|---|---|
| `Int` | 0x03 | LONG |
| `BigInt` | 0x08 | LONGLONG |
| `Real` | 0x05 | DOUBLE |
| `Decimal` | 0xf6 | NEWDECIMAL |
| `Text` | 0xfd | VAR_STRING |
| `Bytes` | 0xfc | BLOB |
| `Bool` | 0x01 | TINY |
| `Date` | 0x0a | DATE |
| `Timestamp` | 0x07 | TIMESTAMP |
| `Uuid` | 0xfd | VAR_STRING |
Both serializers share a single `build_column_def()` function
and one `datatype_to_mysql_type()` mapping. This guarantees that the type
byte in column metadata always agrees with the wire encoding of the row values.
A divergence (e.g., advertising LONGLONG but sending ASCII digits) would
cause silent data corruption on the client — a class of bug that is impossible when
there is only one mapping.
COM_QUERY OID-based plan cache (plan_cache.rs — Phase 40.2)
Repeated ad-hoc queries like SELECT * FROM users WHERE id = 42 arrive with different
literal values on each call. The plan cache normalizes literals to ? placeholders,
hashes the result, and caches the fully analyzed AST. Subsequent queries with the same
structure (e.g., id = 99) skip parse + analyze (~5 µs) and reuse the cached Stmt.
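A toy sketch of the normalization step: fold standalone integer literals into `?` so structurally identical queries produce the same cache key. The real normalizer operates on the token stream and also handles strings, floats, and negative numbers; this simplified character-level version only folds digit runs that do not extend an identifier:

```rust
// Simplified literal normalization for the plan-cache key.
fn normalize(sql: &str) -> String {
    let mut out = String::new();
    let mut prev_ident = false; // previous char could extend an identifier
    let mut chars = sql.chars().peekable();
    while let Some(c) = chars.next() {
        if c.is_ascii_digit() && !prev_ident {
            // Consume the whole digit run and emit one placeholder.
            while chars.peek().is_some_and(|d| d.is_ascii_digit()) {
                chars.next();
            }
            out.push('?');
            prev_ident = false;
        } else {
            prev_ident = c.is_ascii_alphanumeric() || c == '_';
            out.push(c);
        }
    }
    out
}

fn main() {
    // Same structure, different literals → same cache key.
    assert_eq!(
        normalize("SELECT * FROM users WHERE id = 42"),
        normalize("SELECT * FROM users WHERE id = 99"),
    );
    // Digits inside identifiers are untouched.
    assert_eq!(normalize("SELECT a1 FROM t2"), "SELECT a1 FROM t2");
}
```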
Entry structure (CachedPlanSource):
```rust
struct CachedPlanSource {
    stmt: Stmt,           // fully analyzed AST
    deps: PlanDeps,       // (table_id, schema_version) per referenced table
    param_count: usize,   // expected literal count for structural match
    generation: u32,      // incremented on each re-store after stale eviction
    exec_count: u64,      // lifetime hit counter
    last_used_seq: u64,   // LRU clock value
    last_validated_global_version: u64, // fast pre-check stamp
}
```
Two-level staleness check:
- Fast (`O(1)`): if `global_schema_version == last_validated_global_version`, no DDL has occurred since last validation → cache hit with zero catalog I/O.
- Slow (`O(t)` catalog reads): called only when the global version advanced. `PlanDeps::is_stale()` reads each table's current `schema_version` from the catalog heap and compares to the cached snapshot. If any dep mismatches → evict. If all match → stamp the new global version (future lookups hit the fast path again).
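The two-level check can be sketched with simplified in-memory types (the struct and function here are illustrative stand-ins for `CachedPlanSource` and `PlanDeps::is_stale()`):

```rust
use std::collections::HashMap;

struct CachedPlan {
    deps: Vec<(u64, u64)>, // (table_id, schema_version at compile time)
    last_validated_global_version: u64,
}

/// Returns true if the cached plan is still usable.
fn validate(plan: &mut CachedPlan, global: u64, catalog: &HashMap<u64, u64>) -> bool {
    if plan.last_validated_global_version == global {
        return true; // fast path: no DDL anywhere since last validation
    }
    // Slow path: O(t) catalog reads over this plan's dependencies.
    let fresh = plan
        .deps
        .iter()
        .all(|&(tid, ver)| catalog.get(&tid) == Some(&ver));
    if fresh {
        plan.last_validated_global_version = global; // re-stamp
    }
    fresh
}

fn main() {
    let mut catalog = HashMap::from([(1u64, 7u64), (2, 3)]);
    let mut plan = CachedPlan { deps: vec![(1, 7)], last_validated_global_version: 10 };
    assert!(validate(&mut plan, 10, &catalog)); // fast path hit
    catalog.insert(2, 4); // DDL on an unrelated table advances global to 11
    assert!(validate(&mut plan, 11, &catalog)); // slow path, deps still match
    catalog.insert(1, 8); // DDL on table 1
    assert!(!validate(&mut plan, 12, &catalog)); // stale → evict
}
```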
Belt-and-suspenders invalidation:
- Lazy (primary): `is_stale()` at lookup time catches cross-connection DDL.
- Eager (secondary): `invalidate_table(table_id)`, called immediately after same-connection DDL, removes all entries whose `deps` include `table_id`. DDL functions in `executor/ddl.rs` also call `bump_table_schema_version(table_id)` via `CatalogWriter` so the per-table counter advances regardless of which connection holds the plan.
OID dependency extraction (plan_deps.rs):
extract_table_deps(stmt, catalog_reader, database) walks the analyzed Stmt and
resolves every table reference to its (TableId, schema_version) at compile time:
- `SELECT` — FROM, JOINs, scalar subqueries in WHERE/HAVING/columns/ORDER BY/GROUP BY
- `INSERT … SELECT` — target table + all tables in the SELECT
- `UPDATE`, `DELETE` — target table + subqueries in WHERE
- `EXPLAIN` — recurses into the wrapped statement
- DDL statements — return an empty `PlanDeps` (never cached)
LRU eviction: when max_entries (512) is reached, the entry with the lowest
last_used_seq is evicted. O(n) scan over ≤512 entries — called only on capacity overflow,
never on the hot lookup path.
PostgreSQL's plancache.c uses per-entry RelationOids to limit invalidation
to plans that reference the modified table. AxiomDB mirrors that approach:
a CREATE INDEX ON users(email) evicts only plans that reference users
— plans on orders, products, and other tables survive untouched.
ORM query interception (handler.rs)
MySQL drivers and ORMs send several queries automatically before any user SQL:
SET NAMES, SET autocommit, SELECT @@version, SELECT @@version_comment,
SELECT DATABASE(), SELECT @@sql_mode, SELECT @@lower_case_table_names,
SELECT @@max_allowed_packet, SHOW WARNINGS, SHOW DATABASES.
intercept_special_query matches these by prefix/content and returns pre-built
packet sequences without touching the engine. Without this interception, most clients
fail to connect because they receive ERR packets for mandatory queries.
ON_ERROR session behavior (executor.rs, database.rs, subphase 5.2c)
ON_ERROR is implemented as one typed session enum shared by both layers that
own statement execution:
| Layer | State owner | Responsibility |
|---|---|---|
| SQL executor | SessionContext.on_error | Controls rollback policy for executor-time failures |
| Wire/session layer | ConnectionState.on_error | Exposes SET on_error, @@on_error, SHOW VARIABLES, and reset semantics |
This split is required by the current AxiomDB architecture. handler.rs
intercepts SET and SELECT @@var before the engine, but database.rs owns
the full parse -> analyze -> execute_with_ctx pipeline. A wire-only flag would
leave embedded execution inconsistent; an executor-only flag would make the MySQL
session variables lie.
Execution modes:
| Mode | Active transaction error | First failing DML with autocommit=0 | Parse/analyze failure |
|---|---|---|---|
| `rollback_statement` | rollback to statement boundary, txn stays open | full rollback, txn closes | return ERR, txn state unchanged |
| `rollback_transaction` | eager full rollback, txn closes | eager full rollback, txn closes | eager full rollback if txn active |
| `savepoint` | same as `rollback_statement` | keep implicit txn open after rolling back the failing DML | return ERR, txn state unchanged |
| `ignore` | ignorable SQL errors -> warning + continue; non-ignorable runtime errors -> eager full rollback + ERR | ignorable SQL errors -> warning + continue; non-ignorable runtime errors -> eager full rollback + ERR | same split as active txn |
ignore reuses the existing SHOW WARNINGS path. For ignorable SQL/user
errors, database.rs maps the original DbError to the corresponding MySQL
warning code/message and returns QueryResult::Empty, which the serializer
turns into an OK packet with warning_count > 0. For non-ignorable errors
(DiskFull, WAL failures, storage/runtime corruption), the error still
surfaces as ERR and the transaction is eagerly rolled back if one is active.
SHOW STATUS — server and session counters (status.rs, subphase 5.9c)
MySQL clients, ORMs, and monitoring tools (PMM, Datadog MySQL integration, ProxySQL)
call SHOW STATUS on connect or periodically to query server health. Returning an
error or empty result breaks these integrations.
Counter architecture:
Two independent counter stores keep telemetry decoupled from correctness:
| Store | Type | Scope | Reset policy |
|---|---|---|---|
| `StatusRegistry` | `Arc<StatusRegistry>` with `AtomicU64` fields | Server-wide, shared across all connections | Only on server restart |
| `SessionStatus` | Plain `u64` fields inside `ConnectionState` | Per-connection | On `COM_RESET_CONNECTION` (which recreates `ConnectionState`) |
Database owns an Arc<StatusRegistry>. Each handle_connection task clones
the Arc once at connect time — the same pattern used by schema_version. The
SHOW STATUS intercept never acquires the Database mutex; it reads directly
from the cloned Arc<StatusRegistry> and the local SessionStatus. This means
the query cannot block other connections.
RAII guards:
```rust
// Increments threads_connected +1 after auth; drops −1 on disconnect (even on error).
let _connected_guard = ConnectedGuard::new(Arc::clone(&status));

// Increments threads_running +1 for the duration of COM_QUERY / COM_STMT_EXECUTE.
let _running = RunningGuard::new(&status);
```
threads_connected and threads_running are always accurate with no manual bookkeeping
because Rust’s drop guarantees run on early returns and panics.
Counters tracked:
| Variable name | Scope | Description |
|---|---|---|
| `Bytes_received` | Session + Global | Bytes received from client (payload + 4-byte header) |
| `Bytes_sent` | Session + Global | Bytes sent to client |
| `Com_insert` | Session + Global | INSERT statement count |
| `Com_select` | Session + Global | SELECT statement count |
| `Innodb_buffer_pool_read_requests` | Global | Best-effort mmap access counter |
| `Innodb_buffer_pool_reads` | Global | Physical page reads (compatibility alias) |
| `Questions` | Session + Global | All statements executed (any command type) |
| `Threads_connected` | Global | Active authenticated connections |
| `Threads_running` | Session + Global | Connections actively executing a command |
| `Uptime` | Global | Seconds since server start |
SHOW STATUS syntax:
All four MySQL-compatible forms are intercepted before hitting the engine:
```sql
SHOW STATUS
SHOW SESSION STATUS
SHOW LOCAL STATUS
SHOW GLOBAL STATUS

-- Any of the above with LIKE filter:
SHOW STATUS LIKE 'Com_%'
SHOW GLOBAL STATUS LIKE 'Threads%'
```
LIKE filtering reuses like_match from axiomdb-sql (proper % / _ wildcard
semantics, case-insensitive against variable names). Results are always returned in
ascending alphabetical order.
SHOW STATUS reads AtomicU64 counters directly from a cloned
Arc — it never acquires the Database mutex. MySQL InnoDB
reads status from the engine layer, which requires acquiring internal mutexes under
high concurrency. AxiomDB's design means monitoring queries cannot interfere with
query execution at any load level.
DB lock strategy
The MySQL handler stores the opened engine in Arc<tokio::sync::RwLock<Database>>.
- read-only statements acquire `db.read()`
- mutating statements and transaction control acquire `db.write()`
- multiple reads run concurrently
- all writes are still serialized at whole-database granularity
This is the current runtime model. It is more advanced than the old Phase 5
Mutex<Database> design because read-only queries can now overlap, but it is
still below MySQL/InnoDB and PostgreSQL for write concurrency because row-level
locking is not implemented yet.
AxiomDB implements the full mysql_native_password SHA1 challenge-response
handshake (the same algorithm used by MySQL 5.x clients) but ignores the password
result for users in the allowlist (root, axiomdb, admin).
This lets any MySQL-compatible client connect during development without credential
management. The verify_native_password function is fully correct — it is
called and its result logged — but the decision to accept or reject is based solely
on the username allowlist until Phase 13 (Security) adds stored credentials and real
enforcement.
caching_sha2_password (MySQL 8.0+)
MySQL 8.0 changed the default authentication plugin from mysql_native_password to
caching_sha2_password. When a client using the new default (e.g., PyMySQL ≥ 1.0,
MySQL Connector/Python, mysql2 for Ruby) connects, the server must complete a 5-packet
handshake instead of the 3-packet one:
| Seq | Direction | Packet | Notes |
|---|---|---|---|
| 0 | S → C | HandshakeV10 | includes 20-byte challenge |
| 1 | C → S | HandshakeResponse41 | auth_plugin_name = "caching_sha2_password" |
| 2 | S → C | AuthMoreData(0x03) | fast_auth_success — byte 0x03 signals that password verification is skipped in permissive mode |
| 3 | C → S | empty ack | client acknowledges the fast-auth signal before expecting OK |
| 4 | S → C | OK | connection established |
The critical implementation detail is that the ack packet at seq=3 must be read
before sending OK. If the server sends OK at seq=2 instead, the client has already
queued the empty ack in response to AuthMoreData(fast_auth_success); that buffered
ack then arrives in the command loop, where it is misread as a command (command byte
0x00 = COM_SLEEP, or simply an unknown command), and the connection closes silently —
no error is reported to the application. The fix is one extra read_packet() call
before writing OK.
axiomdb-server
Entry point for server mode. Parses CLI flags (--data-dir, --port), opens the
axiomdb-network::Database, starts a Tokio TCP listener, and spawns one
handle_connection task per accepted connection, passing each task a clone of the
Arc<RwLock<Database>>.
axiomdb-embedded
Entry point for embedded mode. Exposes:
- A safe Rust API (`Database::open`, `Database::execute`, `Database::transaction`)
- A C FFI (`axiomdb_open`, `axiomdb_execute`, `axiomdb_close`, `axiomdb_free_string`)
Query Lifecycle — From Wire to Storage
1. TCP bytes arrive on the socket
│
2. axiomdb-network::mysql::codec::MySqlCodec decodes the 4-byte header
→ (sequence_id, payload)
│
3. handler.rs inspects payload[0] (command byte)
├── 0x01 COM_QUIT → close
├── 0x02 COM_INIT_DB → OK
├── 0x0e COM_PING → OK
├── 0x16 COM_STMT_PREPARE → parse + analyze → store in PreparedStatement.analyzed_stmt → stmt_ok
├── 0x17 COM_STMT_EXECUTE → substitute_params_in_ast(cached_stmt, params) → execute_stmt() ↓ (step 9)
└── 0x03 COM_QUERY → continue ↓
│
4. intercept_special_query(sql) — ORM/driver stubs
├── match → return pre-built packet sequence (no engine call)
└── no match → continue ↓
│
5. db.read() / db.write() → execute_query(sql, &mut session)
│
6. axiomdb-sql::tokenize(sql)
→ Vec<SpannedToken> (logos DFA, zero-copy)
│
7. axiomdb-sql::parse(tokens)
→ Stmt (recursive descent; all col_idx = placeholder 0)
│
8. axiomdb-sql::analyze(stmt, storage, snapshot)
→ Stmt (col_idx resolved against catalog; names validated)
│
9. Executor interprets the analyzed Stmt
→ reads from axiomdb-index (BTree lookups / range scans)
→ calls axiomdb-types::decode_row on heap page bytes
→ builds Vec<Vec<Value>> result rows
│
10. WAL write (for INSERT / UPDATE / DELETE)
→ axiomdb-wal::WalWriter::append(WalEntry)
│
11. Heap page write (for INSERT / UPDATE / DELETE)
→ axiomdb-storage::StorageEngine::write_page
│
12. db read/write guard released
│
13. result::serialize_query_result(QueryResult, seq=1)
→ column_count + column_defs + EOF + rows + EOF (Rows)
→ OK packet with affected_rows + last_insert_id (Affected)
│
14. MySqlCodec encodes each packet with 4-byte header → TCP send
For embedded mode, steps 1–4 and 12–14 are replaced by a direct Rust function call
that returns a QueryResult struct.
Key Architectural Decisions
mmap over a custom buffer pool
AxiomDB maps the .db file with mmap. The OS page cache manages eviction (LRU) and
readahead automatically. InnoDB maintains a separate buffer pool on top of the OS page
cache, causing the same data to live in RAM twice. mmap eliminates the second copy.
Trade-off: we give up fine-grained control over eviction policy. The OS uses LRU, which is good for most database workloads. Custom eviction (e.g., clock-sweep with hot/cold separation) will be optional in a future phase.
Copy-on-Write B+ Tree
CoW means a write operation never modifies an existing page in place. Instead, it creates new pages for every node on the path from root to the modified leaf, then atomically swaps the root pointer. Readers who loaded the old root before the swap continue accessing a fully consistent old version with no locking.
Trade-off: writes amplify — modifying one leaf requires copying O(log n) pages. For a tree of depth 4 (enough for hundreds of millions of rows), this is 4 page copies per write. At 16 KB per page, that is 64 KB of write amplification per key insert.
WAL without double-write
The WAL records logical changes (key, old_value, new_value) rather than full page images. Each WAL record has a CRC32c checksum. On recovery, AxiomDB reads the WAL forward, identifies committed transactions, and replays their mutations. Pages with incorrect checksums are rebuilt from WAL records.
This eliminates MySQL’s doublewrite buffer (which writes each page twice to protect against torn writes) at the cost of a slightly more complex recovery algorithm.
logos for lexing, not nom
logos generates a compiled DFA from the token patterns at build time. The generated lexer runs in O(n) time with a fixed, small constant (typically 1–3 CPU instructions per byte). nom builds parser combinators at runtime with dynamic dispatch overhead. For a lexer processing millions of SQL statements per second, the constant factor matters: logos achieves 9–17× throughput over sqlparser-rs’s nom-based lexer.
Storage Engine
The storage engine is the lowest user-accessible layer in AxiomDB. It manages raw 16-kilobyte pages on disk or in memory, provides a freelist for page allocation, and exposes a simple trait that all higher layers depend on.
The StorageEngine Trait
```rust
pub trait StorageEngine: Send + Sync {
    fn read_page(&self, page_id: u64) -> Result<PageRef, DbError>;
    fn write_page(&self, page_id: u64, page: &Page) -> Result<(), DbError>;
    fn alloc_page(&self, page_type: PageType) -> Result<u64, DbError>;
    fn free_page(&self, page_id: u64) -> Result<(), DbError>;
    fn flush(&self) -> Result<(), DbError>;
    fn page_count(&self) -> u64;
    fn prefetch_hint(&self, start_page_id: u64, count: u64) { ... }
    fn set_current_snapshot(&self, snapshot_id: u64) { ... }
    fn deferred_free_count(&self) -> usize { ... }
}
```
All methods take &self — there is no &mut self anywhere in the trait. Mutable state
is managed entirely through interior mutability:
- `write_page`: acquires a per-page exclusive `RwLock` (from `PageLockTable`) for the duration of the `pwrite(2)` call. Two transactions writing different pages proceed in full parallelism with zero contention.
- `alloc_page`: acquires `Mutex<FreeList>` only during the bitmap scan (microseconds), then acquires the page lock to initialise the new page.
- `free_page`: acquires `Mutex<FreeList>` briefly to add the page to the free bitmap.
- `flush`: acquires `Mutex<FreeList>` to persist the freelist, then calls `fdatasync`.
This design mirrors InnoDB (buf_page_get_gen with per-page block_lock, no &mut on
the buffer pool) and PostgreSQL (per-buffer atomic state field, MarkBufferDirty is
&self-equivalent).
Both engines converged on the same shape: a `&self`-equivalent buffer pool with per-page locks.
AxiomDB follows the same pattern: a sharded PageLockTable (64 shards, one
RwLock<HashMap<u64, Arc<RwLock<()>>>> per shard) eliminates the global
&mut self bottleneck and is the architectural unlock for concurrent writer support
in phases 40.4–40.12.
read_page returns an owned PageRef — a heap-allocated copy of the 16 KB page data.
This is a deliberate change from the original &Page borrow: owned pages survive mmap
remaps (during grow()) and page reuse (after free_page), which is essential for
concurrent read/write access. The copy cost is ~0.5 µs from L2/L3 cache — the same cost
PostgreSQL pays when copying a page from the buffer pool into backend-local memory.
Page Format
Every page is exactly PAGE_SIZE = 16,384 bytes (16 KB). The first HEADER_SIZE = 64
bytes are the page header; the remaining PAGE_BODY_SIZE = 16,320 bytes are the body.
Page Header — 64 bytes
Offset Size Field Description
──────── ────── ──────────────── ──────────────────────────────────────
0 8 magic `PAGE_MAGIC` — identifies valid pages
8 1 page_type PageType enum (see below)
9 1 flags page flags (`PAGE_FLAG_ALL_VISIBLE`, future bits)
10 2 item_count item/slot count for the page-local format
12 4 checksum CRC32c of body bytes `[HEADER_SIZE..PAGE_SIZE]`
16 8 page_id This page's own ID (self-identifying)
24 8 lsn Log Sequence Number of last write
32 2 free_start First free byte offset in the body (format-specific)
34 2 free_end Last free byte offset in the body (format-specific)
36 28 _reserved Future use
Total: 64 bytes
The CRC32c checksum covers only the page body [HEADER_SIZE..PAGE_SIZE], not the
header itself. On every read_page, AxiomDB verifies the checksum and returns
DbError::ChecksumMismatch if it fails.
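CRC32c (the Castagnoli polynomial, reflected form) can be computed bitwise as a reference sketch. A real implementation would use a lookup table or the SSE4.2 `crc32` instruction; this version favors clarity:

```rust
// Bitwise CRC-32C: reflected polynomial 0x82F63B78,
// initial value 0xFFFF_FFFF, final XOR 0xFFFF_FFFF.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask = 0xFFFF_FFFF if the low bit is set, else 0.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

fn main() {
    // Standard CRC-32C check value for the ASCII string "123456789".
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);
    assert_eq!(crc32c(b""), 0);
}
```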
Page Types
```rust
pub enum PageType {
    Meta = 0,              // page 0: database header + catalog roots
    Data = 1,              // heap pages holding table rows
    Index = 2,             // current fixed-slot B+ Tree internal and leaf nodes
    Overflow = 3,          // continuation pages for large values
    Free = 4,              // freelist / unused pages
    ClusteredLeaf = 5,     // slotted clustered leaf: full PK row inline
    ClusteredInternal = 6, // slotted clustered internal: varlen separators
}
```
Clustered Page Primitives (Phase 39.1 / 39.2 / 39.3)
The clustered index rewrite is landing in the storage layer first. Two new page types now exist even though the SQL executor still uses the classic heap + secondary-index path:
- `ClusteredLeaf` — slotted page with variable-size cells storing: `key_len`, `row_len`, inline `RowHeader`, primary-key bytes, row payload bytes
- `ClusteredInternal` — slotted page with variable-size separator cells storing: `right_child`, `key_len`, separator key bytes
ClusteredInternal keeps one extra child pointer in the header as
leftmost_child, so logical child access still follows the classical B-tree
rule n keys -> n + 1 children.
ClusteredInternal body:
[16B header: is_leaf | num_cells | cell_content_start | freeblock_offset | leftmost_child]
[cell pointer array]
[free gap]
[cells: right_child | key_len | key_bytes]
That design keeps the storage primitive compatible with the current traversal contract:
- `find_child_idx(search_key)` returns the first separator strictly greater than the key
- `child_at(0)` reads `leftmost_child`
- `child_at(i > 0)` reads the `right_child` of separator cell `i - 1`
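The traversal contract can be sketched with simplified in-memory types (the struct here is illustrative, not the actual page layout):

```rust
// n separator keys give n + 1 children: child 0 is stored separately as
// leftmost_child, and child i > 0 is the right_child of cell i - 1.
struct Internal {
    leftmost_child: u64,
    cells: Vec<(Vec<u8>, u64)>, // (separator_key, right_child), sorted by key
}

impl Internal {
    /// Index of the child to descend into for `key`:
    /// position of the first separator strictly greater than the key.
    fn find_child_idx(&self, key: &[u8]) -> usize {
        self.cells
            .iter()
            .position(|(sep, _)| sep.as_slice() > key)
            .unwrap_or(self.cells.len())
    }

    fn child_at(&self, i: usize) -> u64 {
        if i == 0 { self.leftmost_child } else { self.cells[i - 1].1 }
    }
}

fn main() {
    let node = Internal {
        leftmost_child: 10,
        cells: vec![(b"m".to_vec(), 11), (b"t".to_vec(), 12)],
    };
    assert_eq!(node.child_at(node.find_child_idx(b"a")), 10); // key < "m"
    assert_eq!(node.child_at(node.find_child_idx(b"m")), 11); // "m" <= key < "t"
    assert_eq!(node.child_at(node.find_child_idx(b"z")), 12); // key >= "t"
}
```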
SQL-Visible Clustered DDL + INSERT Boundary (Phases 39.13 / 39.14)
The storage rewrite is no longer purely internal. CREATE TABLE now uses the
clustered root when the SQL definition contains an explicit PRIMARY KEY:
- `TableDef.root_page_id` is the generic primary row-store root
- `TableDef.storage_layout` tells higher layers whether that root is heap or clustered
- heap tables still allocate `PageType::Data`
- clustered tables now allocate `PageType::ClusteredLeaf`
- logical PRIMARY KEY metadata on clustered tables points at that same clustered root
The first SQL-visible clustered write paths now exist too:
- `INSERT` on explicit-`PRIMARY KEY` tables routes directly into `clustered_tree::insert(...)` or `restore_exact_row_image(...)`
- clustered `AUTO_INCREMENT` bootstraps from clustered rows instead of heap scans
- non-primary clustered indexes are maintained as PK bookmarks through `axiomdb-sql::clustered_secondary`
- `SELECT` on clustered tables now routes through `clustered_tree::lookup(...)` / `range(...)` and decodes clustered secondary bookmarks back into PK probes
- `UPDATE` on clustered tables now routes through clustered candidate discovery plus `update_in_place(...)` / `update_with_relocation(...)`
- `DELETE` on clustered tables now routes through clustered candidate discovery plus `delete_mark(...)` and exact-row-image WAL
- pending heap batches flush before the clustered statement boundary so the new clustered branch does not inherit heap staging semantics accidentally
SQL-visible clustered maintenance is now partially live:
- clustered `VACUUM` now physically purges safe dead rows and overflow chains
- `ALTER TABLE ... REBUILD` now migrates legacy heap+`PRIMARY KEY` tables into a fresh clustered root and rebuilt clustered-secondary bookmark roots
- clustered standalone `CREATE INDEX` / `ANALYZE` remain later Phase 39 work
Clustered maintenance now includes the first purge path:
- `VACUUM` walks the clustered leaf chain from the leftmost leaf
- safe delete-marked cells are physically removed from clustered leaves
- overflow chains are freed during that purge
- secondary bookmark cleanup uses clustered physical existence after leaf purge, not caller-snapshot visibility
- any secondary root rotation caused by `delete_many_in(...)` is persisted back to the catalog in the same transaction
- clustered rebuild flushes the newly built clustered / secondary roots before the catalog swap and defers old heap/index page reclamation until commit
This follows established precedent: SQLite WITHOUT ROWID inserts target the PK B-tree directly, and InnoDB treats the clustered key as the row identity. AxiomDB now does the same for SQL-visible clustered INSERT instead of manufacturing a heap row plus a compatibility index entry.
Clustered Tree Insert Controller (Phase 39.3)
axiomdb-storage::clustered_tree now builds the first tree-level write path on
top of these page primitives. The public entry point is:
```rust
pub fn insert(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    row_header: &RowHeader,
    row_data: &[u8],
) -> Result<u64, DbError>
```
The controller is still storage-first:
- Bootstrap an empty tree into a `ClusteredLeaf` root when `root_pid` is `None`.
- Descend through `ClusteredInternal` pages with `find_child_idx()`.
- Materialize a clustered leaf descriptor:
  - small rows stay fully inline
  - large rows keep a local prefix inline and spill the tail bytes to overflow pages
- Insert that descriptor into the target leaf in sorted key order.
- If the descriptor does not fit, defragment once and retry before splitting.
- Split leaves by cumulative cell byte volume, not by cell count.
- Propagate `(separator_key, right_child_pid)` upward.
- Split internal pages by cumulative separator byte volume and create a new root if the old root overflows.
Split behavior deliberately keeps the old page ID as the left half and allocates only the new right sibling. That matches the current no-concurrent-clustered-writer reality and keeps parent maintenance minimal until the later MVCC/WAL phases wire clustered pages into the full engine.
Since 39.10, rows above the local inline budget are no longer rejected. The
leaf keeps the primary key and RowHeader inline, stores only a bounded local
row prefix on-page, and spills the remaining tail bytes to a dedicated
PageType::Overflow chain.
Clustered Point Lookup (Phase 39.4)
axiomdb-storage::clustered_tree::lookup(...) is now the first read path over
the clustered tree:
```rust
pub fn lookup(
    storage: &dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    snapshot: &TransactionSnapshot,
) -> Result<Option<ClusteredRow>, DbError>
```
Lookup flow:
- Return `None` immediately when the tree has no root.
- Descend clustered internal pages with `find_child_idx()` and `child_at()`.
- Run exact-key binary search on the target clustered leaf.
- Read the leaf descriptor `(key, RowHeader, total_row_len, local_prefix, overflow_ptr?)`.
- Apply `RowHeader::is_visible(snapshot)`.
- If the row is overflow-backed, reconstruct the logical row bytes by reading the overflow-page chain.
- Return an owned `ClusteredRow` on a visible hit.
In 39.4, lookup is intentionally conservative about invisible rows: when the
current inline version fails MVCC visibility, it returns None instead of
trying to synthesize an older version. Clustered undo/version-chain traversal
for arbitrary snapshots still does not exist; 39.11 adds rollback/savepoint
restore for clustered writes, but not older-version reconstruction on reads.
Clustered Range Scan (Phase 39.5)
axiomdb-storage::clustered_tree::range(...) is now the first ordered multi-row
read path over clustered pages:
```rust
pub fn range<'a>(
    storage: &'a dyn StorageEngine,
    root_pid: Option<u64>,
    from: Bound<Vec<u8>>,
    to: Bound<Vec<u8>>,
    snapshot: &TransactionSnapshot,
) -> Result<ClusteredRangeIter<'a>, DbError>
```
Range flow:
- Return an empty iterator when the tree is empty or the bound interval is empty.
- For bounded scans, descend to the first relevant leaf with the same clustered internal-page search path used by point lookup.
- For unbounded scans, descend to the leftmost leaf.
- Start at the first in-range slot within that leaf.
- Yield owned `ClusteredRow` values in primary-key order.
- Skip current inline versions that are invisible to the supplied snapshot.
- Follow `next_leaf` to continue the scan across leaves.
- Stop immediately when the first key above the upper bound is seen.
The iterator stays lazy: it keeps only the current leaf page id, slot index, bound copies, and snapshot. It does not materialize the whole range into a temporary vector.
When the iterator advances to another leaf, it calls
StorageEngine::prefetch_hint(next_leaf_pid, 4). The 4-page window is
intentionally conservative: large enough to overlap sequential leaf reads, but
small enough not to flood the page cache while clustered scans are still an
internal storage primitive.
Like 39.4, this subphase is still honest about missing older-version
reconstruction. If a row’s current inline version is invisible, 39.5 skips
it; the new 39.11 rollback support does not change read semantics yet.
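The stop condition in the range flow is just a comparison of the current key against the upper Bound. A minimal sketch with a hypothetical helper name:

```rust
use std::ops::Bound;

/// True while `key` is still inside the upper bound of the scan.
/// Hypothetical helper mirroring the iterator's stop condition.
fn within_upper(key: &[u8], to: &Bound<Vec<u8>>) -> bool {
    match to {
        Bound::Unbounded => true,
        Bound::Included(hi) => key <= hi.as_slice(),
        Bound::Excluded(hi) => key < hi.as_slice(),
    }
}

fn main() {
    let to = Bound::Excluded(b"m".to_vec());
    assert!(within_upper(b"a", &to));
    assert!(!within_upper(b"m", &to)); // first key at/above the bound stops the scan
    assert!(within_upper(b"z", &Bound::Unbounded));
}
```

Because leaf keys are visited in order, the first key that fails this check ends the whole scan; no later leaf can contain an in-range key.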
Zero-Allocation Full Scan (scan_all_callback, Phase 39.21)
ClusteredRangeIter::next() allocates two heap buffers per row:
cell.key.to_vec() (primary key copy) and reconstruct_row_data (row bytes
copy). For a full-table scan that only needs to decode the row bytes into
Vec<Value>, both allocations are unnecessary.
scan_all_callback bypasses the iterator entirely:
pub fn scan_all_callback<F>(
    storage: &dyn StorageEngine,
    root_pid: Option<u64>,
    snapshot: &TransactionSnapshot,
    mut f: F,
) -> Result<(), DbError>
where
    F: FnMut(&[u8], Option<(u64, usize)>) -> Result<(), DbError>,
The callback receives (inline_data: &[u8], overflow):
- inline_data: a borrow of cell.row_data directly from the leaf page memory — no copy.
- overflow: Some((first_overflow_page, tail_len)) for rows that spill to overflow pages; None for rows that fit inline (the common case for most tables).
For inline rows the callback can decode inline_data in place. The caller
allocates only one Vec<Value> per visible row — the decoded output — compared
to three allocations with the iterator path.
A GROUP BY age, AVG(score) query on 50K rows of a clustered table dropped from 57 ms to 4.0 ms (a 14.25× improvement) after switching from ClusteredRangeIter to scan_all_callback. The bottleneck was ~150K heap allocations per scan (key copy + row copy + Vec<Value>). The callback path eliminates the first two, leaving only the Vec<Value> per row. AxiomDB now runs this query 1.6× faster than MariaDB (6.5 ms) and 2.2× faster than MySQL (8.9 ms) on the same hardware.
Clustered Overflow Pages (Phase 39.10)
Phase 39.10 adds the first overflow-page primitive dedicated to clustered
rows:
Leaf cell:
[key_len: u16]
[total_row_len: u32]
[RowHeader: 24B]
[key bytes]
[local row prefix]
[overflow_first_page?: u64]
Overflow page body:
[next_overflow_page: u64]
[payload bytes...]
The contract is intentionally physical:
- Keep the primary key and RowHeader inline in the clustered leaf.
- Keep only a bounded local row prefix inline.
- Spill the remaining logical row tail to PageType::Overflow pages.
- Reconstruct the full logical row only on read paths (lookup, range) or update paths that need the logical bytes.
- Let split / merge / rebalance move the physical descriptor without rewriting the overflow payload.
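The spill decision can be modeled as a pure split of the logical row into an inline prefix plus fixed-capacity overflow chunks. A sketch under assumed capacities (the real leaf and overflow budgets are page-format details this sketch does not know):

```rust
/// Split a logical row into (inline prefix, overflow chunks).
/// `prefix_cap` and `chunk_cap` stand in for the real leaf/overflow page budgets.
fn split_row(row: &[u8], prefix_cap: usize, chunk_cap: usize) -> (&[u8], Vec<&[u8]>) {
    if row.len() <= prefix_cap {
        return (row, Vec::new()); // fits inline, no overflow chain
    }
    let (prefix, mut tail) = row.split_at(prefix_cap);
    let mut chunks = Vec::new();
    while !tail.is_empty() {
        let take = tail.len().min(chunk_cap);
        let (chunk, rest) = tail.split_at(take);
        chunks.push(chunk); // one PageType::Overflow page per chunk
        tail = rest;
    }
    (prefix, chunks)
}

fn main() {
    let row = [0u8; 10];
    let (prefix, chunks) = split_row(&row, 4, 3);
    assert_eq!((prefix.len(), chunks.len()), (4, 2)); // 6-byte tail in 3-byte chunks
    let (p2, c2) = split_row(&row, 16, 3);
    assert_eq!((p2.len(), c2.len()), (10, 0)); // inline case
}
```

The chunks correspond to the next_overflow_page-linked chain in the layout above; reads reconstruct the row by concatenating prefix and chunks in order.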
Phase 39.10 itself intentionally did not introduce generic TOAST
references, compression, or crash recovery for overflow chains. 39.11 now
adds in-process clustered WAL/rollback over those row images, but clustered
crash recovery still stays in later phases.
Clustered WAL and Rollback (Phase 39.11)
Phase 39.11 adds the first WAL contract that understands clustered rows:
key = primary-key bytes
old_value = ClusteredRowImage? // exact old row image
new_value = ClusteredRowImage? // exact new row image
Where ClusteredRowImage carries:
- the latest clustered root_pid
- the exact inline RowHeader
- the exact logical row bytes, regardless of whether the row is inline or overflow-backed on page
TxnManager now tracks the latest clustered root per table_id during the
active transaction. Rollback and savepoint undo use that root plus two storage
helpers:
- delete_physical_by_key(...) to undo a clustered insert
- restore_exact_row_image(...) to undo a clustered delete-mark or update
The restore invariant is logical row state, not exact page topology. Split,
merge, or relocate-update may still leave a different physical tree shape after
rollback as long as the old primary key, RowHeader, and row bytes are back.
Phase 39.12 now extends that same contract into clustered crash recovery:
open_with_recovery() undoes in-progress clustered writes by PK + exact row
image, and open() rebuilds committed clustered roots from surviving WAL
history on a clean reopen.
Clustered Update In Place (Phase 39.6)
axiomdb-storage::clustered_tree::update_in_place(...) is now the first
clustered-row write path after insert:
pub fn update_in_place(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    new_row_data: &[u8],
    txn_id: u64,
    snapshot: &TransactionSnapshot,
) -> Result<bool, DbError>
Update flow:
- Return false when the tree is empty, the key is absent, or the current inline version is not visible to the supplied snapshot.
- Descend to the owning clustered leaf by primary key.
- Build a new inline RowHeader with txn_id_created = txn_id, txn_id_deleted = 0, and row_version = old.row_version + 1.
- Materialize a replacement descriptor: either an inline row or a local-prefix + overflow chain.
- Ask the leaf primitive to rewrite that exact cell while preserving key order.
- Persist the leaf if the rewrite stays inside the same page.
- Free the obsolete overflow chain only after a successful physical rewrite.
- Return HeapPageFull when the replacement row would require leaving the current leaf.
The leaf primitive has two rewrite modes:
- overwrite fast path when the replacement encoded cell fits the existing cell budget
- same-leaf rebuild fallback when the row grows, but the leaf can still be rebuilt compactly with the replacement row in place
Neither path changes the primary key, pointer-array order, parent separators, or
next_leaf.
This keeps the subphase honest about what now exists:
- clustered insert
- clustered point lookup
- clustered range scan
- clustered same-leaf update
- clustered delete-mark
And what still does not:
- clustered older-version reconstruction/version chains
- clustered root persistence beyond WAL checkpoint/rotation
- clustered physical purge
- clustered SQL executor integration
Clustered Delete Mark (Phase 39.7)
axiomdb-storage::clustered_tree::delete_mark(...) now adds the first logical
delete path over clustered pages:
pub fn delete_mark(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    txn_id: u64,
    snapshot: &TransactionSnapshot,
) -> Result<bool, DbError>
Delete flow:
- Return false when the tree is empty, the key is absent, or the current inline version is not visible to the supplied snapshot.
- Descend to the owning clustered leaf by primary key.
- Build a replacement RowHeader that preserves txn_id_created, row_version, and _flags, and stamps txn_id_deleted = txn_id.
- Rewrite the exact clustered cell in place while preserving key bytes and row payload bytes.
- Persist the leaf page without changing next_leaf or parent separators.
The important semantic boundary is that clustered delete is currently a
header-state transition, not space reclamation. The physical cell stays on
the leaf page so snapshots older than the delete can still observe it through
the existing RowHeader::is_visible(...) rule.
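The header-state transition can be sketched with a reduced header and a simplified visibility rule (field set and rule reduced for the sketch; the real 24-byte RowHeader and RowHeader::is_visible(...) live in the storage crate):

```rust
/// Reduced model of the clustered row header and a snapshot.
struct Header { txn_id_created: u64, txn_id_deleted: u64 }
struct Snapshot { max_committed: u64 }

/// Simplified visibility: creator committed before the snapshot,
/// and any deleter committed after it (or no deleter at all).
fn is_visible(h: &Header, s: &Snapshot) -> bool {
    h.txn_id_created <= s.max_committed
        && (h.txn_id_deleted == 0 || h.txn_id_deleted > s.max_committed)
}

fn main() {
    let mut row = Header { txn_id_created: 5, txn_id_deleted: 0 };
    let old_snap = Snapshot { max_committed: 7 };
    assert!(is_visible(&row, &old_snap));
    // delete_mark: a header-state transition only; the cell stays on the page
    row.txn_id_deleted = 9;
    assert!(is_visible(&row, &old_snap)); // the older snapshot still sees the row
    assert!(!is_visible(&row, &Snapshot { max_committed: 9 }));
}
```

This is exactly why the physical cell must stay on the leaf: the older snapshot's visibility depends on the stamped header, not on the cell's presence in a freelist.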
Clustered Structural Rebalance (Phase 39.8)
axiomdb-storage::clustered_tree::update_with_relocation(...) adds the first
clustered structural-maintenance path:
pub fn update_with_relocation(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    new_row_data: &[u8],
    txn_id: u64,
    snapshot: &TransactionSnapshot,
) -> Result<Option<u64>, DbError>
Control flow:
- Validate that the replacement row still fits inline on a clustered leaf.
- Try update_in_place(...) first.
- If the same-leaf rewrite returns HeapPageFull, reload the visible current row and enter the structural path.
- Physically remove the exact clustered cell from the tree.
- Bubble underfull and min_changed upward:
  - repair the parent separator when a non-leftmost child changes its minimum key
  - redistribute or merge clustered leaf siblings by encoded byte volume
  - redistribute or merge clustered internal siblings while preserving n keys -> n + 1 children
- Collapse an empty internal root to its only child.
- Reinsert the replacement row with a bumped row_version.
The key design boundary is that 39.8 introduces private structural delete
only for relocate-update. Public clustered delete is still delete_mark(...),
so snapshot-safe purge remains a later concern.
Current limitations:
- delete_mark(...) still keeps dead clustered cells inline; 39.8 does not expose purge to SQL or storage callers yet.
- relocate-update still rewrites only the current inline version.
- parent separator repair currently assumes the repaired separator still fits in the existing internal page budget; split-on-separator-repair is deferred.
Clustered Secondary Bookmarks (Phase 39.9)
Phase 39.9 adds the first clustered-first secondary-index layout in
axiomdb-sql/src/clustered_secondary.rs.
The physical key is:
secondary_logical_key ++ missing_primary_key_columns
Where:
- secondary_logical_key is the ordered value vector of the secondary index columns.
- missing_primary_key_columns are only the PK columns that are not already present in the secondary key.
That means the physical secondary entry now carries enough information to
recover the owning clustered row by primary key without depending on a heap
RecordId.
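The key composition can be sketched as a pure function over named columns (names and raw concatenation are illustrative only; the real helpers use an order-preserving encoding):

```rust
/// Compose a physical secondary key: the logical secondary columns,
/// followed only by PK columns not already covered by the index.
/// Columns are modeled as (name, bytes) pairs for the sketch.
fn bookmark_key(secondary: &[(&str, &[u8])], primary: &[(&str, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    for (_, v) in secondary {
        out.extend_from_slice(v);
    }
    for (name, v) in primary {
        // append only PK columns the secondary key does not already contain
        if !secondary.iter().any(|(n, _)| n == name) {
            out.extend_from_slice(v);
        }
    }
    out
}

fn main() {
    // index on (email), PK (id): physical key = email ++ id
    let k = bookmark_key(&[("email", b"a@x".as_slice())], &[("id", b"\x01".as_slice())]);
    assert_eq!(k, b"a@x\x01");
    // index on (id, email), PK (id): id is already present, nothing is appended
    let k2 = bookmark_key(
        &[("id", b"\x01".as_slice()), ("email", b"a@x".as_slice())],
        &[("id", b"\x01".as_slice())],
    );
    assert_eq!(k2, b"\x01a@x");
}
```

Either way, the primary key is fully recoverable from the physical entry, which is what removes the dependence on a heap RecordId.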
The dedicated helpers now provide:
- layout derivation from (secondary_idx, primary_idx)
- encode/decode of bookmark-bearing secondary keys
- logical-prefix bounds without a fixed 10-byte RID suffix
- insert/delete/update maintenance where relocate-only updates become no-ops if the logical secondary key and primary key stay stable
Current boundary:
- this path is not wired into the heap-backed SQL executor yet
- FK enforcement and index-integrity rebuilds still use the old RecordId-based secondary path
- the legacy RecordId payload in axiomdb-index::BTree remains only a compatibility artifact for this path
MmapStorage — Memory-Mapped File
MmapStorage uses a hybrid I/O model inspired by SQLite: read-only mmap for reads,
pwrite() for writes. The mmap is opened with memmap2::Mmap (not MmapMut),
making it structurally impossible to write through the mapped region.
Physical file (axiomdb.db):
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │
│ (Meta) │ (Data) │ (Index) │ (Data) │ │
└──────────┴──────────┴──────────┴──────────┴──────────┘
↑ ↑ ↓
│ └── read_page(1): copy 16KB from mmap → owned PageRef
└── mmap (read-only, MAP_SHARED)
write_page(3): pwrite() to file descriptor
Read path: mmap + PageRef copy
read_page(page_id) computes mmap_ptr + page_id * 16384, copies 16 KB into a
heap-allocated PageRef, verifies the CRC32c checksum, and returns the owned copy.
The copy cost (~0.5 us from L2/L3 cache) is the same price PostgreSQL pays when
copying a buffer pool page into backend-local memory.
Write path: pwrite() to file descriptor
write_page(page_id, page) calls pwrite() on the underlying file descriptor at
offset page_id * 16384. The mmap (MAP_SHARED) automatically reflects the change
on subsequent reads. Note that a 16 KB pwrite() is not crash-atomic on 4 KB-block
filesystems — the Doublewrite Buffer protects against torn pages.
Flush: doublewrite + fsync
flush() follows a two-phase write protocol:
- Doublewrite phase: all dirty pages (plus pages 0 and 1) are serialized to a .dw file and fsynced. This creates a durable copy of the committed state.
- Main fsync: the freelist is pwritten (if modified) and the main .db file is fsynced. If this fsync is interrupted by a crash, the .dw file provides repair data on the next startup.
- Cleanup: the .dw file is deleted. If deletion fails, the next open() finds all pages valid and removes it.
Trade-offs:
- We cannot control which pages stay hot in memory (the OS uses LRU).
- On 32-bit systems, the address space limits the maximum database size. On 64-bit, the address space is effectively unlimited.
- PageRef copies add ~0.5 us per page read vs. direct pointer access, but this eliminates use-after-free risks from mmap remap and page reuse.
Deferred Page Free Queue
When free_page(page_id) is called, the page does not return to the freelist
immediately. Instead it enters an epoch-tagged queue: deferred_frees: Vec<(page_id, freed_at_snapshot)>. Each entry records the snapshot epoch at which the page became
unreachable. release_deferred_frees(oldest_active_snapshot) only releases pages
whose freed_at_snapshot <= oldest_active_snapshot — pages freed more recently remain
queued because a concurrent reader might still hold a snapshot that references them.
Under the current Arc<RwLock<Database>> architecture, flush() passes u64::MAX
(release all) because the writer holds exclusive access and no readers are active.
When snapshot slot tracking is added (Phase 7.8), the actual oldest active snapshot
will be used instead. The queue is capped at 4096 entries with a tracing warning
to detect snapshot leaks.
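The epoch gate can be sketched directly from the description above (names follow the text; the sketch ignores the freelist itself and the 4096-entry cap):

```rust
/// Deferred frees: pages that became unreachable at a snapshot epoch.
struct DeferredFrees {
    queue: Vec<(u64, u64)>, // (page_id, freed_at_snapshot)
}

impl DeferredFrees {
    fn free_page(&mut self, page_id: u64, epoch: u64) {
        self.queue.push((page_id, epoch));
    }

    /// Release only pages no active snapshot can still reference.
    fn release_deferred_frees(&mut self, oldest_active_snapshot: u64) -> Vec<u64> {
        let (ready, keep): (Vec<_>, Vec<_>) = self
            .queue
            .drain(..)
            .partition(|&(_, e)| e <= oldest_active_snapshot);
        self.queue = keep; // newer frees stay queued
        ready.into_iter().map(|(pid, _)| pid).collect()
    }
}

fn main() {
    let mut df = DeferredFrees { queue: Vec::new() };
    df.free_page(7, 10);
    df.free_page(8, 20);
    assert_eq!(df.release_deferred_frees(15), vec![7]); // epoch 20 stays queued
    assert_eq!(df.queue.len(), 1);
    assert_eq!(df.release_deferred_frees(u64::MAX), vec![8]); // flush(): release all
}
```

Passing u64::MAX models today's exclusive-writer flush; once snapshot slot tracking lands, the argument becomes the real oldest active snapshot.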
Doublewrite Buffer
A 16 KB pwrite() is not crash-atomic on any modern filesystem with 4 KB
internal blocks (APFS, ext4, XFS, ZFS). A power failure mid-write leaves a torn
page: the first N×4 KB contain new data, the remainder holds the previous state.
CRC32c detects this corruption on startup, but without a repair source the database
cannot open.
The doublewrite (DW) buffer solves this. Before every flush(), all dirty pages
are serialized to a .dw file alongside the main .db file:
database.db ← main data file
database.db.dw ← doublewrite buffer (transient, exists only during flush)
DW File Format
[Header: 16 bytes]
magic: "AXMDBLWR" (8 bytes)
version: u32 LE = 1
slot_count: u32 LE
[Slots: slot_count × 16,392 bytes each]
page_id: u64 LE
page_data: [u8; 16384]
[Footer: 8 bytes]
file_crc: CRC32c(header || all slots)
sentinel: 0xDEAD_BEEF
Flush Protocol
1. Collect dirty pages + pages 0 and 1 from the mmap
2. Write all to .dw file → single sequential write
3. fsync .dw file ← committed copy durable
4. pwrite freelist to main file
5. fsync main file ← main data durable
6. Delete .dw file ← cleanup (non-fatal on failure)
Startup Recovery
On MmapStorage::open(), if a .dw file exists:
- Validate the DW file (magic, version, size, CRC, sentinel)
- For each slot: read the corresponding page from the main file
- If CRC is invalid (torn page) → restore from DW copy
- fsync the main file → repairs durable
- Delete the .dw file
Recovery is idempotent: if interrupted, the DW file is still valid and the next startup reruns recovery. Pages already repaired have valid CRCs and are skipped.
MySQL 8.0.20 moved InnoDB's doublewrite buffer out of the system tablespace into dedicated #ib_*.dblwr files for better sequential I/O and zero impact on the tablespace format. AxiomDB follows this newer approach: the DW file is sequential-write-only, does not change the main file format, and requires no migration for existing databases.
Dirty Page Tracking and Targeted Flush
MmapStorage tracks every page written since the last flush() in a
PageDirtyTracker (an in-memory HashSet<u64>). On flush(), instead of
calling mmap.flush() (which issues msync over the entire file), AxiomDB
coalesces the dirty page IDs into contiguous runs and issues one flush_range
call per run.
Coalescing algorithm
PageDirtyTracker::contiguous_runs() sorts the dirty IDs and merges adjacent
IDs into (start_page, run_length) pairs:
// Dirty pages: {2, 3, 5, 6, 7} → runs: [(2, 2), (5, 3)]
// Byte ranges: [(2*16384, 32768), (5*16384, 49152)]
The merge is O(n log n) on the number of dirty pages and produces the minimum
number of msync syscalls for any given dirty set.
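A self-contained sketch of the coalescing step (the real PageDirtyTracker wraps a HashSet<u64>; only the sort-and-merge logic is shown):

```rust
/// Sort dirty page ids and merge adjacent ids into (start_page, run_length) pairs.
fn contiguous_runs(dirty: &mut Vec<u64>) -> Vec<(u64, u64)> {
    dirty.sort_unstable(); // the O(n log n) term
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for &pid in dirty.iter() {
        match runs.last_mut() {
            // extend the current run when the id is adjacent to its end
            Some((start, len)) if *start + *len == pid => *len += 1,
            _ => runs.push((pid, 1)),
        }
    }
    runs
}

fn main() {
    let mut dirty = vec![5, 2, 7, 3, 6];
    let runs = contiguous_runs(&mut dirty);
    assert_eq!(runs, vec![(2, 2), (5, 3)]);
    // one flush_range per run: (start * 16384, len * 16384) byte ranges
    let ranges: Vec<(u64, u64)> = runs.iter().map(|&(s, l)| (s * 16384, l * 16384)).collect();
    assert_eq!(ranges, vec![(32768, 32768), (81920, 49152)]);
}
```

Each run maps to exactly one msync-able byte range, so the syscall count equals the number of gaps in the dirty set plus one.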
Freelist integration
When the freelist changes (alloc_page, free_page), freelist_dirty is set.
On flush(), the freelist bitmap is serialized into page 1 first, and page 1 is
added to the effective flush set even if it was not already in the dirty tracker.
Only after all targeted flushes succeed are freelist_dirty and the dirty
tracker cleared. A partial failure leaves both intact so the next flush() can
retry safely.
Disk-full error classification
Every durable I/O call in flush() (and in create()/grow()) passes its
std::io::Error through classify_io() before returning:
// axiomdb-core/src/error.rs
pub fn classify_io(err: std::io::Error, operation: &'static str) -> DbError {
    // ENOSPC (28) and EDQUOT (69/122) → DbError::DiskFull { operation }
    // All other errors → DbError::Io(err)
}
When a DiskFull error propagates out of MmapStorage, the server runtime
transitions to read-only degraded mode — all subsequent mutating statements
are rejected immediately without re-entering the storage layer.
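A runnable sketch of the classification rule, with a reduced DbError standing in for the real enum in axiomdb-core:

```rust
use std::io;

/// Reduced error model for the sketch.
#[derive(Debug, PartialEq)]
enum DbError {
    DiskFull { operation: &'static str },
    Io(io::ErrorKind),
}

/// ENOSPC and EDQUOT become DiskFull; everything else stays Io.
fn classify_io(err: io::Error, operation: &'static str) -> DbError {
    match err.raw_os_error() {
        // 28 = ENOSPC; EDQUOT is 122 on Linux, 69 on macOS/BSD
        Some(28) | Some(122) | Some(69) => DbError::DiskFull { operation },
        _ => DbError::Io(err.kind()),
    }
}

fn main() {
    let enospc = io::Error::from_raw_os_error(28);
    assert_eq!(
        classify_io(enospc, "flush"),
        DbError::DiskFull { operation: "flush" }
    );
    let denied = io::Error::from_raw_os_error(13); // EACCES
    assert_eq!(classify_io(denied, "flush"), DbError::Io(io::ErrorKind::PermissionDenied));
}
```

Matching on raw_os_error (rather than ErrorKind) is what keeps the classification stable across platforms where the kind mapping differs.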
Invariants
- flush() returns Ok(()) only after all dirty pages are durable.
- Dirty tracking is cleared only on success — never on failure.
- The freelist page (page 1) is always included when freelist_dirty is set, regardless of whether it appears in the tracker.
- dirty_page_count() always reflects the count since the last successful flush.
- ENOSPC/EDQUOT errors are always surfaced as DbError::DiskFull, never silently wrapped in DbError::Io.
Verified Open — Corruption Detection at Startup
MmapStorage::open() validates every allocated page before making the
database available. The startup sequence is:
- Map the file and verify page 0 (meta) — magic, version, page count.
- Load the freelist from page 1 and verify its checksum.
- Scan pages 2..page_count, skipping any page the freelist marks as free. For each allocated page, call read_page_from_mmap(), which re-computes the CRC32c of the body and compares it to the stored header.checksum.
for page_id in 2..page_count {
    if !freelist.is_free(page_id) {
        Self::read_page_from_mmap(&mmap, page_id)?;
    }
}
If any page fails, open() returns DbError::ChecksumMismatch { page_id, expected, got }
immediately. No connection is accepted and no Db handle is returned.
Free pages are skipped because they are never written by the storage engine and therefore have no valid page header or checksum. Scanning them would produce false positives on a freshly created or partially filled database.
Recovery wiring
Both the network server (Database::open) and the embedded handle (Db::open)
route through TxnManager::open_with_recovery() on every reopen:
let (txn, _recovery) = TxnManager::open_with_recovery(&mut storage, &wal_path)?;
This ensures WAL replay runs before the first query is executed, even if the
only change in this subphase is the corruption scan. Bypassing
open_with_recovery() with the older TxnManager::open() was an oversight
that this subphase closes.
MemoryStorage — In-Memory for Tests
MemoryStorage stores pages in a Vec<Box<Page>>. It implements the same
StorageEngine trait as MmapStorage. All unit tests for the B+ Tree, WAL,
and catalog use MemoryStorage, so they run without touching the filesystem.
let mut storage = MemoryStorage::new();
let id = storage.alloc_page(PageType::Data)?;
let mut page = Page::new(PageType::Data, id);
page.body_mut()[0] = 0xAB;
page.update_checksum();
storage.write_page(id, &page)?;
let read_back = storage.read_page(id)?;
assert_eq!(read_back.body()[0], 0xAB);
FreeList — Page Allocation
The FreeList tracks which pages are free using a bitmap. The bitmap is stored in a
dedicated page (or pages, for large databases). Each bit corresponds to one page:
1 = free, 0 = in use.
Allocation
Scans left-to-right for the first 1 bit, clears it, and returns the page ID.
Bitmap: 1110 1101 ...
↑
First free: page 0 (bit 0 = 1)
After allocation: 0110 1101 ...
Deallocation
Sets the bit corresponding to page_id back to 1. Returns
DbError::DoubleFree if the bit was already 1 (guard against bugs in the
caller).
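Both operations fit in a small bitmap sketch, including the double-free guard. This is a minimal model (u8 words, LSB-first bit order within each byte, a &'static str standing in for DbError::DoubleFree); the diagram above prints bits left to right, which this sketch does not reproduce:

```rust
/// 1 = free, 0 = in use. Bit i of byte i/8 tracks page i.
struct FreeBitmap { bits: Vec<u8> }

impl FreeBitmap {
    /// First-fit scan: find the first set bit, clear it, return the page id.
    fn alloc(&mut self) -> Option<u64> {
        for (byte_idx, byte) in self.bits.iter_mut().enumerate() {
            if *byte != 0 {
                let bit = byte.trailing_zeros(); // lowest set bit
                *byte &= !(1u8 << bit);
                return Some(byte_idx as u64 * 8 + bit as u64);
            }
        }
        None
    }

    /// Set the bit back to 1; report a double free if it already was.
    fn free(&mut self, page_id: u64) -> Result<(), &'static str> {
        let (byte, bit) = ((page_id / 8) as usize, (page_id % 8) as u32);
        if self.bits[byte] & (1u8 << bit) != 0 {
            return Err("DoubleFree"); // stands in for DbError::DoubleFree
        }
        self.bits[byte] |= 1u8 << bit;
        Ok(())
    }
}

fn main() {
    let mut fl = FreeBitmap { bits: vec![0b0000_0101] }; // pages 0 and 2 free
    assert_eq!(fl.alloc(), Some(0));
    assert_eq!(fl.alloc(), Some(2));
    assert_eq!(fl.alloc(), None);
    assert_eq!(fl.free(2), Ok(()));
    assert_eq!(fl.free(2), Err("DoubleFree"));
}
```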
Invariants
- No page appears twice in the freelist.
- No page can be both allocated and in the freelist simultaneously.
- The freelist bitmap is itself stored in allocated pages (and tracked recursively during bootstrap).
Heap Pages — Slotted Format
Table rows (heap tuples) are stored in PageType::Data pages using a slotted page
layout. The slot array grows from the start of the body; tuples grow from the end
toward the center.
Body (16,320 bytes):
┌─────────────────────────────────────────────────────────────┐
│ Slot[0] │ Slot[1] │ ... │ free space │ ... │ Tuple[1] │ Tuple[0] │
└──────────────────────────────────────────────────────────────┘
↑ ↑ ↑
free_start free area free_end (decreases)
free_start points to the first unused byte after the last slot entry.
free_end points to the first byte of the last tuple written (counting from the
end of the body).
SlotEntry — 4 bytes
Offset Size Field
0 2 offset — byte offset of the tuple within the body (0 = empty slot)
2 2 length — total length of the tuple in bytes
A slot with offset = 0 and length = 0 is an empty (deleted) slot. Deleted slots
are reused when the page is compacted (VACUUM, planned Phase 9).
RowHeader — 24 bytes
Every heap tuple begins with a RowHeader that stores MVCC visibility metadata:
Offset Size Field
0 8 xmin — txn_id of the transaction that inserted this row
8 8 xmax — txn_id of the transaction that deleted/updated this row (0 = live)
16 1 deleted — 1 if this row has been logically deleted
17 7 _pad — alignment
Total: 24 bytes
After the RowHeader comes the null bitmap and the encoded column data (see Row Codec).
Null Bitmap in Heap Rows
The null bitmap is stored immediately after the RowHeader. It occupies
ceil(n_cols / 8) bytes. Bit i (zero-indexed) being 1 means column i is NULL.
5 columns → ceil(5/8) = 1 byte = 8 bits (bits 5-7 unused, always 0)
11 columns → ceil(11/8) = 2 bytes
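The bitmap arithmetic in runnable form (helper names hypothetical):

```rust
/// Bytes needed for the null bitmap of an n-column row.
fn null_bitmap_len(n_cols: usize) -> usize {
    (n_cols + 7) / 8 // ceil(n_cols / 8)
}

/// Bit i set means column i is NULL.
fn is_null(bitmap: &[u8], col: usize) -> bool {
    bitmap[col / 8] & (1u8 << (col % 8)) != 0
}

fn main() {
    assert_eq!(null_bitmap_len(5), 1);  // bits 5-7 unused, always 0
    assert_eq!(null_bitmap_len(11), 2);
    let bitmap = [0b0000_0100u8]; // column 2 is NULL
    assert!(is_null(&bitmap, 2));
    assert!(!is_null(&bitmap, 0));
}
```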
Page 0 — The Meta Page
Page 0 is the PageType::Meta page. It is written during database creation
(bootstrap) and read during open(). Its body contains:
Offset Size Field
0 8 format_version — AxiomDB file format version
8 8 catalog_root_page — Page ID of the catalog root (axiom_tables B+ Tree root)
16 8 freelist_root_page — Page ID of the freelist bitmap root
24 8 next_txn_id — Next transaction ID to assign
32 8 checkpoint_lsn — LSN of the last successful checkpoint
40 rest _reserved — Future extensions
On crash recovery, the checkpoint_lsn tells the WAL reader where to start replaying.
All WAL entries with LSN > checkpoint_lsn and belonging to committed transactions
are replayed.
Batch Delete Operations
AxiomDB implements three optimizations for DELETE workloads that dramatically reduce page I/O and CRC32c computation overhead.
HeapChain::delete_batch()
delete_batch() accepts a slice of (page_id, slot_id) pairs and groups them by
page_id before touching any page. For each unique page it reads the page once,
marks all targeted slots dead in a single pass, then writes the page back once.
Naive per-row delete path (before delete_batch):
for each of N rows:
read_page(page_id) ← 1 read
mark slot dead ← 1 mutation
update_checksum(page) ← 1 CRC32c over 16 KB
write_page(page_id, page) ← 1 write
Total: 3N page operations
Batch path (delete_batch):
group rows by page_id → P unique pages
for each page:
read_page(page_id) ← 1 read
mark all M slots dead ← M mutations (M rows on this page)
update_checksum(page) ← 1 CRC32c (once per page, not per row)
write_page(page_id, page) ← 1 write
Total: 2P page operations
At 200 rows/page, deleting 10,000 rows hits 50 pages. The naive path requires 30,000
page operations; delete_batch() requires 100.
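The arithmetic follows directly from grouping RIDs by page id, as delete_batch() does before touching any page. A sketch (BTreeMap chosen only for deterministic ordering):

```rust
use std::collections::BTreeMap;

/// Group (page_id, slot_id) pairs by page and return the resulting
/// page-operation count (1 read + 1 write per unique page).
fn batch_page_ops(rids: &[(u64, u16)]) -> (BTreeMap<u64, Vec<u16>>, usize) {
    let mut by_page: BTreeMap<u64, Vec<u16>> = BTreeMap::new();
    for &(page_id, slot_id) in rids {
        by_page.entry(page_id).or_default().push(slot_id);
    }
    let ops = 2 * by_page.len(); // read once + write once per page
    (by_page, ops)
}

fn main() {
    // 10,000 rows at 200 rows/page → 50 unique pages
    let rids: Vec<(u64, u16)> = (0u64..10_000).map(|i| (i / 200, (i % 200) as u16)).collect();
    let (by_page, ops) = batch_page_ops(&rids);
    assert_eq!(by_page.len(), 50);
    assert_eq!(ops, 100);            // batch path
    assert_eq!(3 * rids.len(), 30_000); // naive path from the text
}
```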
mark_deleted() vs delete_tuple() — Splitting Checksum Work
heap::mark_deleted() is an internal function that stamps the slot as dead without
recomputing the page checksum. delete_tuple() (the single-row public API) calls
mark_deleted() followed immediately by update_checksum() — behavior is unchanged
for callers.
The batch path calls mark_deleted() N times (once per slot on a given page), then
calls update_checksum() exactly once when all slots on that page are done.
// Single-row path (public, unchanged):
pub fn delete_tuple(page: &mut Page, slot_id: u16) -> Result<(), DbError> {
    mark_deleted(page, slot_id)?; // stamp dead
    page.update_checksum();       // 1 CRC32c
    Ok(())
}

// Batch path (called by delete_batch for each page):
for &slot_id in slots_on_this_page {
    mark_deleted(page, slot_id)?; // stamp dead, no checksum
}
page.update_checksum(); // 1 CRC32c for all N slots on this page
Splitting mark_deleted from update_checksum makes the cost O(P) in the number of pages, not O(N) in the number of rows. The same split was applied to insert_batch in Phase 3.17.
scan_rids_visible()
HeapChain::scan_rids_visible() is a variant of scan_visible() that returns only
(page_id, slot_id) pairs — no row data is decoded or copied.
pub fn scan_rids_visible(
    &self,
    storage: &dyn StorageEngine,
    snapshot: &TransactionSnapshot,
    self_txn_id: u64,
) -> Result<Vec<(u64, u16)>, DbError>
This is used by DELETE without a WHERE clause and TRUNCATE TABLE: both operations
need to locate every live slot but neither needs to decode the row’s column values.
Avoiding Vec<u8> allocation for each row’s payload cuts memory allocation to near
zero for full-table deletes.
HeapChain::clear_deletions_by_txn()
clear_deletions_by_txn(txn_id) is the undo helper for WalEntry::Truncate. It
scans the entire heap chain and, for every slot where txn_id_deleted == txn_id,
clears the deletion stamp (sets txn_id_deleted = 0, deleted = 0).
This is used during ROLLBACK and crash recovery when a WalEntry::Truncate must be
undone. The cost is O(P) page reads and writes for P pages in the chain — identical
to a full-table scan. Because recovery and rollback are infrequent relative to inserts
and deletes, this trade-off is acceptable (see WAL internals for the corresponding
WalEntry::Truncate design decision).
All-Visible Page Flag (Optimization A)
What it is
Bit 0 of PageHeader.flags (PAGE_FLAG_ALL_VISIBLE = 0x01). When set, it
asserts that every alive slot on the page was inserted by a committed transaction
and none have been deleted. Sequential scans can skip per-slot MVCC
txn_id_deleted tracking for those pages entirely.
Inspired by PostgreSQL’s all-visible map (src/backend/storage/heap/heapam.c:668),
but implemented as an in-page bit rather than a separate VM file — a single
cache-line read suffices.
API
pub const PAGE_FLAG_ALL_VISIBLE: u8 = 0x01;

impl Page {
    pub fn is_all_visible(&self) -> bool { ... }  // reads bit 0 of flags
    pub fn set_all_visible(&mut self) { ... }     // sets bit 0; caller updates checksum
    pub fn clear_all_visible(&mut self) { ... }   // clears bit 0; caller updates checksum
}
Lazy-set during scan
HeapChain::scan_visible() sets the flag after verifying that all alive slots
on a page satisfy:
- txn_id_created <= max_committed (committed transaction)
- txn_id_deleted == 0 (not deleted)
This is a one-time write per page per table lifetime. After the first slow-path scan, every subsequent scan takes the fast path and skips per-slot checks.
Clearing on delete
heap::mark_deleted() clears the flag unconditionally as its very first
mutation — before stamping txn_id_deleted. Both changes land in the same
update_checksum() + write_page() call. There is no window where the flag is
set while a slot is deleted.
Read-only variant for catalog scans
HeapChain::scan_visible_ro() takes &dyn StorageEngine (immutable) and never
sets the flag. Used by CatalogReader and other callers that hold only a shared
reference. Catalog tables are small (a few pages) and not hot enough to warrant
the lazy-set write.
Sequential Scan Prefetch Hint (Optimization C)
What it is
StorageEngine::prefetch_hint(start_page_id, count) — a hint method telling
the backend that pages starting at start_page_id will be read sequentially.
Implementations that do not support prefetch provide a default no-op.
Inspired by PostgreSQL’s read_stream.c adaptive lookahead.
API
// Default no-op in the trait — all existing backends compile unchanged
fn prefetch_hint(&self, start_page_id: u64, count: u64) {}
MmapStorage overrides this with madvise(MADV_SEQUENTIAL) on macOS and Linux:
#[cfg(any(target_os = "linux", target_os = "macos"))]
fn prefetch_hint(&self, start_page_id: u64, count: u64) {
    // SAFETY: ptr derived from live MmapMut, offset < mmap_len verified,
    // clamped_len <= mmap_len - offset. madvise is a pure hint.
    let _ = unsafe { libc::madvise(ptr, clamped_len, libc::MADV_SEQUENTIAL) };
}
count = 0 uses the backend default (PREFETCH_DEFAULT_PAGES = 64, 1 MB).
Call sites
HeapChain::scan_visible(), scan_rids_visible(), and delete_batch() each
call storage.prefetch_hint(root_page_id, 0) once before their scan loop. This
tells the OS kernel to begin async read-ahead for the pages that follow,
overlapping disk I/O with CPU processing of the current page.
When it helps
The hint has measurable impact on cold-cache workloads (data not in OS page
cache). On warm cache (mmap pages already faulted in), madvise is accepted
but the kernel takes no additional action — no performance regression.
Lazy Column Decode (Optimization B)
What it is
decode_row_masked(bytes, schema, mask) — a variant of decode_row that accepts
a boolean mask. When mask[i] == false, the column’s wire bytes are skipped
(cursor advanced, no allocation) and Value::Null is placed in the output slot.
Inspired by PostgreSQL’s selective column access in the executor.
API
pub fn decode_row_masked(
    bytes: &[u8],
    schema: &[DataType],
    mask: &[bool], // mask.len() must equal schema.len()
) -> Result<Vec<Value>, DbError>
For skipped columns:
- Fixed-length types (Bool=1B, Int/Date=4B, BigInt/Real/Timestamp=8B, Decimal=17B, Uuid=16B): ensure_bytes is called, then pos advances — no allocation.
- Variable-length types (Text, Bytes): the 3-byte length prefix is read to advance pos by 3 + len — the payload is never copied or parsed.
- NULL columns (bitmap bit set): no wire bytes, cursor unchanged regardless of mask.
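A reduced runnable model of the masked decode, with just two wire types (a 4-byte Int and a length-prefixed Text; this sketch uses a 2-byte length prefix and omits the null bitmap, unlike the real codec):

```rust
/// Reduced wire types for the sketch.
#[derive(Clone, Copy)]
enum Ty { Int, Text }

#[derive(Debug, PartialEq)]
enum Val { Null, Int(i32), Text(String) }

/// Decode a row, skipping (cursor-advance only) columns with mask[i] == false.
fn decode_row_masked(bytes: &[u8], schema: &[Ty], mask: &[bool]) -> Vec<Val> {
    let mut pos = 0;
    let mut out = Vec::with_capacity(schema.len());
    for (i, ty) in schema.iter().enumerate() {
        match ty {
            Ty::Int => {
                if mask[i] {
                    let v = i32::from_le_bytes(bytes[pos..pos + 4].try_into().unwrap());
                    out.push(Val::Int(v));
                } else {
                    out.push(Val::Null); // skipped: no parse, no allocation
                }
                pos += 4;
            }
            Ty::Text => {
                let len = u16::from_le_bytes(bytes[pos..pos + 2].try_into().unwrap()) as usize;
                pos += 2;
                if mask[i] {
                    out.push(Val::Text(String::from_utf8(bytes[pos..pos + len].to_vec()).unwrap()));
                } else {
                    out.push(Val::Null); // payload never copied or parsed
                }
                pos += len;
            }
        }
    }
    out
}

fn main() {
    // row: Int(7), Text("hi"), Int(9)
    let mut row = 7i32.to_le_bytes().to_vec();
    row.extend_from_slice(&2u16.to_le_bytes());
    row.extend_from_slice(b"hi");
    row.extend_from_slice(&9i32.to_le_bytes());
    let vals = decode_row_masked(&row, &[Ty::Int, Ty::Text, Ty::Int], &[true, false, true]);
    assert_eq!(vals, vec![Val::Int(7), Val::Null, Val::Int(9)]);
}
```

The point of the sketch is the cursor discipline: every column advances pos, but only masked-in columns pay for parsing or copying.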
Column mask computation
The executor computes the mask via collect_column_refs(expr, mask), which walks
the AST and marks every Expr::Column { col_idx } reference. It does not recurse
into subquery bodies (different row scope).
SELECT * (Wildcard/QualifiedWildcard) always produces None — decode_row()
is used directly with no overhead.
When all mask bits are true, scan_table also uses decode_row() directly.
Where it applies
- execute_select_ctx (single-table SELECT): mask covers SELECT list + WHERE + ORDER BY + GROUP BY + HAVING
- execute_delete_ctx (DELETE with WHERE): mask covers the WHERE clause only (the no-WHERE path uses scan_rids_visible — no decode at all)
Clustered Leaf Page-Buffer Mutation Primitives (Phase 39.22)
Three public primitives in crates/axiomdb-storage/src/clustered_leaf.rs enable
zero-allocation in-place UPDATE for fixed-size columns.
cell_row_data_abs_off
pub fn cell_row_data_abs_off(page: &Page, cell_idx: usize) -> Result<(usize, usize), DbError>
Computes the absolute byte offset of row_data within the page buffer for a
given cell index without decoding the cell. Returns (row_data_abs_off, key_len).
Formula:
row_data_abs_off = HEADER_SIZE + body_off + CELL_META_SIZE + ROW_HEADER_SIZE + key_len
Used by the UPDATE fast path to locate field bytes directly in the page buffer —
no cell.row_data.to_vec() required.
patch_field_in_place
pub fn patch_field_in_place(page: &mut Page, field_abs_off: usize, new_bytes: &[u8]) -> Result<(), DbError>
Overwrites new_bytes.len() bytes at field_abs_off within the page buffer.
Validates that field_abs_off + new_bytes.len() <= PAGE_SIZE. This is the
AxiomDB equivalent of InnoDB’s btr_cur_upd_rec_in_place().
btr_cur_upd_rec_in_place writes only changed bytes within the B-tree page buffer. AxiomDB implements the same technique with a pure-Rust zero-unsafe byte-write primitive. For UPDATE t SET score = score + 1 on a 25K-row clustered table, this reduces per-row work from ~469 bytes (full decode + encode + heap alloc) to ~28 bytes (read field + write field), cutting allocations from 5 per row to zero.
update_row_header_in_place
pub fn update_row_header_in_place(page: &mut Page, cell_idx: usize, new_header: &RowHeader) -> Result<(), DbError>
Overwrites the 24-byte RowHeader at the exact page offset for a given cell.
Used after patch_field_in_place to stamp the new txn_id_created and
incremented row_version without re-encoding the full cell.
Split-Phase Pattern (Rust Borrow Checker Compatibility)
The UPDATE fast path uses a split-phase read/write pattern to satisfy the Rust borrow checker — the immutable page borrow (read phase) must be fully dropped before the mutable borrow (write phase) begins:
// Read phase: immutable borrow — compute field locations, capture old bytes
let (row_data_abs_off, _) = cell_row_data_abs_off(&page, idx)?;
let (field_writes, any_change) = {
    let b = page.as_bytes();
    // ... compute loc, capture old_buf: [u8;8], encode new_buf: [u8;8]
    // MAYBE_NOP: if old_buf[..loc.size] == new_buf[..loc.size] { skip }
    (field_writes_vec, changed)
}; // immutable borrow dropped here
if !any_change { continue; }
// Write phase: mutable borrow — patch page buffer directly
for (field_abs, size, _, new_buf) in &field_writes {
    patch_field_in_place(&mut page, *field_abs, &new_buf[..*size])?;
}
update_row_header_in_place(&mut page, idx, &new_header)?;
This split-phase pattern avoids the per-row cell.row_data.to_vec() heap allocation by keeping field locations in a Vec<(usize, usize, [u8;8], [u8;8])> whose fixed-size byte buffers live inline in each entry — computed during the immutable phase and consumed during the mutable phase. This is the same invariant InnoDB enforces manually with pointer arithmetic.
WAL and Crash Recovery
The Write-Ahead Log (WAL) is AxiomDB’s durability mechanism. Before any change reaches the storage engine’s pages, a record of that change is appended to the WAL file. On crash recovery, the WAL is replayed to reconstruct any changes that were committed but not yet flushed to the data file.
WAL File Layout
The WAL file starts with a 32-byte file header followed by an unbounded sequence of WAL entries.
File Header — 32 bytes
Offset Size Field
0 4 magic — 0x57414C4E ("WALN") — identifies a valid WAL file
4 2 version — WAL format version (currently 1)
6 26 _reserved — Future use
WalReader::open verifies the magic and version before any scan. An incorrect
magic returns DbError::WalInvalidHeader.
Entry Binary Format
Each WAL entry is a self-delimiting binary record. The total entry length is stored both at the beginning and at the end to support both forward and backward scanning.
Offset Size Field
──────── ─────────── ─────────────────────────────────────────────────────
0 4 entry_len u32 LE — total entry length in bytes
4 8 lsn u64 LE — Log Sequence Number (globally monotonic)
12 8 txn_id u64 LE — Transaction ID (0 = autocommit)
20 1 entry_type u8 — EntryType (see below)
21 4 table_id u32 LE — table identifier (0 = system operations)
25 2 key_len u16 LE — key length in bytes (0 for BEGIN/COMMIT/ROLLBACK)
27 key_len key [u8] — mutation key bytes (heap RID or clustered PK)
?          4            old_val_len u32 LE — old value length (0 for INSERT, BEGIN, COMMIT, ROLLBACK)
?          old_val_len  old_value [u8] — old encoded row (empty on INSERT)
?          4            new_val_len u32 LE — new value length (0 for DELETE, BEGIN, COMMIT, ROLLBACK)
?          new_val_len  new_value [u8] — new encoded row (empty on DELETE)
? 4 crc32c u32 LE — CRC32c of all preceding bytes in this entry
? 4 entry_len_2 u32 LE — copy of entry_len for backward scan
Minimum size (no key, no values): 4+8+8+1+4+2 + 4+4+4+4 = 43 bytes
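The 43-byte minimum can be checked by serializing a payload-free transaction entry field by field, following the table above. In this sketch the CRC32c is stubbed to 0 (computing the real checksum is orthogonal to the layout):

```rust
/// Serialize a payload-free WAL entry (Begin/Commit/Rollback) following the
/// entry layout above. The CRC field is stubbed to 0 here; the real format
/// stores a CRC32c over all preceding bytes of the entry.
fn serialize_txn_entry(lsn: u64, txn_id: u64, entry_type: u8) -> Vec<u8> {
    const ENTRY_LEN: u32 = 43; // minimum size: no key, no values
    let mut buf = Vec::with_capacity(ENTRY_LEN as usize);
    buf.extend_from_slice(&ENTRY_LEN.to_le_bytes()); // entry_len      (4)
    buf.extend_from_slice(&lsn.to_le_bytes());       // lsn            (8)
    buf.extend_from_slice(&txn_id.to_le_bytes());    // txn_id         (8)
    buf.push(entry_type);                            // entry_type     (1)
    buf.extend_from_slice(&0u32.to_le_bytes());      // table_id       (4)
    buf.extend_from_slice(&0u16.to_le_bytes());      // key_len = 0    (2)
    buf.extend_from_slice(&0u32.to_le_bytes());      // old_val_len    (4)
    buf.extend_from_slice(&0u32.to_le_bytes());      // new_val_len    (4)
    buf.extend_from_slice(&0u32.to_le_bytes());      // crc32c (stub)  (4)
    buf.extend_from_slice(&ENTRY_LEN.to_le_bytes()); // entry_len_2    (4)
    buf
}
```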
Why entry_len_2 at the end
To traverse the WAL backward (during ROLLBACK or crash recovery), the reader needs to find the start of the previous entry given only the current position (end of entry).
entry_start = current_position - entry_len_2
The reader seeks to entry_start, reads entry_len, verifies it equals entry_len_2,
then reads the full entry. If the lengths do not match, the entry is corrupt.
entry_len at both ends of every entry enables backward scanning with a single seek per entry — no secondary index or reverse pointer table needed. The cost is 4 bytes per entry (overhead for a WAL with 10M entries: 40 MB, negligible relative to data payload).
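The backward step described above can be sketched as pure offset arithmetic over an in-memory WAL buffer (the real WalReader does the same with seeks on a File; the minimum-size constant mirrors the 43-byte floor above):

```rust
/// Given the absolute end offset of an entry in a WAL byte buffer, find the
/// start of that entry via the trailing entry_len_2 copy, and verify it
/// matches the leading entry_len — the corruption check described above.
fn prev_entry_start(wal: &[u8], entry_end: usize) -> Result<usize, &'static str> {
    const MIN_ENTRY: usize = 43;
    if entry_end > wal.len() || entry_end < MIN_ENTRY {
        return Err("position out of range");
    }
    let len2 =
        u32::from_le_bytes(wal[entry_end - 4..entry_end].try_into().unwrap()) as usize;
    if len2 < MIN_ENTRY {
        return Err("implausible entry length");
    }
    let start = entry_end.checked_sub(len2).ok_or("entry_len_2 past file start")?;
    let len = u32::from_le_bytes(wal[start..start + 4].try_into().unwrap()) as usize;
    if len != len2 {
        return Err("entry_len mismatch — corrupt entry");
    }
    Ok(start)
}
```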
Mutation Key Encoding
Heap and clustered mutations do not use the same key contract:
Heap INSERT / UPDATE / DELETE / UpdateInPlace:
key_len = 10
key[0..8] = page_id as u64 LE
key[8..10] = slot_id as u16 LE
ClusteredInsert / ClusteredDeleteMark / ClusteredUpdate (Phases 39.11 / 39.12):
key_len = primary_key_bytes.len()
key = encoded primary-key bytes
Heap mutations still record the exact page and slot where the row was written,
so redo can target the same physical location directly. Clustered mutations do
not: clustered pages defragment, split, merge, and relocate rows, so (page_id, slot_id) is not a stable undo key. Their payloads instead store the exact
logical row image and the latest clustered root_pid.
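The heap key contract above is small enough to show in full — a fixed 10-byte layout of page_id (u64 LE) followed by slot_id (u16 LE):

```rust
/// Encode the 10-byte heap mutation key: page_id (u64 LE) + slot_id (u16 LE).
fn encode_heap_key(page_id: u64, slot_id: u16) -> [u8; 10] {
    let mut key = [0u8; 10];
    key[0..8].copy_from_slice(&page_id.to_le_bytes());
    key[8..10].copy_from_slice(&slot_id.to_le_bytes());
    key
}

/// Decode a heap key back into its physical location. Clustered keys carry
/// variable-length primary-key bytes instead, so length 10 is not assumed
/// for them — a non-10-byte key simply isn't a heap key.
fn decode_heap_key(key: &[u8]) -> Option<(u64, u16)> {
    if key.len() != 10 {
        return None;
    }
    let page_id = u64::from_le_bytes(key[0..8].try_into().ok()?);
    let slot_id = u16::from_le_bytes(key[8..10].try_into().ok()?);
    Some((page_id, slot_id))
}
```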
Entry Types
pub enum EntryType {
    Begin = 1,           // START of an explicit transaction
    Commit = 2,          // COMMIT — all preceding entries for this txn_id are durable
    Rollback = 3,        // ROLLBACK — all preceding entries for this txn_id must be undone
    Insert = 4,          // INSERT: old_value is empty; new_value is the encoded new row
    Delete = 5,          // DELETE: old_value is the encoded row before deletion; new_value empty
    Update = 6,          // UPDATE: both old_value and new_value are present
    Checkpoint = 7,      // CHECKPOINT: marks the LSN up to which pages are flushed to disk
    Truncate = 8,        // Full-table delete (DELETE without WHERE, TRUNCATE TABLE)
    PageWrite = 9,       // Bulk insert page image + slot list
    UpdateInPlace = 10,  // Stable-RID same-slot update
    ClusteredInsert = 12,     // Clustered insert keyed by PK + exact new row image
    ClusteredDeleteMark = 13, // Clustered delete-mark keyed by PK + old/new row image
    ClusteredUpdate = 14,     // Clustered update keyed by PK + old/new row image
}
Transaction entries (Begin, Commit, Rollback) carry no key or value payload —
key_len = 0, old_val_len = 0, new_val_len = 0. The minimum entry size of 43 bytes
applies to these records.
PageWrite and UpdateInPlace are physical optimization records. They do not change
SQL-visible semantics; they only change how AxiomDB amortizes I/O for common write
patterns while preserving rollback and crash recovery guarantees.
WalEntry::Truncate — Full-Table Delete
WalEntry::Truncate (entry type 8) is emitted instead of N individual Delete
entries when a statement deletes every row in a table: DELETE FROM t without a
WHERE clause, and TRUNCATE TABLE t.
Binary Format
Field Value
─────────────── ────────────────────────────────────────────────────────
entry_type 8 (Truncate)
table_id the target table's ID (u32 LE)
key_len 8
key[0..8] root_page_id of the HeapChain as u64 LE
old_val_len 0 (empty — no per-row data stored)
new_val_len 0 (empty)
The key encodes the heap chain’s root page rather than a single slot, because the undo operation scans the entire chain.
Why One Entry Instead of N
For a 10,000-row table, the per-row path writes 10,000 Delete WAL entries. Each
entry carries at minimum 43 bytes of header plus the encoded row payload (old_value),
which may be hundreds of bytes. WalEntry::Truncate replaces all N entries with a
single 51-byte record (43-byte minimum + 8-byte key).
Per-row Delete path (N = 10,000 rows, avg 100-byte payload):
WAL entries: 10,000
WAL bytes written: 10,000 × (43 + 10 + 100) ≈ 1.5 MB
Truncate path:
WAL entries: 1
WAL bytes written: 51 bytes
InnoDB, by contrast, writes one undo record per row for DELETE FROM t without a WHERE clause. For a 10K-row table, InnoDB writes ~10,000 undo records; AxiomDB writes 1 WAL entry. This is the same optimization that MariaDB's storage engine API exposes via ha_delete_all_rows(), but AxiomDB applies it at the WAL level, not just the engine level.
Undo — Rollback and Crash Recovery
Because WalEntry::Truncate stores no per-row state, undo cannot simply replay
individual slot reverts from the WAL. Instead, undo calls
HeapChain::clear_deletions_by_txn(txn_id), which scans the heap chain and clears
the txn_id_deleted stamp on every slot that was deleted by this transaction:
Undo of WalEntry::Truncate for txn_id T:
for each page in the HeapChain:
read_page(page_id)
for each slot on the page:
if slot.txn_id_deleted == T:
slot.txn_id_deleted = 0
slot.deleted = 0
write_page(page_id, page)
The physical heap is fully restored: all rows that were alive before the DELETE become visible again to transactions with a snapshot predating txn_id T.
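The undo scan above can be sketched against an in-memory model of the heap — Slot here is a hypothetical reduction to the two fields the scan touches; the real code reads and rewrites whole heap pages:

```rust
/// In-memory sketch of truncate undo: clear the delete stamp on every slot
/// deleted by transaction `t`. The outer loop is O(P) in pages.
#[derive(Clone)]
struct Slot {
    txn_id_deleted: u64,
    deleted: u8,
}

fn clear_deletions_by_txn(pages: &mut Vec<Vec<Slot>>, t: u64) -> usize {
    let mut cleared = 0;
    for page in pages.iter_mut() {
        for slot in page.iter_mut() {
            if slot.txn_id_deleted == t {
                slot.txn_id_deleted = 0; // row is live again
                slot.deleted = 0;
                cleared += 1;
            }
        }
    }
    cleared
}
```

Slots deleted by other transactions are untouched, so a concurrent committed DELETE is never accidentally undone.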
An alternative design would store the affected slot list in the Truncate entry itself, enabling O(N) targeted undo without a full scan. We chose the scan approach because: (1) WAL writes are on the critical path of every DELETE; (2) undo (rollback and crash recovery) is rare relative to DELETE frequency; (3) the scan is O(P) in pages, not O(N) in rows, and P ≪ N at 200 rows/page. The trade-off mirrors MariaDB's ha_delete_all_rows() philosophy: optimize the common path (write), accept a bounded cost on the uncommon path (undo).
Crash Recovery Handling
During WAL replay, when the recovery engine encounters WalEntry::Truncate for a
committed transaction, it calls HeapChain::delete_batch() with all live slot IDs
found by scan_rids_visible() — re-applying the deletion to any pages that may not
have been flushed before the crash. If the transaction was not committed (no matching
Commit entry in the WAL), the entry is skipped: the heap still contains the
pre-delete state because the crash occurred before the commit was durable.
WalEntry::UpdateInPlace — Stable-RID UPDATE
WalEntry::UpdateInPlace (entry type 10) records a same-slot heap rewrite. It is
emitted when UPDATE can preserve the original (page_id, slot_id) because the new
encoded row still fits in the existing heap slot.
Since 6.20, the executor may emit many UpdateInPlace records through one
record_update_in_place_batch(...) call. The on-disk format does not change:
the optimization is only in how normal entries are serialized and appended
(reserve_lsns(...) + write_batch(...) once per statement instead of one append
call per row).
Binary Format
Field Value
─────────────── ───────────────────────────────────────────────────────────────
entry_type 10 (UpdateInPlace)
table_id target table ID
key logical row key carried by the caller
old_value [page_id:8][slot_id:2][old tuple image...]
new_value [page_id:8][slot_id:2][new tuple image...]
The tuple image is the full logical row image stored in the slot:
[RowHeader || encoded row bytes]
Undo and crash recovery decode the physical location from the first 10 bytes and then restore the old tuple image directly into the same slot.
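Decoding that payload is a fixed split at byte 10, per the format table above:

```rust
/// Decode the [page_id:8][slot_id:2][tuple image...] payload carried by
/// UpdateInPlace old_value / new_value.
fn decode_in_place_payload(val: &[u8]) -> Result<(u64, u16, &[u8]), &'static str> {
    if val.len() < 10 {
        return Err("payload shorter than location prefix");
    }
    let page_id = u64::from_le_bytes(val[0..8].try_into().unwrap());
    let slot_id = u16::from_le_bytes(val[8..10].try_into().unwrap());
    // Everything after the 10-byte prefix is the full tuple image:
    // [RowHeader || encoded row bytes]
    Ok((page_id, slot_id, &val[10..]))
}
```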
Why a New Entry Type Instead of Reusing Update
Classic Update in AxiomDB means logical delete+insert and therefore carries two
different physical locations. UpdateInPlace means “same physical location, bytes
changed in place”. Reusing Update would blur those two recovery contracts and make
undo logic branch on payload shape instead of entry type.
Undo and Recovery
Rollback and crash recovery treat UpdateInPlace as a direct restore:
read page(page_id)
restore old tuple image at slot_id
write page(page_id)
If the transaction committed, recovery leaves the rewritten bytes in place. If the
transaction did not commit, recovery restores old_value to the same slot.
Clustered Mutation Entries (Phases 39.11 / 39.12)
Phase 39.11 adds the first WAL contract for clustered rows, and Phase 39.12
extends it into clustered crash recovery:
key = encoded primary-key bytes
old_value = ClusteredRowImage? // absent on insert
new_value = ClusteredRowImage? // absent on pure delete undo payload
ClusteredRowImage:
[root_pid: u64]
[RowHeader: 24B]
[row_len: u32]
[row_data bytes]
TxnManager now tracks the latest clustered root_pid per table_id inside the
active transaction. Rollback and ROLLBACK TO SAVEPOINT use that tracked root
and clustered-tree helpers:
- undo clustered insert → delete_physical_by_key(...)
- undo clustered delete-mark / update → restore_exact_row_image(...)
Phases 39.14, 39.16, and 39.17 are the first SQL-visible executor users of that contract:
- a fresh clustered SQL insert records ClusteredInsert
- reusing a snapshot-invisible delete-marked clustered PK records ClusteredUpdate, because rollback must restore the old tombstone image, not simply delete the new row
- clustered SQL update now records the exact old clustered row image before the rewrite, even for same-leaf in-place updates and relocate-updates
- clustered SQL delete now records the exact old clustered row image before the delete-mark so rollback/savepoints can restore the prior txn_id_deleted = 0 state exactly
- clustered secondary bookmark entries still use the ordinary B+ Tree undo path, but 39.16 extends that undo to both halves of a rewritten secondary key: rollback can delete newly inserted bookmark entries and reinsert the old physical bookmark entry against the current index root
The invariant is intentionally logical: rollback restores the old primary-key row state, not the exact pre-change page topology. A relocate-update may split or merge the tree on the forward path, and rollback may restore the old row into a different physical leaf as long as the visible row state matches the original.
39.12 now uses the same payloads during crash recovery:
- reverse-undo in-progress clustered inserts by delete_physical_by_key(...)
- reverse-undo in-progress clustered delete-marks/updates by restore_exact_row_image(...)
- track the current clustered root per table while recovery undoes those writes
- seed TxnManager::open_with_recovery(...) with the final recovered root map
TxnManager::open(...) also reconstructs the latest committed clustered root
per table from surviving WAL history on a clean reopen.
Checkpoint Protocol — 5 Steps
A checkpoint ensures that all dirty pages below a given LSN are written to the .db
file so that WAL entries before that LSN can be safely truncated.
Step 1: Write a Checkpoint entry to the WAL with the current LSN.
This entry marks the start of the checkpoint.
Step 2: Call storage.flush() — ensures all dirty mmap pages are written
to disk via msync(). After this point, every page modification
with LSN ≤ checkpoint_lsn is on disk.
Step 3: Update the meta page (page 0) with the new checkpoint_lsn.
This is the commit point: if we crash after step 3, recovery
can skip all WAL entries with LSN ≤ checkpoint_lsn.
Step 4: Write the updated meta page to disk (flush again, just for page 0).
Step 5: Optionally truncate the WAL file, removing all entries with
        LSN ≤ checkpoint_lsn. (WAL rotation is planned — currently the
        WAL grows between checkpoints and is truncated wholesale at
        each checkpoint.)
If the process crashes between step 2 and step 3, the checkpoint LSN in the meta page still points to the previous checkpoint. Recovery replays from the old checkpoint LSN — this is safe because step 2 already flushed the pages.
Crash Recovery State Machine
AxiomDB tracks its recovery state through five well-defined phases. The state transitions are strictly sequential; no transition can be skipped.
CRASHED
│
│ detect: last shutdown was not clean (no clean-close marker)
▼
RECOVERING
│
│ open .db file: verify meta page checksum and format version
│ open .wal file: verify WAL header magic and version
▼
REPLAYING_WAL
│
│ scan WAL forward from checkpoint_lsn
│ for each entry with LSN > checkpoint_lsn:
│ if entry.txn_id is in the committed_set:
│ replay the mutation (redo)
│ else:
│ skip (uncommitted changes are discarded by ignoring)
│
│ committed_set = {txn_id for all txn_ids with a Commit entry in the WAL}
▼
VERIFYING
│
│ run heap structural check (all slot offsets within bounds,
│ no overlapping tuples, free_start < free_end)
│ run MVCC consistency check (xmin ≤ xmax for all live rows)
▼
READY
│
│ normal operation resumes
Why no UNDO pass
AxiomDB’s replay path is redo-only for the classic heap WAL entries that are
already replayable. Uncommitted transactions are simply ignored during the
forward scan. Because that heap WAL records physical locations (page_id, slot_id), the page that contained the uncommitted write is overwritten with the
committed state from the WAL. If the page has no committed mutations after the
checkpoint, it retains its pre-crash state (which was correct, because the
checkpoint flushed all committed changes up to checkpoint_lsn).
This avoids the dedicated UNDO pass required by ARIES-style redo/undo recovery (as in InnoDB), which must roll back uncommitted changes to B+ Tree pages in reverse order. Physical WAL with redo-only recovery is simpler and faster.
A physical WAL keyed by (page_id, slot_id) requires only one forward pass — uncommitted writes are simply overwritten by committed redo entries.
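The REPLAYING_WAL phase above reduces to two passes over the log. This sketch models an entry as only the fields recovery branches on (a hypothetical simplification — the real entries carry keys and payloads) and returns the LSNs that would be redone:

```rust
use std::collections::HashSet;

/// Simplified WAL entry: only the fields the recovery filter looks at.
struct Entry {
    lsn: u64,
    txn_id: u64,
    is_commit: bool,
    is_mutation: bool,
}

/// Redo-only replay filter: pass 1 builds committed_set, pass 2 keeps only
/// post-checkpoint mutations from committed transactions. Uncommitted
/// changes are discarded simply by being skipped.
fn replayable_lsns(wal: &[Entry], checkpoint_lsn: u64) -> Vec<u64> {
    let committed: HashSet<u64> =
        wal.iter().filter(|e| e.is_commit).map(|e| e.txn_id).collect();
    wal.iter()
        .filter(|e| e.lsn > checkpoint_lsn)
        .filter(|e| e.is_mutation)
        .filter(|e| committed.contains(&e.txn_id))
        .map(|e| e.lsn)
        .collect()
}
```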
For clustered entries, 39.12 adds the first recovery extension on top of that
model: unresolved clustered transactions are now undone by primary key and exact
row image instead of returning NotImplemented. The remaining gap is narrower:
clustered root persistence still depends on surviving WAL history and is not yet
checkpoint/rotation-stable.
WalReader Design
WalReader is stateless. It stores only the file path. Each scan call opens a new
File handle.
Forward scan (scan_forward): uses BufReader<File> to amortize syscall overhead
on sequential reads. Reads are sequential and predictable — the OS readahead prefetches
the next WAL sectors automatically.
Backward scan (scan_backward): uses a seekable File directly. BufReader
would be counterproductive here because seeks invalidate the read buffer. Each backward
step seeks to current_pos - 4 to read entry_len_2, then seeks back to
current_pos - entry_len_2 to read the full entry.
Corruption handling: both iterators return Result<WalEntry>. On the first corrupt
entry (truncated bytes, CRC mismatch, unknown entry type), the iterator yields an Err
and stops. The caller decides whether to propagate or recover gracefully.
WAL and Concurrency
ConcurrentWalWriter (Phase 40.4)
ConcurrentWalWriter replaces the single-threaded WalWriter inside TxnManager.
All public methods take &self — multiple transactions submit WAL entries without
serializing on a single exclusive lock.
Thread A Thread B
│ │
reserve_lsn()│ fetch_add(1,Relaxed) │reserve_lsn() ← lock-free ~2 ns
│ serialize entries │serialize entries ← fully parallel
│ │
Mutex<WriteQueue>::push() │ ← ~1 µs each
Mutex<WriteQueue>::push()
│ │
commit() │commit()
│ │
┌────▼──────────────────────▼────┐
│ Mutex<WriterState> (leader) │ ← one leader per fsync batch
│ drain_sorted() from queue │
│ write_entries() → BufWriter │
│ flush() → OS page cache │
│ fdatasync() → durable on disk │ ← one fsync covers all pending
│ flushed_lsn.fetch_max(...) │
└────────────────────────────────┘
Lock ordering (no deadlock):
- submit_entry: acquires queue_mutex only.
- flush_and_sync: acquires writer_mutex first, then queue_mutex briefly for drain.
- No function holds queue_mutex while waiting for writer_mutex.
Drop behavior: ConcurrentWalWriter::drop() calls flush_no_sync() — drains
the queue and flushes the BufWriter to the OS page cache without fsync. This
mirrors BufWriter<File>::drop and preserves crash-simulation semantics (durability
tests call drop(mgr) to simulate a process exit with OS cache flushed).
Single-Writer Model (pre-40.4)
Before Phase 40.4, WAL writes serialized through a single WalWriter inside TxnManager.
The server runtime uses Arc<tokio::sync::RwLock<Database>>: readers may overlap,
but mutating statements still serialize behind the write guard. This eliminates
write-write conflicts without record-level locking (Phase 13.7 will lift this
constraint).
WAL Fsync Pipeline (Phase 6.19)
The old timer-based CommitCoordinator from 3.19 is now superseded in the
server path by an always-on leader-based fsync pipeline inspired by
MariaDB’s group_commit_lock.
Connections still write Commit entries into the WAL BufWriter, but the
handoff after that changed:
- the connection calls pipeline.acquire(commit_lsn, txn_id)
- if another leader already flushed past commit_lsn → Expired
- if no leader is active → Acquired, this connection performs flush + fsync
- if a leader is active → Queued(rx), this connection releases the DB lock and awaits confirmation
Conn A → lock → DML → commit_deferred() → pipeline.acquire(42) → Acquired
flush+fsync → release_ok(42) → unlock → OK
Conn B → lock → DML → commit_deferred() → pipeline.acquire(43) → Queued(rx)
unlock → await rx ──────────────────────────────────────────────┐
Leader A fsync completes → flushed_lsn = 43 → wake B ─────────────────────────────┘
Conn C → lock → DML → commit_deferred() → pipeline.acquire(41) → Expired → OK
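The three-way acquire decision in that trace can be sketched as a small state machine — a hypothetical reduction of FsyncPipeline to the two fields the decision reads (the real struct also tracks pending_lsn and a waiter queue):

```rust
/// Outcome of a commit's attempt to join the fsync pipeline.
#[derive(Debug, PartialEq)]
enum AcquireResult {
    Expired,  // a previous leader's fsync already covered this commit_lsn
    Acquired, // this connection becomes the leader and performs flush + fsync
    Queued,   // a leader is active; wait for its confirmation
}

struct Pipeline {
    flushed_lsn: u64,
    leader_active: bool,
}

fn acquire(p: &mut Pipeline, commit_lsn: u64) -> AcquireResult {
    if p.flushed_lsn >= commit_lsn {
        AcquireResult::Expired
    } else if !p.leader_active {
        p.leader_active = true;
        AcquireResult::Acquired
    } else {
        AcquireResult::Queued
    }
}
```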
Durability Guarantee
A connection does not receive Ok until the fsync covering its Commit
entry completes. max_committed advances only after the leader confirms
durability. If the process crashes before that fsync, the transaction is lost
and no client received Ok. The durability guarantee is therefore identical to
inline fsync; only the scheduling changes.
Key Structures
| Component | Location | Role |
|---|---|---|
| FsyncPipeline | axiomdb-wal/src/fsync_pipeline.rs | Shared state: flushed_lsn, leader_active, pending_lsn, waiter queue |
| AcquireResult | same file | Expired / Acquired / Queued(rx) outcome for each commit |
| TxnManager::deferred_commit_mode | axiomdb-wal/src/txn.rs | Internal hook used by the server path to defer inline fsync until the pipeline leader runs |
| TxnManager::advance_committed() | same file | Advances max_committed to max(batch_txn_ids) after fsync |
| Database::take_commit_rx() | axiomdb-network/src/mysql/database.rs | Bridges SQL execution to pipeline acquire / leader fsync / follower await |
PageWrite Entry (Phase 3.18)
WalEntry::PageWrite (entry type 9) replaces N Insert entries with one entry per
heap page during bulk inserts. Instead of serializing one entry per row, the executor
groups rows by their target page and writes a single entry per page.
key: page_id as u64 LE (8 bytes)
old_value: empty
new_value: [page_bytes: PAGE_SIZE][num_slots: u16 LE][slot_id × N: u16 LE]
The page_bytes field contains the full post-modification page (16 KB for the default
page size). The embedded slot_ids let crash recovery undo uncommitted PageWrite
entries at slot granularity — identical in effect to undoing N individual Insert entries.
CPU cost comparison for 10K-row bulk insert (~42 pages at 16 KB):
Insert path (3.17): 10,000 × serialize_into() + 10,000 × CRC32c ← O(N rows)
PageWrite (3.18): 42 × serialize_into() + 42 × CRC32c ← O(P pages) — 238× less
WAL file size comparison for 10K rows:
Insert entries: 10,000 × ~100B = ~1 MB
PageWrite: 42 × ~16.9 KB = ~710 KB ← 30% smaller
Crash recovery for uncommitted PageWrite:
for each PageWrite entry in uncommitted txn:
page_id = entry.key[0..8] as u64 LE
num_slots = entry.new_value[PAGE_SIZE..+2] as u16 LE
for i in 0..num_slots:
slot_id = entry.new_value[PAGE_SIZE+2+i*2..+2] as u16 LE
mark_slot_dead(storage, page_id, slot_id) // same as undoing Insert
Batch WAL Append (Phase 3.17)
For bulk inserts (INSERT INTO t VALUES (r1),(r2),...) TxnManager::record_insert_batch()
writes all N Insert WAL entries in a single write_all call:
Per-row path (before 3.17):
for each of N rows: append_with_buf(entry, scratch) ← N × write_all to BufWriter
Batch path (3.17):
lsn_base = wal.reserve_lsns(N)
for each row: entry.serialize_into(&mut wal_scratch) ← accumulate in RAM
wal.write_batch(&wal_scratch) ← 1 × write_all
The entries written to disk are byte-for-byte identical to the per-row path — crash recovery reads them the same way. The improvement is purely in CPU and syscall overhead: O(1) BufWriter calls instead of O(N).
Combined with HeapChain::insert_batch() (O(P) page writes for P pages) and
a single parse+analyze pass for multi-row VALUES, the full bulk INSERT pipeline
is O(P) in both storage I/O and WAL I/O, where P = number of pages filled ≈ N/200.
The flush + fsync step still runs outside the WAL writer's internal append mutex, under the existing database write lock.
Compact PageWrite Format
The WalEntry::PageWrite entry was updated to eliminate the 16 KB page image:
Old format (per page):
new_value = [page_bytes: 16384 B][num_slots: u16 LE][slot_ids: u16 × N]
New compact format (per page):
new_value = [num_slots: u16 LE][slot_ids: u16 × N]
Crash recovery only needs slot IDs to mark inserted slots dead on undo — it never uses the stored page bytes. Eliminating them reduces WAL size from ~820 KB to ~20 KB per 10K-row batch (40× reduction).
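Decoding the compact payload is a length-prefixed u16 list, per the format above:

```rust
/// Decode the compact PageWrite payload: [num_slots: u16 LE][slot_id × N: u16 LE].
/// Returns the slot IDs that recovery marks dead when undoing an
/// uncommitted bulk insert.
fn decode_compact_page_write(new_value: &[u8]) -> Result<Vec<u16>, &'static str> {
    if new_value.len() < 2 {
        return Err("missing num_slots");
    }
    let n = u16::from_le_bytes(new_value[0..2].try_into().unwrap()) as usize;
    if new_value.len() < 2 + n * 2 {
        return Err("truncated slot list");
    }
    Ok((0..n)
        .map(|i| u16::from_le_bytes(new_value[2 + i * 2..4 + i * 2].try_into().unwrap()))
        .collect())
}
```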
MVCC and Transactions
Multi-Version Concurrency Control (MVCC) is AxiomDB’s mechanism for deciding which
row versions are visible to a given statement or transaction. This page documents
the current implementation: the RowHeader format, the actual
TransactionSnapshot type, the single-active-transaction TxnManager, and the
server’s Arc<RwLock<Database>> concurrency model.
Implementation status: current code implements snapshot visibility, READ COMMITTED and REPEATABLE READ semantics, rollback/savepoints, deferred page reclamation, and concurrent read-only queries. It does not yet implement row-level writer concurrency, deadlock detection,
SELECT ... FOR UPDATE, or full SSI. Those are planned in Phases 13.7, 13.8, and 13.8b.
Core Concepts
Transaction ID (TxnId)
Every explicit transaction receives a unique, monotonically increasing u64
identifier. The value 0 means “no active write transaction” and is used by
autocommit reads.
Transaction Snapshot
A snapshot is the compact visibility token used by the current runtime.
pub struct TransactionSnapshot {
    pub snapshot_id: u64,
    pub current_txn_id: u64,
}
Meaning:
- snapshot_id = max_committed + 1 at the moment the snapshot is taken
- current_txn_id = txn_id of the active transaction, or 0 for read-only / autocommit reads
A row version is visible when:
- txn_id_created == current_txn_id, or txn_id_created < snapshot_id
- and txn_id_deleted == 0, or txn_id_deleted >= snapshot_id and the delete was not performed by current_txn_id
RowHeader — Per-Row Versioning
Every heap tuple begins with a RowHeader:
Offset Size Field Description
──────── ────── ─────────────── ───────────────────────────────────────────────
0 8 txn_id_created transaction that inserted this row version
8 8 txn_id_deleted transaction that deleted this row (0 = live)
16 4 row_version incremented on UPDATE
20 4 _flags reserved for future use
Total: 24 bytes
The full lifecycle of a row version:
INSERT in txn T1:
RowHeader { txn_id_created: T1, txn_id_deleted: 0, row_version: 0 }
DELETE in txn T2:
RowHeader { txn_id_created: T1, txn_id_deleted: T2, row_version: 0 }
UPDATE in txn T2 (implemented as DELETE + INSERT):
Old version: RowHeader { txn_id_created: T1, txn_id_deleted: T2, row_version: N }
New version: RowHeader { txn_id_created: T2, txn_id_deleted: 0, row_version: N+1 }
Batch DELETE and Full-Table DELETE
When a DELETE has a WHERE clause, TableEngine::delete_rows_batch() collects all
matching (page_id, slot_id) pairs and calls HeapChain::delete_batch() with them.
Each affected slot receives xmax = txn_id and deleted = 1 in a single pass per
page. The WAL receives one WalEntry::Delete per matched row (for correct per-row
redo/undo).
When a DELETE has no WHERE clause or is a TRUNCATE TABLE, the executor takes a
different path:
1. HeapChain::scan_rids_visible() collects live (page_id, slot_id) pairs without decoding row data.
2. HeapChain::delete_batch() marks all slots dead in O(P) page I/O.
3. A single WalEntry::Truncate is appended to the WAL instead of N per-row Delete entries.
The MVCC visibility result is identical to the per-row path: every slot has
xmax = txn_id and deleted = 1, so any snapshot with xmax ≤ txn_id will see
the row as deleted after the transaction commits. Concurrent readers that took their
snapshot before this transaction began continue to see all rows as live throughout
the delete — standard snapshot isolation.
Visibility Function
fn is_visible(row: &RowHeader, snap: &TransactionSnapshot, self_txn_id: u64) -> bool {
    let created_visible =
        row.txn_id_created == self_txn_id || row.txn_id_created < snap.snapshot_id;
    let not_deleted =
        row.txn_id_deleted == 0
            || (row.txn_id_deleted >= snap.snapshot_id
                && row.txn_id_deleted != self_txn_id);
    created_visible && not_deleted
}
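The rule can be exercised against the row lifecycle from the RowHeader section. This is a self-contained copy for experimentation, with RowHeader and TransactionSnapshot reduced to the fields the check reads:

```rust
/// Reduced copies of the types, carrying only what the visibility rule reads.
struct RowHeader {
    txn_id_created: u64,
    txn_id_deleted: u64,
}

struct TransactionSnapshot {
    snapshot_id: u64,
}

/// Same visibility rule as above: a version is visible if we created it or it
/// was committed before our snapshot, and it is not deleted from our view.
fn is_visible(row: &RowHeader, snap: &TransactionSnapshot, self_txn_id: u64) -> bool {
    let created_visible =
        row.txn_id_created == self_txn_id || row.txn_id_created < snap.snapshot_id;
    let not_deleted = row.txn_id_deleted == 0
        || (row.txn_id_deleted >= snap.snapshot_id && row.txn_id_deleted != self_txn_id);
    created_visible && not_deleted
}
```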
TxnManager
The current TxnManager is a single-active-transaction coordinator. Read-only
operations access it via shared refs for snapshot creation; mutating operations
access it via &mut TxnManager for begin/commit/rollback.
pub struct TxnManager {
    wal: WalWriter,
    next_txn_id: u64,
    max_committed: u64,
    active: Option<ActiveTxn>,
}
This is the main reason the current server runtime is still single-writer for
mutating statements: there is only one ActiveTxn slot for the whole opened
database, not one write transaction owner per connection.
BEGIN
1. Verify `active.is_none()`
2. Assign `txn_id = next_txn_id`
3. Append `Begin` to the WAL
4. Set `active = Some(ActiveTxn { txn_id, snapshot_id_at_begin, ... })`
5. Increment `next_txn_id`
COMMIT
1. Append `Commit` to the WAL
2. Flush/fsync via the current durability policy or fsync pipeline
3. Advance `max_committed`
4. Clear `active`
ROLLBACK
1. Replay undo ops in reverse order
2. Append `Rollback` to the WAL
3. Clear `active`
Copy-on-Write B+ Tree and MVCC
The B+ Tree’s CoW semantics interact naturally with MVCC. When a writer creates a new page for an insert, concurrent readers continue accessing the old tree structure through the old root pointer they loaded at query start. The old pages are freed only when the writer’s root swap is complete AND all readers that loaded the old root have finished.
Since Phase 7.4, old pages enter the deferred free queue instead of being returned to the freelist immediately. This allows concurrent readers to continue accessing old tree structures through their snapshot while the writer has already swapped the root. Pages are released for reuse only when no active reader snapshot predates the free operation.
Current Server Lock Model (Phase 7.4 / 7.5)
The server wraps Database in Arc<RwLock<Database>>:
- SELECT, SHOW, and system variable queries acquire a read lock (db.read()). Multiple readers execute concurrently with zero coordination.
- INSERT, UPDATE, DELETE, DDL, and BEGIN/COMMIT/ROLLBACK acquire a write lock (db.write()). Only one writer at a time.
- A read that already started keeps its snapshot while a writer commits.
- New mutating statements queue behind the write lock at whole-database granularity.
- Row-level locking is not implemented yet. That work starts in Phase 13.7.
The read-only executor path (execute_read_only_with_ctx) takes &dyn StorageEngine
(shared ref) and &TxnManager (shared ref), ensuring it cannot mutate any state.
Isolation Levels — Implementation
READ COMMITTED
On every statement start within a transaction, a new snapshot is taken. The
TransactionSnapshot passed to the analyzer and executor is refreshed per statement.
REPEATABLE READ
The snapshot is taken once at BEGIN and held for the entire transaction’s lifetime.
All statements use the same snapshot.
The default isolation level is REPEATABLE READ (matching MySQL's default). Autocommit single-statement queries behave identically under either level, since a per-statement snapshot and a per-transaction snapshot coincide when the transaction contains exactly one statement.
INSERT … SELECT — Snapshot Isolation
INSERT INTO target SELECT ... FROM source executes the SELECT under the same
snapshot that was fixed at BEGIN. This is critical for correctness:
The Halloween problem is a classic database bug where an INSERT ... SELECT
on the same table re-reads rows it just inserted, causing an infinite loop (the
database inserts rows, those rows qualify the SELECT condition, they get inserted
again, ad infinitum).
AxiomDB prevents this automatically through MVCC snapshot semantics:
- The snapshot is fixed at BEGIN: snapshot_id = max_committed + 1
- Rows inserted by this statement get txn_id_created = current_txn_id
- The MVCC visibility rule: a row is visible only if txn_id_created < snapshot_id
- Since current_txn_id ≥ snapshot_id, newly inserted rows are never visible to the SELECT scan within the same transaction
Before BEGIN: source = {row_A (xmin=1), row_B (xmin=2)}
Snapshot taken: snapshot_id = 3
INSERT INTO source SELECT * FROM source:
SELECT sees: row_A (1 < 3 ✅), row_B (2 < 3 ✅) → 2 rows
Inserts: row_C (xmin=3), row_D (xmin=3) → 3 ≮ 3 ❌ not re-read
SELECT stops: only 2 original rows were seen
After COMMIT: source = {row_A, row_B, row_C, row_D} ← exactly 4 rows
This also means rows inserted by a concurrent transaction that commits after
this transaction’s BEGIN are not seen by the SELECT — consistent snapshot
throughout the entire INSERT operation.
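The worked example above can be replayed as a tiny simulation. This is an illustrative sketch of the visibility rule only, not the engine's scan code:

```rust
// A row is visible to the statement iff it was created before the snapshot.
fn visible(xmin: u64, snapshot_id: u64) -> bool {
    xmin < snapshot_id
}

fn main() {
    let snapshot_id = 3;            // fixed at BEGIN: max_committed + 1
    let current_txn_id = 3;         // current_txn_id >= snapshot_id
    let mut source = vec![1u64, 2]; // xmin of row_A, row_B

    // INSERT INTO source SELECT * FROM source:
    // the scan sees only rows visible under the fixed snapshot.
    let seen: Vec<u64> = source.iter().copied()
        .filter(|&xmin| visible(xmin, snapshot_id))
        .collect();
    for _ in &seen {
        source.push(current_txn_id); // new rows get xmin = current_txn_id
    }

    assert_eq!(seen.len(), 2);   // only row_A and row_B were scanned
    assert_eq!(source.len(), 4); // exactly 4 rows after COMMIT
}
```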
MVCC on Secondary Indexes (Phase 7.3b)
Secondary indexes store (key, RecordId) pairs — they do not contain transaction
IDs or version information. Visibility is always determined at the heap via the row’s
txn_id_created / txn_id_deleted fields.
Lazy Index Deletion
When a row is DELETEd, non-unique secondary index entries are not removed. The
heap row is marked deleted (txn_id_deleted = T), and the index entry becomes a
“dead” entry. Readers filter dead entries via is_slot_visible() during index scans.
Unique, primary key, and FK auto-indexes still have their entries deleted immediately because the B-Tree enforces key uniqueness internally.
UPDATE and Dead Entries
When an UPDATE changes an indexed column:
- Unique/PK/FK indexes: old entry deleted, new entry inserted (immediate)
- Non-unique indexes: old entry left in place (lazy), new entry inserted
Both old and new entries coexist in the B-Tree. The old entry points to a heap row
whose values no longer match the index key; is_slot_visible() filters it out.
Heap-Aware Uniqueness
When inserting into a unique index, if the key already exists, AxiomDB checks heap
visibility before raising a UniqueViolation. If the existing entry points to a dead
row (deleted or uncommitted), the insert proceeds — dead entries don’t block re-use
of the same key value.
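The heap check can be sketched as a predicate over the pointed-to heap row (types are hypothetical; the engine actually routes this through is_slot_visible()):

```rust
// Hypothetical heap-row state for the uniqueness check.
struct HeapRow {
    txn_id_deleted: u64, // 0 = not deleted
    delete_committed: bool,
}

// Does an existing unique-index entry block a new insert of the same key?
fn blocks_insert(existing: &HeapRow) -> bool {
    // A dead entry (committed delete) does not block key re-use.
    !(existing.txn_id_deleted != 0 && existing.delete_committed)
}

fn main() {
    let live = HeapRow { txn_id_deleted: 0, delete_committed: false };
    let dead = HeapRow { txn_id_deleted: 7, delete_committed: true };
    assert!(blocks_insert(&live));  // UniqueViolation is raised
    assert!(!blocks_insert(&dead)); // insert proceeds, key is re-used
}
```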
HOT Optimization
If an UPDATE does not change any column that participates in any secondary index, all index maintenance is skipped for that row — no B-Tree reads or writes. This is inspired by PostgreSQL’s Heap-Only Tuple (HOT) optimization.
ROLLBACK Support
Every new index entry (from INSERT or UPDATE) is recorded as
UndoOp::UndoIndexInsert in the transaction’s undo log. On ROLLBACK, these entries
are physically removed from the B-Tree. Old entries (from lazy delete) were never
removed, so they’re naturally restored.
Vacuum
Dead index entries accumulate until vacuum removes them. A dead entry is one where
is_slot_visible(entry.rid, oldest_active_snapshot) returns false — the pointed-to
heap row is deleted and no active snapshot can see it.
VACUUM — Dead Row and Index Cleanup (Phase 7.11)
The VACUUM command physically removes dead rows and dead index entries:
VACUUM orders; -- vacuum a specific table
VACUUM; -- vacuum all tables
Heap Vacuum
For each page in the heap chain, VACUUM finds slots where txn_id_deleted != 0
and txn_id_deleted < oldest_safe_txn (the deletion is committed and no active
snapshot can see it). These slots are zeroed via mark_slot_dead(), making them
invisible to read_tuple() without even reading the RowHeader.
Under the current Arc<RwLock<Database>> architecture, oldest_safe_txn = max_committed + 1 — all committed deletions are safe because no reader holds an
older snapshot.
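The slot-selection predicate can be sketched directly from the description above (Slot is a stand-in type; the real code works on page slot directories):

```rust
// Sketch of the heap-vacuum predicate: a slot is reclaimable when its
// deletion is committed and older than every active snapshot.
struct Slot { txn_id_deleted: u64 }

fn is_reclaimable(slot: &Slot, oldest_safe_txn: u64) -> bool {
    slot.txn_id_deleted != 0 && slot.txn_id_deleted < oldest_safe_txn
}

fn main() {
    let oldest_safe_txn = 10; // max_committed + 1 under the RwLock model
    let slots = [
        Slot { txn_id_deleted: 0 },  // live row
        Slot { txn_id_deleted: 4 },  // dead, committed long ago
        Slot { txn_id_deleted: 12 }, // deleted by a newer txn
    ];
    let dead: Vec<usize> = (0..slots.len())
        .filter(|&i| is_reclaimable(&slots[i], oldest_safe_txn))
        .collect();
    assert_eq!(dead, vec![1]); // only slot 1 is zeroed via mark_slot_dead()
}
```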
Index Vacuum
For each catalog-visible B-Tree index, VACUUM performs a full B-Tree scan and checks heap visibility for each entry. Dead entries (pointing to vacuumed or deleted heap slots) are batch-deleted from the B-Tree. If bulk delete rotates the root, the updated root page ID is persisted back to the catalog in the same transaction.
Clustered Vacuum
Clustered tables now use a different purge path:
- descend to the leftmost clustered leaf, then walk next_leaf
- purge leaf cells whose txn_id_deleted != 0 && txn_id_deleted < oldest_safe_txn
- free any overflow chain owned by the purged row
- conditionally defragment clustered leaves with high freeblock waste
- scan clustered secondary indexes and delete only entries whose PK bookmark no longer resolves to a physically present clustered row
This last rule is important: clustered secondary cleanup uses physical existence after purge, not snapshot visibility. An uncommitted clustered delete is invisible to the writer snapshot, but it is not safe to purge.
What VACUUM Does Not Do (Yet)
- VACUUM FULL / table rewrite: heap pages still do slot-level cleanup only, while clustered pages only do local defragmentation; there is no full-table rewrite pass yet.
- Automatic triggering: VACUUM must be invoked manually via SQL. Autovacuum with threshold-based triggering is planned.
For comparison, PostgreSQL's VACUUM marks reclaimed line pointers LP_UNUSED and
updates a free space map (FSM) so the space can be reused. AxiomDB keeps the
heap path simpler: dead slots are zeroed but heap pages are not compacted or
tracked through an FSM yet. Clustered VACUUM now does local leaf defragmentation
instead, while full-table rewrite remains a separate enhancement.
Epoch-Based Page Reclamation (Phase 7.8)
When a writer performs Copy-on-Write on a B-Tree node, old pages are deferred
(not immediately freed) because a concurrent reader might still reference them.
The SnapshotRegistry tracks which snapshots are active across all connections:
pub struct SnapshotRegistry {
    slots: Vec<AtomicU64>, // slot[conn_id] = snapshot_id or 0
}
- Register: a connection sets its slot before executing a read query
- Unregister: the connection clears its slot after the query completes
- oldest_active(): returns the minimum non-zero slot, or u64::MAX if idle
On flush(), the storage layer calls release_deferred_frees(oldest_active())
to return only pages freed before the oldest active snapshot to the freelist.
Pages freed after the oldest snapshot remain queued until all readers advance.
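A minimal, runnable sketch of the registry described above (fixed slot count; the 1024-slot production array and any per-connection plumbing are omitted):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SnapshotRegistry {
    slots: Vec<AtomicU64>, // slot[conn_id] = snapshot_id or 0
}

impl SnapshotRegistry {
    fn new(n: usize) -> Self {
        Self { slots: (0..n).map(|_| AtomicU64::new(0)).collect() }
    }
    fn register(&self, conn_id: usize, snapshot_id: u64) {
        self.slots[conn_id].store(snapshot_id, Ordering::Release);
    }
    fn unregister(&self, conn_id: usize) {
        self.slots[conn_id].store(0, Ordering::Release);
    }
    // O(N) lock-free scan; u64::MAX means "no active readers".
    fn oldest_active(&self) -> u64 {
        self.slots.iter()
            .map(|s| s.load(Ordering::Acquire))
            .filter(|&v| v != 0)
            .min()
            .unwrap_or(u64::MAX)
    }
}

fn main() {
    let reg = SnapshotRegistry::new(4);
    assert_eq!(reg.oldest_active(), u64::MAX); // idle: everything reclaimable
    reg.register(0, 7);
    reg.register(2, 5);
    assert_eq!(reg.oldest_active(), 5); // pages freed before snapshot 5 can go
    reg.unregister(2);
    assert_eq!(reg.oldest_active(), 7);
}
```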
Comparable engines track an equivalent lowest_active_start via an active
transaction list; InnoDB uses clone_oldest_view() to merge all active ReadViews.
AxiomDB uses a fixed-size atomic slot array (1024 slots) — O(N) scan without
locking. Under the current RwLock model all slots are 0 during flush (the writer
has exclusive access), so the behavior is identical to the previous
u64::MAX sentinel. The infrastructure is forward-compatible with
future concurrent reader+writer models.
Clustered UPDATE In-Place Undo (Phase 39.22)
UndoClusteredFieldPatch
When fused_clustered_scan_patch applies zero-allocation in-place UPDATE
(writing only the changed field bytes directly into the page buffer), the WAL
undo log records a UndoClusteredFieldPatch entry instead of the full-row-image
UndoClusteredRestore:
UndoClusteredFieldPatch {
    table_id: u32,
    key: Vec<u8>,                  // PK bytes for leaf descent
    old_header: RowHeader,         // txn_id_created, row_version to restore
    field_deltas: Vec<FieldDelta>, // each carries [u8;8] old_bytes
}
On ROLLBACK, the handler:
- Looks up the clustered root via clustered_roots
- Descends to the owning leaf via clustered_tree::descend_to_leaf_pub
- Searches for the cell by PK key via clustered_leaf::search
- For each FieldDelta, computes field_abs_off = row_data_abs_off + delta.offset and calls patch_field_in_place with delta.old_bytes[..delta.size]
- Restores the RowHeader via update_row_header_in_place
This is O(fields_changed × 1) per row, vs O(row_size) for a full UndoClusteredRestore.
FieldDelta Inline Arrays
FieldDelta stores field bytes as fixed-size [u8;8] arrays (field values for
fixed-size types such as BIGINT/REAL are at most 8 bytes):
// Before Phase 39.22
pub struct FieldDelta { pub offset: u16, pub size: u8, pub old_bytes: Vec<u8>, pub new_bytes: Vec<u8> }

// After Phase 39.22
pub struct FieldDelta { pub offset: u16, pub size: u8, pub old_bytes: [u8;8], pub new_bytes: [u8;8] }
WAL serialization writes only size bytes, so the on-disk format is identical.
Recovery code reads back (u16, u8, &[u8]) tuples and copies them into [u8;8]
arrays — no heap allocation during recovery either.
UndoClusteredFieldPatch stores them as inline [u8;8] arrays in the undo log entry — no heap allocation per field per row. For UPDATE t SET score = score + 1 across 25K rows, this eliminates ~50K allocations vs the old Vec<u8>-per-delta approach.
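The size-truncated encoding can be sketched as a round trip. Field names mirror the struct above; the exact wire layout (little-endian offset, then size, then the truncated byte runs) is an assumption for illustration:

```rust
struct FieldDelta { offset: u16, size: u8, old_bytes: [u8; 8], new_bytes: [u8; 8] }

// Write only `size` bytes of each inline array, matching the old Vec<u8> format.
fn serialize(d: &FieldDelta, out: &mut Vec<u8>) {
    out.extend_from_slice(&d.offset.to_le_bytes());
    out.push(d.size);
    out.extend_from_slice(&d.old_bytes[..d.size as usize]);
    out.extend_from_slice(&d.new_bytes[..d.size as usize]);
}

// Read back into fixed arrays: no heap allocation during recovery.
fn deserialize(buf: &[u8]) -> FieldDelta {
    let offset = u16::from_le_bytes([buf[0], buf[1]]);
    let size = buf[2] as usize;
    let (mut old_bytes, mut new_bytes) = ([0u8; 8], [0u8; 8]);
    old_bytes[..size].copy_from_slice(&buf[3..3 + size]);
    new_bytes[..size].copy_from_slice(&buf[3 + size..3 + 2 * size]);
    FieldDelta { offset, size: size as u8, old_bytes, new_bytes }
}

fn main() {
    let d = FieldDelta {
        offset: 40, size: 4,
        old_bytes: [1, 2, 3, 4, 0, 0, 0, 0],
        new_bytes: [5, 6, 7, 8, 0, 0, 0, 0],
    };
    let mut wal = Vec::new();
    serialize(&d, &mut wal);
    assert_eq!(wal.len(), 2 + 1 + 4 + 4); // only `size` bytes per array
    let back = deserialize(&wal);
    assert_eq!(back.old_bytes, d.old_bytes);
    assert_eq!(back.new_bytes, d.new_bytes);
}
```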
⚠️ Planned: Serializable Snapshot Isolation (Phase 7)
SSI detects read-write dependencies between concurrent transactions and aborts transactions that form a dangerous cycle. The implementation follows the algorithm from Cahill et al. (2008):
- Each transaction tracks its rw-antidependencies (read sets and write sets).
- At commit time, if the dependency graph contains a dangerous cycle (two transactions where each reads something the other wrote), one transaction is aborted with 40001 serialization_failure.
SSI provides true serializability (the strongest isolation level) with overhead proportional to the number of concurrent transactions and conflicts, not to the total number of rows.
B+ Tree — Hybrid Write Model
AxiomDB’s indexing layer is a persistent B+ Tree implemented over the StorageEngine
trait. Every index — including primary key and unique constraint indexes — is one such
tree.
Write model (Phase 5)
The tree uses a hybrid write model that minimizes page I/O on the hot path while keeping structural operations (splits, merges, rotations) on the safe allocate-new path:
| Operation | Write path | Alloc/free |
|---|---|---|
| Insert, no leaf split | In-place: same leaf page ID | 0 alloc / 0 free |
| Insert, child split absorbed by non-full parent | In-place: same parent page ID | 0 alloc / 0 free for the parent |
| Insert, leaf or internal split | Structural: alloc 2 new pages, free 1 | 2 alloc / 1 free |
| Delete, leaf stays ≥ MIN_KEYS_LEAF | In-place: same leaf page ID | 0 alloc / 0 free |
| Delete, parent pointer unchanged after child delete | Skip parent rewrite entirely | 0 alloc / 0 free for the parent |
| Delete, leaf underflows → rebalance | Structural: alloc new leaf | 1 alloc / 1 free |
| Batch delete, sorted exact keys | Page-local merge delete + one parent normalization pass | 0 alloc / 0 free on non-underfull pages; structural only where underflow happens |
This is the Phase 5 model for a serialized single writer (&mut self). Phase 7 will
reintroduce the full Copy-on-Write path to reconcile with lock-free readers and epoch
reclamation.
The tree currently requires &mut self on all mutations. Lock-free readers and
epoch-based reclamation (Phase 7) will determine how much of the in-place model
can be retained under concurrent read traffic.
Batch delete (delete_many_in) — sorted single-pass
Phase 5.19 adds a second delete mode to the tree:
BTree::delete_many_in(storage, &root_pid, &sorted_keys)
The contract is deliberately narrow:
- the caller already knows the exact encoded keys to delete
- keys are already sorted ascending
- the tree does no predicate evaluation and no SQL-layer reasoning
The algorithm is page-local and ordered:
- Leaf pages: merge the leaf’s sorted key array with the sorted delete slice and write one compacted survivor image.
- Internal pages: partition the delete slice by child range, recurse once per affected child, then normalize the parent once.
- Root collapse: run once at the very end of the batch.
This avoids the old N × delete_in(...) pattern where every key started from
the root and independently decided whether to rewrite or rebalance the same
pages.
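The leaf-level step can be sketched as a two-pointer merge over the leaf's sorted keys and the sorted delete slice, emitting the compacted survivor image in one pass (illustrative only; the real code works on fixed-layout page arrays):

```rust
// One pass: keys not matched by the sorted delete slice survive.
fn merge_delete(leaf_keys: &[&[u8]], sorted_deletes: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut survivors = Vec::new();
    let mut d = 0;
    for key in leaf_keys {
        // advance past delete keys smaller than the current leaf key
        while d < sorted_deletes.len() && sorted_deletes[d] < *key { d += 1; }
        if d < sorted_deletes.len() && sorted_deletes[d] == *key {
            d += 1; // exact match: drop this key
        } else {
            survivors.push(key.to_vec());
        }
    }
    survivors
}

fn main() {
    let leaf = [&b"a"[..], &b"c"[..], &b"e"[..], &b"g"[..]];
    let deletes = [&b"c"[..], &b"g"[..]];
    let out = merge_delete(&leaf, &deletes);
    assert_eq!(out, vec![b"a".to_vec(), b"e".to_vec()]);
}
```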
Page Capacity — Deriving ORDER_INTERNAL and ORDER_LEAF
Both node types must fit within PAGE_BODY_SIZE = 16,320 bytes (16 KB minus the
64-byte header). Each key occupies at most MAX_KEY_LEN = 64 bytes (zero-padded
on disk).
Internal Node Capacity
An internal node with n separator keys has n + 1 child pointers.
Header: 1 (is_leaf) + 1 (_pad) + 2 (num_keys) + 4 (_pad) = 8 bytes
key_lens: n × 1 = n bytes
children: (n + 1) × 8 = 8n + 8 bytes
keys: n × 64 = 64n bytes
Total = 8 + n + (8n + 8) + 64n = 16 + 73n
Solving 16 + 73n ≤ 16,320:
73n ≤ 16,304
n ≤ 223.3
ORDER_INTERNAL = 223 (largest integer satisfying the constraint).
Total size: 16 + 73 × 223 = 16 + 16,279 = 16,295 bytes ≤ 16,320 ✓
Leaf Node Capacity
A leaf node with n entries stores n keys and n record IDs. A RecordId
is 10 bytes: page_id (u64, 8 bytes) + slot_id (u16, 2 bytes).
Header: 1 (is_leaf) + 1 (_pad) + 2 (num_keys) + 4 (_pad) + 8 (next_leaf) = 16 bytes
key_lens: n × 1 = n bytes
rids: n × 10 = 10n bytes
keys: n × 64 = 64n bytes
Total = 16 + n + 10n + 64n = 16 + 75n
Solving 16 + 75n ≤ 16,320:
75n ≤ 16,304
n ≤ 217.4
ORDER_LEAF = 217 (largest integer satisfying the constraint).
Total size: 16 + 75 × 217 = 16 + 16,275 = 16,291 bytes ≤ 16,320 ✓
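Both derivations collapse to one integer division each, since every entry costs a fixed number of bytes beyond the 16 fixed header/pointer bytes. A sketch as compile-time constants (function names are illustrative, not the crate's):

```rust
const PAGE_BODY_SIZE: usize = 16_320;
const MAX_KEY_LEN: usize = 64;

// Internal node: 16 fixed bytes (header + extra child pointer), then
// 1 key_len + 8 child + 64 key = 73 bytes per separator key.
const fn order_internal() -> usize {
    (PAGE_BODY_SIZE - 16) / (1 + 8 + MAX_KEY_LEN)
}

// Leaf node: 16-byte header, then 1 key_len + 10 rid + 64 key = 75 bytes per entry.
const fn order_leaf() -> usize {
    (PAGE_BODY_SIZE - 16) / (1 + 10 + MAX_KEY_LEN)
}

fn main() {
    assert_eq!(order_internal(), 223); // 16_304 / 73
    assert_eq!(order_leaf(), 217);     // 16_304 / 75
}
```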
On-Disk Page Layout
Both node types use #[repr(C)] structs with all-u8-array fields so that
bytemuck::Pod (zero-copy cast) is safe without any implicit padding. All
multi-byte fields are stored little-endian.
Internal Node (InternalNodePage)
Offset Size Field Description
──────── ────── ─────────── ─────────────────────────────────────────────
0 1 is_leaf always 0
1 1 _pad0 alignment
2 2 num_keys number of separator keys (u16 LE)
4 4 _pad1 alignment
8 223 key_lens actual byte length of each key (0 = empty slot)
231 1,792 children 224 × [u8;8] — child page IDs (u64 LE each)
2,023 14,272 keys 223 × [u8;64] — separator keys, zero-padded
──────── ────── ─────────── ──────────────────────────────
Total: 16,295 bytes ≤ PAGE_BODY_SIZE ✓
This fixed-layout page is still the format used by the current production
axiomdb-index::BTree. Phase 39 does not mutate this structure in place.
Instead, the clustered rewrite is introducing separate storage-layer page
primitives for clustered leaves and clustered internal nodes.
Clustered Internal Primitive (Phase 39.2)
The new clustered internal page lives in axiomdb-storage, not in the current
axiomdb-index tree code. It uses a slotted variable-size layout:
[ClusteredInternalHeader: 16B]
is_leaf = 0
num_cells
cell_content_start
freeblock_offset
leftmost_child
[CellPtr array]
[Free gap]
[Cells: right_child | key_len | key_bytes]
The important compatibility rule is semantic, not structural:
- separator keys stay sorted
- find_child_idx(search_key) still returns the first separator strictly greater than the search key
- logical child 0 comes from leftmost_child
- logical child i > 0 comes from separator cell i - 1
That lets the clustered storage rewrite preserve B-tree navigation behavior
without reusing the old fixed-size MAX_KEY_LEN = 64 layout.
Clustered Insert Controller (Phase 39.3)
Phase 39.3 does not retrofit the current axiomdb-index::BTree into a
generic tree over fixed and clustered pages. Instead, axiomdb-storage
contains a dedicated controller in clustered_tree.rs that proves the first
full write path for clustered pages while the SQL executor still uses the
classic heap + index engine.
Algorithm shape:
- insert(storage, root_opt, ...) bootstraps a clustered leaf root if needed.
- Recursive descent chooses child pointers from ClusteredInternal.
- Leaf inserts stay in-place when the physical clustered-row descriptor fits.
- Large logical rows use local-prefix + overflow-page descriptors instead of an inline-only reject path.
- Fragmented leaves/internal pages call defragment() once before split.
- Leaf splits rebuild left/right pages by cumulative cell footprint.
- Internal splits rebuild left/right separator sets and promote one separator.
- Root overflow creates a fresh ClusteredInternal root.
Unlike the old structural Copy-on-Write tree, clustered 39.3 keeps the old
page ID as the left half on split and allocates only the new right sibling.
That is a conscious storage-first choice for the current single-writer runtime,
not the final concurrency model.
Clustered Point Lookup Controller (Phase 39.4)
Phase 39.4 extends that dedicated clustered controller with exact point
lookup:
- descend internal pages by separator key
- search the target leaf by exact key
- reconstruct the full logical row from the local leaf payload plus any overflow-page tail
- filter the hit through RowHeader::is_visible(snapshot)
The important scope cut is semantic rather than structural: the controller can
read the current inline row version, but it cannot yet chase older versions
because clustered older-version reconstruction is still future work. 39.11
adds rollback/savepoint restore for clustered writes, but not undo-aware read
traversal for arbitrary snapshots.
That means the current lookup(...) contract is:
- visible hit → Some(ClusteredRow)
- key absent → None
- current inline version invisible → None
This is a deliberate intermediate contract for the storage rewrite, not the final SQL-visible clustered read semantics.
Clustered Range Scan Controller (Phase 39.5)
Phase 39.5 adds the first ordered scan controller on top of the clustered
pages:
- determine the first relevant leaf for the lower bound
- determine the first relevant slot inside that leaf
- reconstruct and yield visible logical rows in ascending primary-key order
- follow next_leaf across the leaf chain
- stop as soon as the upper bound is exceeded
The controller is intentionally separate from the old fixed-layout
axiomdb-index::RangeIter. The two trees now have different physical layouts
and different row payload semantics:
- classic B+ Tree leaf: (key, RecordId)
- clustered leaf: (key, RowHeader, total_row_len, local_prefix, overflow_ptr?)
So the right reuse point is the iterator shape, not the implementation.
The semantic boundary remains the same as in 39.4: the range iterator can
return or skip only the current inline version. Undo-aware older-version
reconstruction is still future work.
Clustered Update Controller (Phase 39.6)
Phase 39.6 adds the first mutation path that rewrites an existing clustered
row in place:
- descend to the owning leaf by exact primary key
- check visibility of the current inline version
- build the replacement inline header with a bumped row_version
- materialize either an inline or overflow-backed replacement descriptor
- rewrite the exact leaf cell without changing the key
- keep the row in the same leaf or fail explicitly
This controller is intentionally narrower than a full B-tree update:
- it does not move the row to another leaf
- it does not split or merge the tree
- it does not touch parent separators
- it does not maintain secondary indexes yet
That makes 39.6 a true clustered-storage step, not a disguised merge of
39.6, 39.7, 39.8, and 39.9.
The page-local rewrite itself has two modes:
- overwrite the existing cell directly when the replacement encoded payload fits the current cell budget
- rebuild the same leaf compactly when the row grows but still fits on that page
If neither is possible, the controller returns HeapPageFull.
Clustered Delete Controller (Phase 39.7)
Phase 39.7 adds the first logical delete path over clustered rows:
- descend to the owning leaf by exact primary key
- check visibility of the current inline version
- preserve the existing key, row payload, txn_id_created, and row_version
- stamp txn_id_deleted = delete_txn_id
- persist the same leaf page without structural tree change
This controller is intentionally narrower than a full B-tree delete:
- it does not remove the physical cell
- it does not merge or rebalance leaves
- it does not change parent separators
- it does not maintain secondary indexes yet
That makes 39.7 the logical-delete companion to 39.6, not a disguised
merge of 39.7, 39.8, 39.11, and 39.18.
Clustered Structural Controller (Phase 39.8)
Phase 39.8 adds the first controller that can structurally shrink and
rebalance the clustered tree:
- call update_in_place(...) as the fast path
- on HeapPageFull, load the visible current row
- physically delete the exact clustered cell through a private tree path
- propagate two signals upward:
  - underfull — the child now needs sibling redistribute/merge
  - min_changed — the child's minimum key changed and the parent separator must be repaired
- rebalance clustered leaf and internal siblings by encoded byte volume
- collapse an empty internal root
- reinsert the replacement row through the clustered insert controller
That makes 39.8 the structural companion to 39.6 and 39.7, not a
shortcut around later purge / undo / secondary-index phases.
Current 39.8 limits remain explicit:
- relocate-update still rewrites only the current inline version
- public delete still does not purge dead clustered cells
- parent separator repair does not yet split the parent if the repaired key itself overflows the page budget
Clustered Secondary Bookmarks (Phase 39.9)
Phase 39.9 adds a dedicated bookmark-bearing secondary-key layout in
axiomdb-sql::clustered_secondary.
Instead of treating the BTree payload RecordId as the row locator, the
clustered path now encodes the physical secondary key as:
secondary_logical_key ++ missing_primary_key_columns
That gives the future clustered executor the exact secondary -> primary key
bridge it needs:
- scan the secondary B-tree by logical key prefix
- decode the appended PK bookmark from the physical secondary key
- probe the clustered tree by that primary key
This subphase is intentionally narrower than full executor integration:
- heap-visible SQL still uses RecordId-based secondaries
- clustered bookmark scans are a dedicated path, not a replacement for the old planner/executor yet
- unique clustered secondaries check logical-key conflicts before insert, even though the physical key contains a PK suffix for stable row identity
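The secondary-to-primary bridge can be sketched as an encode/decode pair. The length-suffix trailer used here is an assumed encoding for illustration; the real scheme lives in axiomdb-sql::clustered_secondary:

```rust
// Physical secondary key = logical secondary key ++ PK bookmark,
// plus (assumed) a 2-byte trailer recording the logical-key length.
fn encode_bookmark_key(secondary: &[u8], pk_suffix: &[u8]) -> Vec<u8> {
    let mut k = Vec::with_capacity(secondary.len() + pk_suffix.len() + 2);
    k.extend_from_slice(secondary);
    k.extend_from_slice(pk_suffix);
    k.extend_from_slice(&(secondary.len() as u16).to_le_bytes());
    k
}

// Recover the appended PK bookmark from the physical key.
fn decode_pk_bookmark(physical: &[u8]) -> &[u8] {
    let n = physical.len();
    let sec_len = u16::from_le_bytes([physical[n - 2], physical[n - 1]]) as usize;
    &physical[sec_len..n - 2]
}

fn main() {
    let physical = encode_bookmark_key(b"alice", b"\x00\x00\x00\x2a");
    // 1. scan the secondary B-tree by logical key prefix
    assert!(physical.starts_with(b"alice"));
    // 2. decode the PK bookmark, then probe the clustered tree with it
    assert_eq!(decode_pk_bookmark(&physical), &b"\x00\x00\x00\x2a"[..]);
}
```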
Clustered Overflow Pages (Phase 39.10)
Phase 39.10 adds the first large-row storage layer for clustered leaves.
The physical clustered leaf cell is now:
[key_len: u16]
[total_row_len: u32]
[RowHeader: 24B]
[key bytes]
[local row prefix]
[overflow_first_page?: u64]
And the overflow-page chain is:
[next_overflow_page: u64]
[payload bytes]
Important invariant:
- split / merge / rebalance reason about the physical leaf footprint
- lookup / range reconstruct the logical row bytes only when returning rows
- the primary key and RowHeader never leave the clustered leaf page
This is still narrower than full large-value support:
- no generic TOAST/BLOB reference layer yet
- no compression yet
- 39.11 adds internal WAL / rollback for clustered row images
- 39.12 adds clustered crash recovery for those row images
- delete-mark keeps the overflow chain reachable until later purge
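Logical-row reconstruction from the cell layout above can be sketched as: local prefix from the leaf cell, then the payload of each overflow page in chain order. The u64::MAX terminator and the in-memory page map are assumptions for illustration:

```rust
use std::collections::HashMap;

struct OverflowPage {
    next: u64,        // next_overflow_page; u64::MAX = end of chain (assumed)
    payload: Vec<u8>,
}

// Rebuild the logical row bytes only when returning the row to the caller.
fn reconstruct_row(local_prefix: &[u8], first: u64,
                   pages: &HashMap<u64, OverflowPage>) -> Vec<u8> {
    let mut row = local_prefix.to_vec();
    let mut pid = first;
    while pid != u64::MAX {
        let page = &pages[&pid];
        row.extend_from_slice(&page.payload);
        pid = page.next;
    }
    row
}

fn main() {
    let mut pages = HashMap::new();
    pages.insert(7u64, OverflowPage { next: 9, payload: b"ter ".to_vec() });
    pages.insert(9u64, OverflowPage { next: u64::MAX, payload: b"tail".to_vec() });
    let row = reconstruct_row(b"ou", 7, &pages);
    assert_eq!(row, b"outer tail".to_vec());
}
```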
Clustered WAL and Recovery (Phases 39.11 / 39.12)
Phases 39.11 and 39.12 add the first clustered durability contract on top
of the new page formats:
- clustered inserts append EntryType::ClusteredInsert
- clustered delete-marks append EntryType::ClusteredDeleteMark
- clustered updates append EntryType::ClusteredUpdate
- each WAL key is the primary key, not a physical slot identifier
- each payload stores an exact ClusteredRowImage
- TxnManager tracks the latest clustered root per table_id
- rollback and crash recovery restore logical row state by primary key and exact row image
This controller is still intentionally narrower than a full topology-physical
recovery story. 39.12 closes clustered crash recovery by reusing the same
PK + row-image semantics, while exact root persistence beyond WAL
checkpoint/rotation remains future work.
Rollback therefore promises logical row restoration, not exact physical topology restoration. A relocate-update may leave a different split/merge shape after rollback as long as the old primary-key row is back.
SQL-Visible Clustered INSERT (Phase 39.14)
Phase 39.14 is the first point where the SQL executor writes into the
clustered tree instead of only the storage tests doing so.
The executor branch now does this:
- resolve the clustered table plus its logical primary index metadata
- derive PK bytes from that primary-index column order
- check for a visible existing PK through clustered lookup
- insert the new row through clustered_tree::insert(...), or reuse a snapshot-invisible delete-marked physical key through restore_exact_row_image(...)
- maintain non-primary indexes as secondary_key ++ pk_suffix bookmarks
- persist the final clustered table root and any changed secondary roots
Fresh clustered keys emit EntryType::ClusteredInsert. Reused delete-marked
physical keys emit EntryType::ClusteredUpdate so rollback can restore the old
tombstone image instead of merely deleting the newly-inserted row.
This is still narrower than final clustered SQL behavior:
- clustered UPDATE is now SQL-visible in 39.16
- clustered DELETE is now SQL-visible as logical delete-mark in 39.17
- clustered secondary covering reads still normalize back to clustered row fetches until a true clustered index-only optimization exists
- older-snapshot reconstruction after reusing a tombstoned PK still depends on later clustered version-chain work
Leaf Node (LeafNodePage)
Offset Size Field Description
──────── ────── ─────────── ─────────────────────────────────────────────
0 1 is_leaf always 1
1 1 _pad0 alignment
2 2 num_keys number of (key, rid) pairs (u16 LE)
4 4 _pad1 alignment
8 8 next_leaf page_id of the next leaf (u64 LE); u64::MAX = no next
16 217 key_lens actual byte length of each key
233 2,170 rids 217 × [u8;10] — RecordId (page_id:8 + slot_id:2)
2,403 13,888 keys 217 × [u8;64] — keys, zero-padded
──────── ────── ─────────── ──────────────────────────────
Total: 16,291 bytes ≤ PAGE_BODY_SIZE ✓
Copy-on-Write Root Swap
The root page ID is stored in an AtomicU64. Writers and readers interact with
it as follows.
Reader Path
// Acquire load: guaranteed to see all writes that happened before
// the Release store that set this root.
let root_id = self.root.load(Ordering::Acquire);
let root_page = storage.read_page(root_id)?;
// traverse down — no locks acquired
Writer Path
// 1. Load the current root
let old_root_id = self.root.load(Ordering::Acquire);

// 2. Walk from old_root down to the target leaf, collecting the path
let path = find_path(&storage, old_root_id, key)?;

// 3. For each node on the path (leaf first, then up to root):
//    a. alloc_page → new_page_id
//    b. copy content from old page
//    c. apply the mutation (insert key/split/rebalance)
//    d. update the parent's child pointer to new_page_id

// 4. The new root was written as a new page
let new_root_id = path[0].new_page_id;

// 5. Atomic swap — Release store: all prior writes visible to Acquire loads
self.root.store(new_root_id, Ordering::Release);

// 6. Free the old path pages (only safe after all readers have moved on)
for old_id in old_page_ids { storage.free_page(old_id)?; }
A reader that loaded old_root_id before the swap continues accessing old pages
safely — they are freed only after all reads complete (tracked in Phase 7 with
epoch-based reclamation).
Readers load the root with Acquire semantics and traverse the tree without acquiring any lock. A write in progress is invisible to readers until the Release store completes — at which point the entire new subtree is already consistent. This is what allows read throughput to scale linearly with core count.
Why next_leaf Is Not Used in Range Scans
The leaf node format includes a next_leaf pointer for a traditional linked-list
traversal across leaf nodes. However, this pointer is not used by RangeIter.
Reason: Under CoW, when a leaf is split or modified, a new page is created. The
previous leaf in the linked list still points to the old page (L_old), which has
already been freed. Keeping the linked list consistent under CoW would require copying
the previous leaf on every split — but finding the previous leaf during an insert
requires traversing from the root (the tree has no backward pointers).
Adopted solution: RangeIter re-traverses the tree from the root to find the
next leaf when crossing a leaf boundary. The cost is O(log n) per boundary crossing,
not O(1) as with a linked list. For a tree of 1 billion rows with ORDER_LEAF = 217,
the depth is log₂₁₇(10⁹) ≈ 4, so each boundary crossing is 4 page reads.
Measured cost for a range scan of 10,000 rows: 0.61 ms — well within the 45 ms budget.
Insert — CoW Split Protocol
1. Descend from root to the target leaf, recording the path.
2. If the leaf has room (num_keys < fill_threshold):
→ Copy the leaf to a new page.
→ Insert the new (key, rid) in sorted position.
→ Update the parent's child pointer (CoW the parent too).
→ Propagate CoW up to the root.
3. If the leaf is at or above the fill threshold:
→ Allocate two new leaf pages.
→ Distribute: left gets floor((ORDER_LEAF+1)/2) entries,
right gets the remaining entries.
→ The smallest key of the right leaf becomes the separator key
pushed up to the parent.
→ CoW the parent, insert the new separator and child pointer.
→ If the parent is also full, recursively split upward.
→ If the root splits, allocate a new root with two children.
The split point fill_threshold depends on the index fill factor (see below).
Internal pages always split at ORDER_INTERNAL regardless of fill factor.
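Step 3's split distribution can be checked with a one-line calculation, shown here as a sketch (constant and function names are illustrative):

```rust
const ORDER_LEAF: usize = 217;

// After a leaf split, the left page keeps floor((ORDER_LEAF + 1) / 2)
// entries and the right page takes the rest; the right page's smallest
// key becomes the separator pushed into the parent.
fn split_counts(total: usize) -> (usize, usize) {
    let left = (ORDER_LEAF + 1) / 2;
    (left, total - left)
}

fn main() {
    // a full leaf (217 keys) plus the incoming key = 218 entries
    let (left, right) = split_counts(ORDER_LEAF + 1);
    assert_eq!((left, right), (109, 109));
}
```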
Fill Factor — Adaptive Leaf Splits
The fill factor controls how full leaf pages are allowed to get before splitting.
It is set per-index via WITH (fillfactor=N) on CREATE INDEX and stored in
IndexDef.fillfactor: u8.
Formula
fill_threshold(order, ff) = ⌈order × ff / 100⌉ (integer ceiling division)
| fillfactor | fill_threshold (ORDER_LEAF = 217) | Effect |
|---|---|---|
| 100 (compact) | 217 | Splits only when completely full — max density, slowest inserts on busy pages |
| 90 (default) | 196 | Leaves ~10% free — balances density and insert speed |
| 70 (write-heavy) | 152 | Leaves ~30% free — fewer splits for append-heavy workloads |
| 10 (minimum) | 22 | Very sparse pages — extreme fragmentation, rarely useful |
A compile-time assert verifies that fill_threshold(ORDER_LEAF, 100) == ORDER_LEAF,
ensuring fillfactor=100 always preserves the original behavior exactly.
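The formula is plain integer ceiling division, sketched here (the crate's exact code may differ):

```rust
// fill_threshold(order, ff) = ⌈order × ff / 100⌉ via integer arithmetic.
fn fill_threshold(order: u32, fillfactor: u32) -> u32 {
    (order * fillfactor + 99) / 100
}

fn main() {
    assert_eq!(fill_threshold(217, 100), 217); // fillfactor=100 keeps full capacity
    assert_eq!(fill_threshold(217, 70), 152);  // ~30% free space on each leaf
    assert_eq!(fill_threshold(217, 10), 22);   // very sparse pages
}
```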
Internal page splits always occur at ORDER_INTERNAL. Only leaf splits benefit
from the extra free space, because inserts always land on leaf pages. Applying
fill factor to internal pages would reduce tree fan-out without any benefit for
typical insert patterns, matching PostgreSQL's implementation of the same concept.
Catalog field
IndexDef.fillfactor is serialized as a single byte appended after the predicate
section in the catalog heap entry. Pre-6.8 index rows are read with a default of 90
(backward-compatible). Valid range: 10–100; values outside this range are rejected
at CREATE INDEX parse time with a ParseError.
When to use a lower fill factor
- Append-heavy tables — rows inserted in bulk after the index is created. A fill factor of 70–80 prevents cascading splits during the bulk load.
- Write-heavy OLTP — high-frequency single-row inserts that land on the same hot pages. More free space means fewer page splits per second.
- Read-heavy / archival — use fillfactor=100. Maximum density reduces I/O for full scans at the cost of slower writes.
Minimum Occupancy Invariant
All nodes except the root must remain at least half full after any operation:
- Internal nodes: num_keys ≥ ORDER_INTERNAL / 2 = 111
- Leaf nodes: num_keys ≥ ORDER_LEAF / 2 = 108
Violations of this invariant during delete trigger rebalancing (redistribution from a sibling if possible, merge otherwise).
rotate_right key-shift invariant
When rotate_right borrows the last key of the left sibling and inserts it at
position 0 of the underfull child (internal node case), all existing keys in the
child must be shifted right by one position before inserting the new key.
The shift must cover positions 0..cn → 1..cn+1, implemented with a reverse
loop (same pattern as insert_at). Using slice::rotate_right(1) on [..cn]
is incorrect: it moves key[cn-1] to position 0 where it is immediately
overwritten, leaving position cn with stale data from a previous operation.
The stale byte can exceed MAX_KEY_LEN = 64, causing a bounds panic on the next
traversal of that node.
// Correct: explicit reverse loop
for i in (0..cn).rev() {
    child.key_lens[i + 1] = child.key_lens[i];
    child.keys[i + 1] = child.keys[i];
}
child.key_lens[0] = sep_len;
child.keys[0] = sep_key;
Prefix Compression — In-Memory Only
Internal node keys are often highly redundant. For a tree indexing sequential IDs,
consecutive separator keys share long common prefixes. AxiomDB implements
CompressedNode as an in-memory representation:
struct CompressedNode {
    prefix: Box<[u8]>,        // longest common prefix of all keys in this node
    suffixes: Vec<Box<[u8]>>, // remainder of each key after stripping the prefix
}
When an internal node page is read from disk, it is optionally decompressed into a
CompressedNode for faster binary search (searching on suffix bytes only). When the
node is written back, the full keys are reconstructed. This is a read optimization
only — the on-disk format always stores full keys.
The compression ratio depends on key structure. For an 8-byte integer key, there is no
prefix to compress. For a 64-byte composite key (category_id || product_name), the
category_id prefix is shared across many consecutive keys and is compressed away.
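The prefix/suffix split can be illustrated with a small helper. This is an illustrative sketch of the decomposition, not the engine's `CompressedNode` construction code:

```rust
/// Split a key set into its longest common prefix and per-key suffixes.
fn split_prefix(keys: &[&[u8]]) -> (Vec<u8>, Vec<Vec<u8>>) {
    if keys.is_empty() {
        return (Vec::new(), Vec::new());
    }
    let mut prefix = keys[0].to_vec();
    for k in &keys[1..] {
        // shrink the candidate prefix to the bytes shared with this key
        let common = prefix.iter().zip(k.iter()).take_while(|(a, b)| a == b).count();
        prefix.truncate(common);
    }
    let suffixes = keys.iter().map(|k| k[prefix.len()..].to_vec()).collect();
    (prefix, suffixes)
}
```

Binary search then compares only suffix bytes after a single prefix comparison per node visit.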
Tree Height and Fan-Out
| Rows | Tree depth | Notes |
|---|---|---|
| 1–217 | 1 (root = leaf) | Entire tree is one leaf page |
| 218–47,089 | 2 | One root internal + up to 218 leaves |
| 47K–10.2M | 3 | Two levels of internals |
| 10.2M–2.22B | 4 | Covers billion-row tables comfortably |
| >2.22B | 5 | Rare; still fast at O(log n) traversal |
A tree of 1 billion rows has depth 4 — a point lookup requires reading 4 pages (1 per level). At 16 KB pages, a warm cache point lookup is ~4 memory accesses with no disk I/O.
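The break points in the table correspond to an effective fan-out of about 217 keys per level (217² = 47,089; 217⁴ ≈ 2.22 B). A minimal sketch, assuming that uniform fan-out:

```rust
/// Smallest depth whose capacity (fanout^depth) covers `rows`.
fn tree_depth(rows: u64, fanout: u64) -> u32 {
    let mut depth = 1;
    let mut capacity = fanout; // depth 1: root is a single leaf
    while capacity < rows {
        capacity = capacity.saturating_mul(fanout);
        depth += 1;
    }
    depth
}
```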
Static API — Shared-Storage Operations (Phase 6.2)
BTree normally owns its Box<dyn StorageEngine>. This is convenient for tests but
prevents sharing one MmapStorage between the table heap and multiple indexes. Phase
6.2 adds static functions that accept an external &mut dyn StorageEngine:
// Point lookup — read-only, no ownership needed
BTree::lookup_in(storage: &dyn StorageEngine, root_pid: u64, key: &[u8])
    -> Result<Option<RecordId>, DbError>

// Insert — mutates storage, updates root_pid atomically on root split
BTree::insert_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, key: &[u8], rid: RecordId)
    -> Result<(), DbError>

// Delete — mutates storage, updates root_pid atomically on root collapse
BTree::delete_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, key: &[u8])
    -> Result<bool, DbError>

// Batch delete — removes many pre-sorted keys in one left-to-right pass (5.19)
BTree::delete_many_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, keys: &[Vec<u8>])
    -> Result<(), DbError>

// Range scan — collects all (RecordId, key_bytes) in [lo, hi] into a Vec
BTree::range_in(storage: &dyn StorageEngine, root_pid: u64, lo: Option<&[u8]>, hi: Option<&[u8]>)
    -> Result<Vec<(RecordId, Vec<u8>)>, DbError>
These delegate to the same private helpers as the owned API. The insert_in and
delete_in variants use AtomicU64::store(Release) instead of compare_exchange
(safe in Phase 6 — single writer).
Batch delete primitive (delete_many_in) — subphase 5.19
delete_many_in accepts a slice of pre-sorted encoded keys and removes all of them
from one index in a single left-to-right tree traversal. The caller is responsible
for sorting keys ascending before the call; the primitive enforces this as a
precondition.
Algorithm:
- batch_delete_subtree(root) dispatches on node type.
- Leaf node: binary-search the sorted keys against the leaf’s key array. Remove all matching slots in one pass, compact in-place, write the page once. If the leaf becomes underfull, signal the parent for merge/redistribute.
- Internal node: binary-partition the key slice by separator keys so each child subtree receives only the keys that fall within its range. Recurse into each child that has at least one key to remove. After all children return, rewrite the internal node once if any child pid or separator changed; skip the rewrite otherwise.
- After the recursive pass, root_pid is updated atomically once via AtomicU64::store(Release).
Invariants preserved:
- Tree height stays balanced (leaf depth is uniform after the pass).
- In-place fast path from 5.17 is reused: leaf and internal rewrites skip page alloc/free when the node fits in the same page.
- Root is persisted exactly once per delete_many_in call regardless of how many keys were removed.
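The binary-partition step at internal nodes maps naturally onto `slice::partition_point`. An illustrative sketch (the real code additionally pairs each partition with its child page id):

```rust
/// Split a sorted key batch at a separator: keys < sep go to the left
/// child, keys >= sep to the right. partition_point is a binary search,
/// so each internal node splits its batch in O(log n) comparisons per
/// separator.
fn partition_batch<'a>(keys: &'a [&'a [u8]], sep: &[u8]) -> (&'a [&'a [u8]], &'a [&'a [u8]]) {
    let idx = keys.partition_point(|k| *k < sep);
    keys.split_at(idx)
}
```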
range_in returns Vec<(RecordId, Vec<u8>)> rather than an iterator to avoid
lifetime conflicts between the borrow of storage needed to drive the iterator and the
caller's existing `&mut storage` borrow. The heap reads happen after the range scan
completes, which requires full ownership of the results.
Order-Preserving Key Encoding (Phase 6.1b)
Secondary index keys are encoded as byte slices in axiomdb-sql/src/key_encoding.rs
such that encode(a) < encode(b) iff a < b under SQL comparison semantics. Each
Value variant is prefixed with a 1-byte type tag:
| Type | Tag | Payload | Order property |
|---|---|---|---|
| NULL | 0x00 | none | Sorts before all non-NULL |
| Bool | 0x01 | 1 byte | false < true |
| Int(i32) | 0x02 | 8 BE bytes after n ^ i64::MIN | Negative < positive |
| BigInt(i64) | 0x03 | 8 BE bytes after n ^ i64::MIN | Negative < positive |
| Real(f64) | 0x04 | 8 bytes (NaN=0, pos=MSB set, neg=all flipped) | IEEE order |
| Decimal(i128, u8) | 0x05 | 1 (scale) + 16 BE bytes after sign-flip | |
| Date(i32) | 0x06 | 8 BE bytes after sign-flip | |
| Timestamp(i64) | 0x07 | 8 BE bytes after sign-flip | Older < newer |
| Text | 0x08 | NUL-terminated UTF-8, 0x00 escaped as [0xFF, 0x00] | Lexicographic |
| Bytes | 0x09 | NUL-terminated, same escape | Lexicographic |
| Uuid | 0x0A | 16 raw bytes | Lexicographic |
For composite keys the encodings are concatenated — the first column has the most significant sort influence.
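The integer and float mappings from the table can be checked in a few lines. This is a sketch of the scheme described above, not the key_encoding.rs source:

```rust
/// Map an i64 to bytes whose lexicographic order matches numeric order:
/// XOR with i64::MIN flips the sign bit, turning the signed range into an
/// ascending unsigned one; big-endian keeps the most significant byte first.
fn encode_i64(n: i64) -> [u8; 8] {
    ((n ^ i64::MIN) as u64).to_be_bytes()
}

/// f64: positive values get the sign bit set; negative values are fully
/// inverted so that more-negative sorts lower.
fn encode_f64(x: f64) -> [u8; 8] {
    let bits = x.to_bits();
    let mapped = if bits >> 63 == 0 { bits | (1 << 63) } else { !bits };
    mapped.to_be_bytes()
}
```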
NULL handling: NULL values are not inserted into secondary index B-Trees. This is
consistent with SQL semantics (NULL ≠ NULL) and avoids DuplicateKey errors when
multiple NULLs appear in a UNIQUE index. WHERE col = NULL always falls through to a
full scan.
Maximum key length: 768 bytes. DML that produces a longer key returns
DbError::IndexKeyTooLong; during CREATE INDEX, over-long keys are silently skipped.
Catalog System
The catalog is AxiomDB’s schema repository. It stores the definition of logical databases, tables, columns, indexes, constraints, foreign keys, and planner statistics, then makes that information available to the SQL analyzer and executor through a consistent, MVCC-aware reader interface.
Design Goals
- Self-describing: The catalog tables are themselves stored as regular heap pages. The engine needs no external schema file.
- Persistent: Catalog data survives crashes. The WAL treats catalog mutations like any other transaction.
- MVCC-visible: A DDL statement that creates a table is visible to subsequent statements in the same transaction but invisible to concurrent transactions until committed.
- Bootstrappable: An empty database file contains no catalog rows. The first
open()runs a special bootstrap path that allocates the catalog roots and inserts the default logical databaseaxiomdb.
System Tables
The catalog consists of eight logical heaps rooted from the meta page. User-facing introspection is documented in Catalog & Schema.
| Table | Meta offset | Contents |
|---|---|---|
| axiom_tables | 32 | One row per user-visible table |
| axiom_columns | 40 | One row per column, in declaration order |
| axiom_indexes | 48 | One row per index (includes partial index predicate since Phase 6.7) |
| axiom_constraints | 72 | Named CHECK constraints (Phase 4.22b) |
| axiom_foreign_keys | 84 | One row per FK constraint (Phase 6.5) |
| axiom_stats | 96 | Per-column NDV and row_count for planner (Phase 6.10) |
| axiom_databases | 104 | One row per logical database |
| axiom_table_databases | 112 | Optional table ownership binding by database |
Each root page is stored at the corresponding u64 body offset in the meta page
(page 0). Older database files may have 0 in the new database offsets; the
open path upgrades them lazily by allocating the roots and inserting
axiomdb.
A tempting shortcut would be to reuse schema_name inside TableDef to fake a
database namespace. Keeping database ownership in axiom_table_databases
preserves on-disk compatibility now and leaves room for a real CREATE SCHEMA
later, instead of collapsing two separate namespaces into one field.
axiom_databases row format (DatabaseDef)
[name_len: 1 byte u8]
[name: name_len UTF-8 bytes]
Fresh databases always contain:
axiomdb
axiom_table_databases row format (TableDatabaseDef)
[table_id: 4 bytes LE u32]
[name_len: 1 byte u8]
[database_name: name_len UTF-8 bytes]
Missing binding row means: this is a legacy table owned by axiomdb.
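A round-trip sketch of the TableDatabaseDef layout (hypothetical helper names; the real codec lives in the catalog module):

```rust
fn encode_table_db(table_id: u32, database_name: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(5 + database_name.len());
    buf.extend_from_slice(&table_id.to_le_bytes()); // [table_id: 4 bytes LE u32]
    buf.push(database_name.len() as u8);            // [name_len: 1 byte u8]
    buf.extend_from_slice(database_name.as_bytes());
    buf
}

fn decode_table_db(buf: &[u8]) -> (u32, String) {
    let table_id = u32::from_le_bytes(buf[0..4].try_into().unwrap());
    let name_len = buf[4] as usize;
    let name = String::from_utf8(buf[5..5 + name_len].to_vec()).unwrap();
    (table_id, name)
}
```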
axiom_stats row format (StatsDef)
[table_id: 4 bytes LE u32]
[col_idx: 2 bytes LE u16]
[row_count: 8 bytes LE u64] — visible rows at last ANALYZE / CREATE INDEX
[ndv: 8 bytes LE i64] — distinct non-NULL values (PostgreSQL stadistinct encoding)
ndv encoding (same as PostgreSQL stadistinct):
- > 0 → absolute count (e.g. 9999 unique emails)
- = 0 → unknown → planner uses DEFAULT_NUM_DISTINCT = 200
Stats root is lazily initialized at first write (ensure_stats_root). Pre-6.10
databases open without migration: list_stats returns empty vec when root = 0,
causing the planner to use the conservative default (always use index).
Stats are bootstrapped at CREATE INDEX time by reusing the table scan already
performed for B-Tree build — no extra I/O. ANALYZE TABLE refreshes them with
an exact full-table NDV count.
axiom_foreign_keys row format (FkDef)
[fk_id: 4 bytes LE u32]
[child_table_id: 4 bytes LE u32] — table with the FK column
[child_col_idx: 2 bytes LE u16] — FK column index in child table
[parent_table_id:4 bytes LE u32] — referenced (parent) table
[parent_col_idx: 2 bytes LE u16] — referenced column in parent table
[on_delete: 1 byte u8 ] — FkAction encoding, see below
[on_update: 1 byte u8 ] — same encoding
[fk_index_id: 4 bytes LE u32] — 0 = user-provided index (not auto-created)
[name_len: 4 bytes LE u32]
[name: name_len bytes UTF-8]
FkAction encoding: 0 = NoAction, 1 = Restrict, 2 = Cascade,
3 = SetNull, 4 = SetDefault.
fk_index_id = 0 means the FK column already had a user-provided index; the FK
did not auto-create one and will not drop one on DROP CONSTRAINT.
axiom_indexes — predicate extension (Phase 6.7)
The IndexDef binary format was extended in Phase 6.7 with a backward-compatible
predicate section appended after the columns:
[...existing fields...][ncols:1][col_idx:2, order:1]×ncols
[pred_len:2 LE][pred_sql: pred_len UTF-8 bytes] ← absent on pre-6.7 rows
pred_len = 0 (or section absent) → full index. Pre-6.7 databases open without
migration because from_bytes checks bytes.len() > consumed before reading
the predicate section.
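The backward-compatible read can be sketched like this (an assumed helper shape; the real from_bytes tracks its cursor differently):

```rust
/// Parse the optional predicate section appended after the column list.
/// Pre-6.7 rows simply end before it, so "no more bytes" means full index.
fn parse_predicate(bytes: &[u8], consumed: usize) -> Option<String> {
    if bytes.len() < consumed + 2 {
        return None; // pre-6.7 row: predicate section absent
    }
    let pred_len = u16::from_le_bytes([bytes[consumed], bytes[consumed + 1]]) as usize;
    if pred_len == 0 {
        return None; // section present but empty: full index
    }
    let sql = &bytes[consumed + 2..consumed + 2 + pred_len];
    String::from_utf8(sql.to_vec()).ok()
}
```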
CatalogBootstrap
CatalogBootstrap is a one-time procedure that runs when open() encounters an
empty database file (or a file with the meta page uninitialized).
Bootstrap Sequence
1. Allocate page 0 (Meta page).
Write format_version, zero for catalog_root_page, freelist_root_page, etc.
2. Allocate the freelist root page.
Initialize the bitmap (all pages allocated so far are marked used).
Write freelist_root_page into the meta page.
3. Allocate heap roots for catalog tables and aux heaps:
`axiom_tables`, `axiom_columns`, `axiom_indexes`, `axiom_constraints`,
`axiom_foreign_keys`, `axiom_stats`, `axiom_databases`, `axiom_table_databases`.
4. Insert the default database row `axiomdb` into `axiom_databases`.
5. Persist every root page id into the meta page.
6. Flush pages and WAL.
Fresh bootstrap uses txn_id = 0 for the default database row because no user
transaction exists yet. If a pre-22b.3a database is reopened, ensure_database_roots
upgrades it in-place and inserts axiomdb exactly once.
CatalogReader
CatalogReader provides read-only access to the catalog from any component that
needs schema information (primarily the SQL analyzer).
pub struct CatalogReader<'a> {
    storage: &'a dyn StorageEngine,
    snapshot: TransactionSnapshot,
}

impl<'a> CatalogReader<'a> {
    /// List all user tables visible to this snapshot.
    pub fn list_tables(&mut self, schema: &str) -> Result<Vec<TableDef>, DbError>;

    /// List all logical databases visible to this snapshot.
    pub fn list_databases(&mut self) -> Result<Vec<DatabaseDef>, DbError>;

    /// Find a specific table by schema + name.
    pub fn get_table(&mut self, schema: &str, name: &str) -> Result<Option<TableDef>, DbError>;

    /// Find a specific table by database + schema + name.
    pub fn get_table_in_database(
        &mut self,
        database: &str,
        schema: &str,
        name: &str,
    ) -> Result<Option<TableDef>, DbError>;

    /// List columns for a table in declaration order.
    pub fn list_columns(&mut self, table_id: u64) -> Result<Vec<ColumnDef>, DbError>;

    /// List indexes for a table.
    pub fn list_indexes(&mut self, table_id: u64) -> Result<Vec<IndexDef>, DbError>;
}
The snapshot parameter ensures catalog reads are MVCC-consistent. A DDL statement
that has not yet committed is invisible to other transactions’ CatalogReader.
Effective database resolution
Catalog lookup is now keyed by a three-part name:
(database, schema, table)
The resolver applies one legacy rule:
if no explicit table->database binding exists:
effective database = "axiomdb"
That rule is what lets old databases keep working without rewriting existing
TableDef rows.
Schema Types
pub struct TableDef {
    pub id: u32,
    pub root_page_id: u64,   // heap root or clustered-tree root
    pub storage_layout: TableStorageLayout,
    pub schema_name: String,
    pub table_name: String,
    pub schema_version: u64, // monotonic counter for plan cache invalidation (Phase 40.2)
}

pub enum TableStorageLayout {
    Heap = 0,
    Clustered = 1,
}

// On-disk format for axiom_tables rows (3 generations, all backward-compatible):
//
// v0 (legacy, no trailing bytes):
//   [table_id:4 LE][root_page_id:8 LE][schema_len:1][schema UTF-8][name_len:1][name UTF-8]
//   → storage_layout = Heap, schema_version = 1
//
// v1 (1 trailing byte):
//   ... [layout:1]
//   → layout from byte, schema_version = 1
//
// v2 (9 trailing bytes, current):
//   ... [layout:1][schema_version:8 LE]
//   → layout and schema_version from bytes
//
// `schema_version` is initialized to 1 at table creation. It is bumped by:
// CREATE INDEX, DROP INDEX, ALTER TABLE (any op), DROP TABLE, TRUNCATE TABLE.
// Plans whose deps include (table_id, old_version) detect staleness on next
// lookup without scanning the entire plan cache (Phase 40.2 OID invalidation).

pub struct ColumnDef {
    pub table_id: u64,
    pub col_index: usize,              // zero-based position within the table
    pub col_name: String,
    pub data_type: DataType,           // from axiomdb-core::types::DataType
    pub not_null: bool,
    pub default_value: Option<String>, // DEFAULT expression as source text
}

pub struct IndexDef {
    pub id: u64,
    pub table_id: u64,
    pub index_name: String,
    pub is_unique: bool,
    pub is_primary: bool,
    pub columns: Vec<String>, // indexed column names in key order
    pub root_page_id: u64,    // B+ Tree root, or clustered table root for PRIMARY KEY metadata
}
DDL Mutations Through the Catalog
When the executor processes CREATE TABLE, it:
- Opens a write transaction (or participates in the current one).
- Allocates a new TableId from the meta page sequence.
- Chooses the table layout:
  - no explicit PRIMARY KEY → Heap
  - explicit PRIMARY KEY → Clustered
- Allocates the primary row-store root page:
  - Heap → PageType::Data
  - Clustered → PageType::ClusteredLeaf
- Inserts a row into axiom_tables with {id, root_page_id, storage_layout, schema_name, table_name}.
- Inserts one row per column into axiom_columns.
- Persists index metadata:
  - clustered tables reuse table.root_page_id for the logical PRIMARY KEY index row
  - UNIQUE secondary indexes still allocate ordinary PageType::Index roots
- Appends all these mutations to the WAL.
- Commits (or defers the commit to the surrounding transaction).
The root_page_id stored in axiom_tables is now the single entry point for the
table’s primary row store. Heap DML still uses it as the heap-chain root today;
clustered INSERT / SELECT now use it as the clustered row-store root, while
heap-only executor paths still reject clustered UPDATE / DELETE instead of
touching the wrong page format.
Because the catalog is stored in heap pages and indexed like any other table, all
crash recovery mechanisms apply automatically: WAL replay will reconstruct the catalog
state after a crash in the middle of CREATE TABLE, just as it would reconstruct
any other table mutation.
Catalog Page Organization
Page 0: Meta page (format_version, catalog_root_page, freelist_root_page, ...)
Page 1: FreeList bitmap root
Pages 2–N: B+ Tree pages for axiom_tables
Pages N+1–M: Heap pages for axiom_tables row data
Pages M+1–P: B+ Tree pages for axiom_columns
...
Pages P+1–Q: User table data begins here
The exact page assignments depend on database growth. Page 0 always remains the meta page. All other page assignments are dynamic — the freelist tracks which pages are in use, and the meta page records the root page IDs for each catalog B+ Tree.
Catalog Invariants
The following invariants must hold at all times. The startup verifier in
axiomdb-sql::index_integrity now re-checks the index-related ones after WAL
recovery and before server or embedded mode starts serving traffic:
- Every table listed in axiom_tables has at least one row in axiom_columns.
- Every column in axiom_columns references a table_id that exists in axiom_tables.
- Every index in axiom_indexes references a table_id that exists in axiom_tables.
- Every non-clustered root_page_id in axiom_indexes points to a page of type Index.
- A clustered table’s PRIMARY KEY metadata row in axiom_indexes reuses the table root_page_id and therefore may point to ClusteredLeaf / ClusteredInternal.
- Every column listed in an index definition exists in the referenced table.
- No two tables in the same schema have the same name.
- No two indexes on the same table have the same name.
Startup index integrity verification
For every catalog-visible heap table:
- enumerate the expected entries from heap-visible rows
- enumerate the actual B+ Tree entries from root_page_id
- compare them exactly
- if the tree is readable but divergent, rebuild a fresh root from heap
- rotate the catalog root in a WAL-protected transaction
- defer free of the old tree pages until commit durability is confirmed
Clustered tables are skipped for now because their logical PRIMARY KEY metadata
no longer points at a classic B+ Tree root. If a heap-side tree cannot be
traversed safely, open fails with IndexIntegrityFailure. The database does
not enter a best-effort serving mode with an untrusted index.
As with REINDEX, AxiomDB rebuilds a readable but divergent index from heap rows
instead of trying to patch arbitrary leaf-level damage in place. This keeps recovery logic small
and makes the catalog root swap the only logical state transition.
Row Codec
The row codec converts between &[Value] (the in-memory representation used by
the executor) and &[u8] (the on-disk binary format stored in heap pages). The codec
is in axiomdb-types::codec.
Binary Format
┌──────────────────────────────────────────────────────────────────┐
│ null_bitmap: ceil(n_cols / 8) bytes │
│ bit i = (bitmap[i/8] >> (i%8)) & 1 == 1 → column i is NULL │
├──────────────────────────────────────────────────────────────────┤
│ For each non-NULL column, in column declaration order: │
│ Bool → 1 byte (0x00 = false, 0x01 = true) │
│ Int, Date → 4 bytes little-endian i32 │
│ BigInt, Real → 8 bytes little-endian i64 / f64 │
│ Timestamp → 8 bytes little-endian i64 (µs UTC) │
│ Decimal → 16 bytes little-endian i128 mantissa │
│ + 1 byte u8 scale │
│ Uuid → 16 bytes as-is (big-endian by convention) │
│ Text, Bytes → 3 bytes u24 LE length prefix │
│ + length bytes raw UTF-8 / raw bytes │
└──────────────────────────────────────────────────────────────────┘
NULL columns are indicated only in the null bitmap. No bytes are written for NULL values in the payload section. This means:
- A row with all columns NULL (just the null bitmap) encodes to ceil(n_cols/8) bytes.
- A row with no NULL columns encodes to ceil(n_cols/8) bytes (all-zero bitmap) plus the sum of each column’s fixed width or variable-length payload.
Column Type Sizes
| Value variant | SQL type | Encoded size |
|---|---|---|
| Bool | BOOL, BOOLEAN | 1 byte |
| Int | INT, INTEGER | 4 bytes |
| BigInt | BIGINT | 8 bytes |
| Real | REAL, DOUBLE | 8 bytes (f64, IEEE 754) |
| Decimal(m,s) | DECIMAL, NUMERIC | 17 bytes (16 i128 + 1 scale) |
| Uuid | UUID | 16 bytes |
| Date | DATE | 4 bytes (i32 days) |
| Timestamp | TIMESTAMP | 8 bytes (i64 µs UTC) |
| Text | TEXT, VARCHAR, CHAR | 3 + len bytes |
| Bytes | BYTEA, BLOB | 3 + len bytes |
Null Bitmap
The null bitmap occupies ceil(n_cols / 8) bytes at the start of every encoded row.
The bits are packed little-endian: bit 0 of byte 0 corresponds to column 0, bit 1 of
byte 0 to column 1, …, bit 0 of byte 1 to column 8, and so on.
n_cols = 5 → 1 byte (bits 5–7 are unused and always 0)
n_cols = 8 → 1 byte (all 8 bits used)
n_cols = 9 → 2 bytes (bit 0 of byte 1 = column 8)
n_cols = 64 → 8 bytes
n_cols = 65 → 9 bytes
Reading column i:
let bit = (bitmap[i / 8] >> (i % 8)) & 1;
let is_null = bit == 1;
Setting column i as NULL:
bitmap[i / 8] |= 1 << (i % 8);
This design saves 7 bytes per nullable column compared to wrapping each value in
Option<T> (which adds a full word of overhead in Rust’s memory layout).
Why u24 for Variable-Length Fields
The length prefix for Text and Bytes is 3 bytes (a u24 in little-endian). This
covers strings up to 16,777,215 bytes (~16 MB). The codec enforces this limit with
DbError::ValueTooLarge.
Why not u32 (4 bytes)?
The codec has two independent size limits:
- Codec limit (u24): Text/Bytes may not exceed 16,777,215 bytes per value.
- Storage limit (~16 KB): An encoded row must fit within
MAX_TUPLE_DATA, which is approximatelyPAGE_BODY_SIZE - RowHeader_size - SlotEntry_size.
In practice, a single row almost never approaches 16 MB (the codec limit). If it did, it would far exceed the storage limit and be rejected by the heap layer anyway. Using u24 saves 1 byte per string column — for a table with 10 text columns, every row is 10 bytes smaller. At 100 million rows, that is 1 GB of disk savings.
The u24 also signals that future TOAST (out-of-line storage for large values) will take over before values approach 16 MB — TOAST is planned for Phase 6.
Why i128 for DECIMAL
DECIMAL values are represented as (mantissa: i128, scale: u8). The actual value is
mantissa × 10^(-scale).
Decimal(123456789, 2) → 1,234,567.89
Decimal(-199, 2) → -1.99
Decimal(0, 0) → 0
i128 provides 38 significant decimal digits, which matches DECIMAL(38, s) — the
maximum precision supported by most SQL databases including PostgreSQL and SQL Server.
The alternative, rust_decimal::Decimal, packs the same i128 internally but adds
struct overhead and a dependency. The AxiomDB codec stores the i128 mantissa and
scale byte directly, with no intermediary struct.
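Rendering a (mantissa, scale) pair back to text is simple string surgery. An illustrative helper, not the engine's formatter:

```rust
fn decimal_to_string(mantissa: i128, scale: u8) -> String {
    let negative = mantissa < 0;
    let mut digits = mantissa.unsigned_abs().to_string();
    let scale = scale as usize;
    // pad so at least one integer digit remains left of the point
    while digits.len() <= scale {
        digits.insert(0, '0');
    }
    let point = digits.len() - scale;
    let body = if scale == 0 {
        digits
    } else {
        format!("{}.{}", &digits[..point], &digits[point..])
    };
    if negative { format!("-{}", body) } else { body }
}
```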
encoded_len — O(n) Without Allocation
encoded_len(values, types) computes the exact byte count that encode_row would
produce, without allocating a buffer.
pub fn encoded_len(values: &[Value], types: &[DataType]) -> usize {
    let bitmap_bytes = values.len().div_ceil(8);
    let payload: usize = values.iter().zip(types.iter())
        .filter(|(v, _)| !v.is_null())
        .map(|(v, dt)| fixed_size(dt) + variable_overhead(v))
        .sum();
    bitmap_bytes + payload
}
This is used by the heap insertion path to check whether the encoded row fits in the remaining free space on the target page — without actually encoding it first.
encode_row — Single Pass, No Intermediate Buffer
pub fn encode_row(values: &[Value], types: &[DataType]) -> Result<Vec<u8>, DbError>;
The encoder makes one pass over the columns:
- Writes the null bitmap (all zero initially).
- For each column: if the value is Value::Null, sets the corresponding bitmap bit; otherwise, type-checks the value against the declared type and appends the encoded bytes.
- Returns the complete Vec<u8>.
The type check step catches programmer errors early (e.g., passing Value::Text for a
column declared DataType::Int). It returns DbError::TypeMismatch rather than
writing corrupted bytes.
decode_row — Position-Tracking Cursor
pub fn decode_row(bytes: &[u8], types: &[DataType]) -> Result<Vec<Value>, DbError>;
The decoder walks bytes with a position cursor:
- Reads the null bitmap from the first ceil(n_cols/8) bytes.
- For each column in order:
  - If the corresponding bitmap bit is 1 → push Value::Null.
  - Otherwise, read the fixed or variable-length bytes for the declared type, construct the Value, advance the cursor.
- Returns Err(DbError::ParseError) if the buffer is shorter than expected (truncated row — indicates storage corruption).
Example — Encoding a Users Row
Schema: users(id BIGINT, name TEXT, age INT, email TEXT, active BOOL)
Values: [BigInt(42), Text("Alice"), Int(30), Null, Bool(true)]
Step 1: null_bitmap = ceil(5/8) = 1 byte
col 3 (email) is NULL → bit 3 of byte 0 → bitmap = 0b00001000 = 0x08
Step 2: encode non-NULL values:
col 0 (BigInt(42)) → 8 bytes: 2A 00 00 00 00 00 00 00
col 1 (Text("Alice")) → 3 bytes length: 05 00 00
+ 5 bytes payload: 41 6C 69 63 65
col 2 (Int(30)) → 4 bytes: 1E 00 00 00
col 3 (NULL) → 0 bytes (indicated by bitmap)
col 4 (Bool(true)) → 1 byte: 01
Final encoding (22 bytes total):
[08] [2A 00 00 00 00 00 00 00] [05 00 00] [41 6C 69 63 65] [1E 00 00 00] [01]
^ bigint 42 ^len=5 "Alice" int 30 true
bitmap: col 3 is NULL
encoded_len for this row would return 22 without allocating any buffer.
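The layout can be replayed byte for byte. A hand-rolled sketch of this one row (the real codec is the generic encode_row):

```rust
// Encode [BigInt(42), Text("Alice"), Int(30), Null, Bool(true)] by hand.
fn encode_users_row() -> Vec<u8> {
    let mut buf = vec![0b0000_1000u8];               // bitmap: col 3 (email) NULL
    buf.extend_from_slice(&42i64.to_le_bytes());     // BigInt(42), 8 bytes LE
    let name = "Alice";
    buf.extend_from_slice(&(name.len() as u32).to_le_bytes()[..3]); // u24 LE length
    buf.extend_from_slice(name.as_bytes());          // "Alice", 5 bytes
    buf.extend_from_slice(&30i32.to_le_bytes());     // Int(30), 4 bytes LE
    buf.push(0x01);                                  // Bool(true), 1 byte
    buf
}
```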
NaN Constraint
Value::Real(f64::NAN) is a valid Rust value but is forbidden by the codec.
encode_row returns DbError::InvalidValue when it encounters NaN.
This is enforced because:
- SQL semantics require
NaN <> NaNto be UNKNOWN, not FALSE. - Storing NaN in the database would make equality comparisons unpredictable.
- IEEE 754 defines NaN as not-a-number — it is a sentinel, not a data value.
Code that constructs Value::Real must ensure the f64 is not NaN before passing
it to the codec. The executor’s arithmetic operations must propagate NaN as NULL.
Type Coercion (axiomdb-types::coerce)
The axiomdb-types::coerce module implements implicit type conversion. It is
separate from the codec: the codec only serializes well-typed Values; coercion
happens before encoding, at expression evaluation and column assignment time.
Two entry points
coerce(value, target: DataType, mode: CoercionMode) -> Result<Value, DbError>
Used by the executor on INSERT and UPDATE to convert a supplied value to the declared column type. Examples:
- coerce(Text("42"), DataType::Int, Strict) → Ok(Int(42))
- coerce(Int(7), DataType::BigInt, Strict) → Ok(BigInt(7))
- coerce(Date(1), DataType::Timestamp, Strict) → Ok(Timestamp(86_400_000_000))
- coerce(Null, DataType::Int, Strict) → Ok(Null) — NULL always passes through
coerce_for_op(l, r) -> Result<(Value, Value), DbError>
Used by the expression evaluator in eval_binary to promote two operands to a
common type before arithmetic or comparison. Does not accept a
CoercionMode — operator widening is always deterministic and does not attempt
Text→numeric parsing.
- coerce_for_op(Int(5), Real(1.5)) → (Real(5.0), Real(1.5))
- coerce_for_op(Int(2), Decimal(314, 2)) → (Decimal(200, 2), Decimal(314, 2)) — the Int is scaled by 10^scale so it has the same unit as the Decimal mantissa
CoercionMode
pub enum CoercionMode {
    Strict,     // AxiomDB default — '42abc' → INT = error
    Permissive, // MySQL compat — '42abc' → INT = 42 (stops at first non-digit)
}
Complete conversion matrix
The full set of implicit conversions supported by coerce():
| From | To | Rule |
|---|---|---|
| Any | same type | Identity — returned unchanged |
NULL | any | Returns NULL |
Int(n) | BigInt | BigInt(n as i64) — lossless |
Int(n) | Real | Real(n as f64) — may lose precision for large values |
Int(n) | Decimal | Decimal(n, 0) — lossless |
BigInt(n) | Int | Range check: error if n ∉ [i32::MIN, i32::MAX] |
BigInt(n) | Real | Real(n as f64) |
BigInt(n) | Decimal | Decimal(n, 0) |
Text(s) | Int | Parse full string as integer (strict) or leading digits (permissive) |
Text(s) | BigInt | Same as Int but target is i64 |
Text(s) | Real | Parse as f64; NaN/Inf are always rejected |
Text(s) | Decimal | Parse as [-][int][.][frac]; scale = fraction digit count |
Date(d) | Timestamp | d * 86_400_000_000 µs — midnight UTC |
Bool(b) | Int/BigInt/Real | Permissive mode only: true→1, false→0 |
| everything else | DbError::InvalidCoercion (SQLSTATE 22018) |
Text → integer parsing rules in detail
Strict mode (AxiomDB default):
- Strip leading/trailing ASCII whitespace.
- Parse the entire remaining string as a decimal integer (optional leading -/+).
- Any non-digit character after the optional sign → InvalidCoercion.
- Overflow (value does not fit in target type) → InvalidCoercion.
Permissive mode (MySQL compat):
- Strip whitespace.
- Read optional sign.
- Consume as many leading ASCII digit characters as possible.
- If zero digits consumed → return 0 (e.g., "abc" → 0).
- Parse accumulated digits; overflow → InvalidCoercion (not silently clamped).
Date → Timestamp conversion
Date stores days since 1970-01-01 as i32. Timestamp stores microseconds
since 1970-01-01 UTC as i64.
Timestamp = Date × 86_400_000_000
= days × 86400 seconds/day × 1_000_000 µs/second
Day 0 = 1970-01-01T00:00:00Z = Timestamp 0. Negative days produce negative
Timestamps (dates before the Unix epoch). The multiplication uses checked_mul
— overflow is impossible for any plausible calendar date but is handled
defensively.
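The conversion is a single checked multiplication, per the formula above:

```rust
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Date (days since 1970-01-01) → Timestamp (µs since the epoch, UTC).
/// checked_mul returns None on overflow, matching the defensive handling
/// described in the text.
fn date_to_timestamp(days: i32) -> Option<i64> {
    (days as i64).checked_mul(MICROS_PER_DAY)
}
```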
Int → Decimal scale adoption in coerce_for_op
When coerce_for_op promotes an Int or BigInt to match a Decimal, it uses
the Decimal operand’s existing scale so that the result is expressed in the
same unit:
coerce_for_op(Int(5), Decimal(314, 2)):
factor = 10^2 = 100
Int(5) → Decimal(5 × 100, 2) = Decimal(500, 2)
→ (Decimal(500, 2), Decimal(314, 2))
eval_arithmetic(Add, Decimal(500, 2), Decimal(314, 2)):
→ Decimal(814, 2) = 8.14 ✓
Without scale adoption, 5 + 3.14 would compute Decimal(5 + 314, 2) = Decimal(319, 2) = 3.19 — wrong.
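The scaling step reduces to one multiplication, shown here as a standalone sketch (hypothetical helper name):

```rust
/// Promote an integer to a Decimal mantissa at the given scale so both
/// operands share the same unit before arithmetic.
fn int_to_decimal_mantissa(n: i64, scale: u8) -> i128 {
    (n as i128) * 10i128.pow(scale as u32)
}
```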
SQL Parser
The SQL parser lives in axiomdb-sql and is split into three stages:
lexer (string → tokens), parser (tokens → AST), and semantic analyzer
(AST → validated AST with resolved column indices). This page covers the lexer and
parser. The semantic analyzer is documented in Semantic Analyzer.
Why logos, Not nom
AxiomDB uses the logos crate to generate the lexer, rather than nom combinators
or hand-written code.
| Criterion | logos | nom |
|---|---|---|
| Compilation model | Compiles patterns to DFA at build time | Constructs parsers at runtime |
| Token scan cost | O(n), 1–3 instructions/byte | O(n), higher constant factor |
| Heap allocations | Zero (identifiers are &'src str) | Possible in combinators |
| Case-insensitive keys | ignore(ascii_case) attribute | Manual lowercasing pass needed |
| Error messages | Byte offsets built-in | Requires manual tracking |
Benchmark result: AxiomDB’s lexer achieves 9–17× higher throughput than
sqlparser-rs (which uses nom internally) for the same SQL inputs. The advantage
holds across simple SELECT, complex multi-join SELECT, and DDL statements.
sqlparser-rs is the SQL parser used by Apache Arrow DataFusion, Delta Lake, and InfluxDB. The advantage is structural: logos compiles all keyword patterns into a single Deterministic Finite Automaton at build time, so processing each character is one lookup in a pre-computed transition matrix — constant time per character with a very small constant. nom combinators perform dynamic dispatch and allocate intermediate results at each combinator step.
Lexer Design
Zero-Copy Tokens
Identifiers and quoted identifiers are represented as &'src str — slices into the
original SQL string. No heap allocation occurs during lexing for identifiers.
Only StringLit allocates a String, because escape sequence processing (\', \\,
\n) transforms the content in place and cannot be zero-copy.
#![allow(unused)]
fn main() {
pub struct SpannedToken<'src> {
pub token: Token<'src>,
pub span: Span, // byte offsets (start, end) in the original string
}
}
The lifetime 'src ensures that token slices cannot outlive the input string.
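The guarantee can be demonstrated in a minimal standalone sketch. The simplified Token, Span, and lex_ident below are illustrative stand-ins, not AxiomDB's actual definitions — the point is that the borrow checker ties every token to the SQL string it was lexed from:

```rust
// Hypothetical simplified types: tokens borrow from the SQL string,
// so the string must outlive them.
struct Span { start: usize, end: usize }

enum Token<'src> {
    Ident(&'src str), // zero-copy slice into the input
    Eof,
}

struct SpannedToken<'src> { token: Token<'src>, span: Span }

// Lex a single leading identifier without allocating.
fn lex_ident(sql: &str) -> SpannedToken<'_> {
    let end = sql.find(' ').unwrap_or(sql.len());
    SpannedToken { token: Token::Ident(&sql[..end]), span: Span { start: 0, end } }
}

fn main() {
    let sql = String::from("users WHERE id = 1");
    let tok = lex_ident(&sql);
    if let Token::Ident(name) = tok.token {
        assert_eq!(name, "users"); // points into `sql`, no heap copy
    }
    assert_eq!((tok.span.start, tok.span.end), (0, 5));
    // drop(sql); // would not compile while `tok` is still alive
}
```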
Token Enum
The Token<'src> enum has approximately 85 variants:
#![allow(unused)]
fn main() {
pub enum Token<'src> {
// DML keywords (case-insensitive)
Select, From, Where, Insert, Into, Values, Update, Set, Delete,
// DDL keywords
Create, Database, Databases, Table, Index, Drop, Alter, Add, Column, Constraint,
// Transaction keywords
Begin, Commit, Rollback, Savepoint, Release,
// Session / introspection
Use,
// Data types
Bool, Boolean, TinyInt, SmallInt, Int, Integer, BigInt, HugeInt,
Real, Float, Double, Decimal, Numeric, Char, VarChar, Text, Bytea, Blob,
Date, Time, Timestamp, Uuid, Json, Jsonb, Vector,
// Clause keywords
Join, Inner, Left, Right, Cross, On, Using,
Group, By, Having, Order, Asc, Desc, Nulls, First, Last,
Limit, Offset, Distinct, All,
// Constraint keywords
Primary, Key, Unique, Not, Null, Default, References, Check,
Auto, Increment, Serial, Bigserial, Foreign, Cascade, Restrict, NoAction,
// Logical operators
And, Or,
// Functions
Is, In, Between, Like, Ilike, Exists, Case, When, Then, Else, End,
Coalesce, NullIf,
// Identifier variants
Ident(&'src str), // unquoted identifier
QuotedIdent(&'src str), // backtick-quoted `identifier`
DqIdent(&'src str), // double-quote "identifier"
// Literals
IntLit(i64), FloatLit(f64), StringLit(String), HexLit(Vec<u8>),
TrueLit, FalseLit, NullLit,
// Punctuation
LParen, RParen, Comma, Semicolon, Dot, Star, Eq, Ne, Lt, Le, Gt, Ge,
Plus, Minus, Slash, Percent, Bang, BangEq, Arrow, FatArrow,
// Sentinel
Eof,
}
}
Keyword Priority Over Identifiers
logos resolves ambiguities by matching keywords before identifiers. The rule is:
longer matches take priority; if lengths are equal, keywords take priority over
Ident. This is expressed in logos as:
#![allow(unused)]
fn main() {
#[token("SELECT", ignore(ascii_case))]
Select,
#[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
Ident(&'src str),
}
SELECT, select, and Select all produce Token::Select, not Token::Ident.
A hypothetical column named select must be escaped: `select` or "select".
Comment Stripping
All three MySQL-compatible comment styles are skipped automatically:
-- single-line comment (SQL standard)
# single-line comment (MySQL extension)
/* block comment */
Fail-Fast Limits
tokenize(sql, max_bytes) checks the SQL length before scanning. If sql.len() > max_bytes,
it returns DbError::ParseError immediately without touching the DFA. This protects
against memory exhaustion from maliciously large queries.
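A minimal sketch of that guard, under the assumption of a simplified signature (the real tokenize returns logos tokens, not whitespace-split strings):

```rust
// Sketch of the pre-scan length guard. The token type is a stand-in.
#[derive(Debug)]
enum DbError {
    ParseError { message: String, position: Option<usize> },
}

fn tokenize(sql: &str, max_bytes: usize) -> Result<Vec<String>, DbError> {
    // Reject oversized input before the DFA ever runs.
    if sql.len() > max_bytes {
        return Err(DbError::ParseError {
            message: format!("query exceeds {max_bytes} bytes"),
            position: None,
        });
    }
    // Stand-in for the real logos scan.
    Ok(sql.split_whitespace().map(str::to_string).collect())
}

fn main() {
    assert!(tokenize("SELECT 1", 1024).is_ok());
    assert!(tokenize(&"x".repeat(2000), 1024).is_err());
}
```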
Parser Design
The parser is a hand-written recursive descent parser. It does not use any parser combinator library — the grammar is simple enough that combinators would add overhead without benefit.
Parser State
#![allow(unused)]
fn main() {
struct Parser<'src> {
tokens: Vec<SpannedToken<'src>>,
pos: usize,
}
impl<'src> Parser<'src> {
fn peek(&self) -> &Token<'src>; // current token, no advance
fn advance(&mut self) -> &Token<'src>; // consume and return current token
fn expect(&mut self, t: &Token) -> Result<(), DbError>; // consume or error
fn eat(&mut self, t: &Token) -> bool; // consume if matching, else false
}
}
Grammar — LL(1) for DDL, LL(2) for DML
Most DDL productions are LL(1): the first token uniquely determines the production. Some DML productions require one lookahead token:
- SELECT * FROM t vs SELECT a, b FROM t — the parser sees SELECT, then peeks at the next token to decide whether to parse * or a projection list.
- INSERT INTO t VALUES (...) vs INSERT INTO t SELECT ... — after consuming INTO t, a peek determines whether to parse a VALUES list or a sub-SELECT.
Expression Precedence
The expression sub-parser implements the standard precedence chain using separate functions for each precedence level. This is equivalent to a Pratt parser without the extra machinery:
parse_expr() (entry point — calls parse_or)
parse_or() OR
parse_and() AND
parse_not() unary NOT
parse_is_null() IS NULL / IS NOT NULL
parse_predicate() =, <>, !=, <, <=, >, >=, BETWEEN, LIKE, IN
parse_addition() + and -
parse_multiplication() *, /, %
parse_unary() unary minus -x
parse_atom() literal, column ref, function call, subexpr
Each level calls the next level to parse its right-hand side, naturally implementing left-to-right associativity and the correct precedence hierarchy.
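The pattern can be illustrated on a toy two-level grammar. This is a hypothetical simplified parser over single-digit arithmetic, not AxiomDB's actual code — it shows how each level loops on its own operators and delegates to the tighter-binding level:

```rust
// Toy two-level precedence parser: + and - over * and /.
struct P<'a> { toks: &'a [char], pos: usize }

impl<'a> P<'a> {
    fn peek(&self) -> Option<char> { self.toks.get(self.pos).copied() }

    // Lowest level shown: loops on + and -, delegates to multiplication.
    fn parse_addition(&mut self) -> i64 {
        let mut lhs = self.parse_multiplication();
        while matches!(self.peek(), Some('+') | Some('-')) {
            let op = self.toks[self.pos];
            self.pos += 1;
            let rhs = self.parse_multiplication();
            lhs = if op == '+' { lhs + rhs } else { lhs - rhs };
        }
        lhs
    }

    // Tighter level: loops on * and /, delegates to atoms.
    fn parse_multiplication(&mut self) -> i64 {
        let mut lhs = self.parse_atom();
        while matches!(self.peek(), Some('*') | Some('/')) {
            let op = self.toks[self.pos];
            self.pos += 1;
            let rhs = self.parse_atom();
            lhs = if op == '*' { lhs * rhs } else { lhs / rhs };
        }
        lhs
    }

    fn parse_atom(&mut self) -> i64 {
        let c = self.toks[self.pos];
        self.pos += 1;
        c.to_digit(10).unwrap() as i64
    }
}

fn main() {
    let toks: Vec<char> = "2+3*4".chars().collect();
    let mut p = P { toks: &toks, pos: 0 };
    assert_eq!(p.parse_addition(), 14); // * binds tighter than +

    let toks: Vec<char> = "8-3-2".chars().collect();
    let mut p = P { toks: &toks, pos: 0 };
    assert_eq!(p.parse_addition(), 3); // left-associative: (8-3)-2
}
```

The loop inside each level is what makes the grammar left-associative: the left operand accumulates while the right operand always comes from the next-tighter level.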
DDL Grammar Sketch
stmt → select_stmt | insert_stmt | update_stmt | delete_stmt
| create_database_stmt | drop_database_stmt | use_stmt
| create_table_stmt | create_index_stmt
| drop_table_stmt | drop_index_stmt
| alter_table_stmt | truncate_stmt
| show_tables_stmt | show_databases_stmt | show_columns_stmt
| begin_stmt | commit_stmt | rollback_stmt | savepoint_stmt
create_database_stmt →
CREATE DATABASE ident
drop_database_stmt →
DROP DATABASE [IF EXISTS] ident
use_stmt →
USE ident
create_table_stmt →
CREATE TABLE [IF NOT EXISTS] ident
LPAREN column_def_list [COMMA table_constraint_list] RPAREN
column_def →
ident type_name [column_constraint...]
column_constraint →
NOT NULL
| DEFAULT expr
| PRIMARY KEY
| UNIQUE
| AUTO_INCREMENT | SERIAL | BIGSERIAL
| REFERENCES ident LPAREN ident RPAREN [on_action] [on_action]
| CHECK LPAREN expr RPAREN
table_constraint →
PRIMARY KEY LPAREN ident_list RPAREN
| UNIQUE LPAREN ident_list RPAREN
| FOREIGN KEY LPAREN ident_list RPAREN REFERENCES ident LPAREN ident_list RPAREN
| CHECK LPAREN expr RPAREN
| CONSTRAINT ident (primary_key | unique | foreign_key | check)
truncate_stmt →
TRUNCATE TABLE ident
show_tables_stmt →
SHOW TABLES [FROM ident]
show_databases_stmt →
SHOW DATABASES
show_columns_stmt →
SHOW COLUMNS FROM ident
| DESCRIBE ident
| DESC ident
The grammar now includes CREATE/DROP DATABASE, USE, and
SHOW DATABASES, but it still rejects database.schema.table.
MySQL allows a database qualifier directly in table references; AxiomDB intentionally
deferred that grammar until the analyzer and executor can honor it end-to-end instead
of shipping a misleading parser-only approximation.
SHOW / DESCRIBE Parsing
SHOW is a dedicated keyword (Token::Show). After consuming it, the parser
peeks at the next token to dispatch:
parse_show():
consume Show
if peek = Databases:
advance
return Stmt::ShowDatabases(ShowDatabasesStmt)
  if peek = Ident("TABLES") | Ident("tables"):   // TABLES is not a reserved keyword
advance
schema = if eat(From): parse_ident() else "public"
return Stmt::ShowTables(ShowTablesStmt { schema })
if peek = Ident("COLUMNS") | Ident("columns"):
advance; expect(From); table = parse_ident()
return Stmt::ShowColumns(ShowColumnsStmt { table_name: table })
else:
return Err(ParseError { "expected TABLES, DATABASES, or COLUMNS after SHOW" })
DESCRIBE and DESC are both tokenized as Token::Describe (the lexer
aliases both spellings to the same token). The parser dispatches them directly
to the ShowColumns AST node:
parse_stmt():
...
Token::Describe => {
advance; table = parse_ident()
return Stmt::ShowColumns(ShowColumnsStmt { table_name: table })
}
...
COLUMNS is not a reserved keyword in AxiomDB — a column or table named
columns does not need quoting. The parser matches it by comparing the
identifier string after lowercasing, not by token variant.
TRUNCATE Parsing
TRUNCATE is tokenized as Token::Truncate. After consuming it, the parser
expects the literal keyword TABLE (also a reserved token) and then the table
name:
parse_truncate():
consume Truncate
expect(Table)
table_name = parse_ident()
return Stmt::Truncate(TruncateTableStmt { table_name })
SELECT Grammar Sketch
select_stmt →
SELECT [DISTINCT] select_list
FROM table_ref [join_clause...]
[WHERE expr]
[GROUP BY expr_list]
[HAVING expr]
[ORDER BY order_item_list]
[LIMIT int_lit [OFFSET int_lit]]
select_list → STAR | select_item (COMMA select_item)*
select_item → expr [AS ident]
table_ref → ident [AS ident]
join_clause →
[INNER | LEFT [OUTER] | RIGHT [OUTER] | CROSS]
JOIN table_ref join_condition
join_condition → ON expr | USING LPAREN ident_list RPAREN
order_item → expr [ASC | DESC] [NULLS (FIRST | LAST)]
Subquery Parsing
Subqueries are parsed at three different points in the expression grammar, each corresponding to a different syntactic form.
Scalar Subqueries — parse_atom
parse_atom is the lowest-precedence entry point for all atoms: literals, column
references, function calls, and parenthesised expressions. When parse_atom
encounters an LParen, it peeks at the next token. If it is Select, it parses
a full select_stmt recursively and wraps it in Expr::Subquery(Box<SelectStmt>).
Otherwise, it parses the contents as a grouped expression (expr).
parse_atom():
if peek = LParen:
if peek+1 = Select:
advance; stmt = parse_select_stmt(); expect(RParen)
return Expr::Subquery(stmt)
else:
advance; e = parse_expr(); expect(RParen)
return e
...
This means (SELECT MAX(id) FROM t) is valid anywhere an expression is valid:
SELECT list, WHERE, HAVING, ORDER BY, even nested inside function calls.
IN Subquery — parse_predicate
parse_predicate handles comparison operators and the IN / NOT IN forms.
After detecting the In or Not In tokens, the parser checks whether the next
token is LParen followed by Select. If so, it parses a subquery and produces
Expr::InSubquery { expr, subquery, negated }. If not, it falls through to the
normal IN (val1, val2, ...) list form.
parse_predicate():
lhs = parse_addition()
if peek = Not:
advance; expect(In); negated = true
else if peek = In:
advance; negated = false
else: return lhs // comparison ops handled here too
expect(LParen)
if peek = Select:
stmt = parse_select_stmt(); expect(RParen)
return Expr::InSubquery { expr: lhs, subquery: stmt, negated }
else:
values = parse_expr_list(); expect(RParen)
return Expr::InList { expr: lhs, values, negated }
EXISTS / NOT EXISTS — parse_not
parse_not handles unary NOT. When the parser sees Exists (or Not Exists),
it consumes the token, expects LParen, recursively parses a select_stmt, and
returns Expr::Exists { subquery, negated }. The result is always boolean — the
SELECT list contents are irrelevant at the execution level.
parse_not():
if peek = Not:
advance
if peek = Exists:
advance; expect(LParen); stmt = parse_select_stmt(); expect(RParen)
return Expr::Exists { subquery: stmt, negated: true }
else:
return Expr::Not(parse_is_null())
if peek = Exists:
advance; expect(LParen); stmt = parse_select_stmt(); expect(RParen)
return Expr::Exists { subquery: stmt, negated: false }
return parse_is_null()
Derived Tables — parse_table_ref
parse_table_ref parses the FROM clause. When it encounters LParen (without
a prior identifier), it recursively parses a select_stmt, expects RParen, and
then requires an AS alias clause (the alias is mandatory for derived tables):
parse_table_ref():
if peek = LParen:
advance; stmt = parse_select_stmt(); expect(RParen)
expect(As); alias = parse_ident()
return TableRef::Derived { subquery: stmt, alias }
else:
name = parse_ident(); alias = optional AS ident
return TableRef::Named { name, alias }
AST Nodes for Subqueries
#![allow(unused)]
fn main() {
pub enum Expr {
// A scalar subquery — returns one value (or NULL if no rows)
Subquery(Box<SelectStmt>),
// IN (SELECT ...) or NOT IN (SELECT ...)
InSubquery {
expr: Box<Expr>,
subquery: Box<SelectStmt>,
negated: bool,
},
// EXISTS (SELECT ...) or NOT EXISTS (SELECT ...)
Exists {
subquery: Box<SelectStmt>,
negated: bool,
},
// Outer column reference (used inside correlated subqueries)
OuterColumn {
col_idx: usize,
depth: u32, // 1 = immediate outer query
},
// ... other variants unchanged
}
pub enum TableRef {
Named { name: String, alias: Option<String> },
Derived { subquery: Box<SelectStmt>, alias: String },
}
}
Correlated Column Resolution — Semantic Analyzer
Correlated subqueries introduce Expr::OuterColumn during semantic analysis
(analyze()), not during parsing. The semantic analyzer maintains a stack of
BindContext frames, one per query level. When a column reference inside a
subquery cannot be resolved against the inner context, the analyzer walks up the
stack and resolves it against the outer context, replacing the Expr::Column
with Expr::OuterColumn { col_idx, depth: 1 }.
This means the parser always produces Expr::Column for every column reference,
regardless of nesting depth; OuterColumn only appears in the analyzed AST, never
in the raw parse output. This keeps the parser stateless and context-free, while
the semantic analyzer's BindContext stack resolves references with full schema
knowledge. It is the same split used by PostgreSQL's parser/analyzer boundary:
the parser builds a syntactic tree; the analyzer attaches semantic meaning
(column indices, correlated references, type information).
Output — The AST
The parser returns a Stmt enum. After parsing, all Expr::Column nodes have
col_idx = 0 as a placeholder. The semantic analyzer fills in the correct indices.
#![allow(unused)]
fn main() {
pub enum Stmt {
Select(SelectStmt),
Insert(InsertStmt),
Update(UpdateStmt),
Delete(DeleteStmt),
CreateTable(CreateTableStmt),
CreateIndex(CreateIndexStmt),
DropTable(DropTableStmt),
DropIndex(DropIndexStmt),
AlterTable(AlterTableStmt),
Truncate(TruncateTableStmt),
Begin, Commit, Rollback,
Savepoint(String),
ReleaseSavepoint(String),
RollbackToSavepoint(String),
ShowTables(ShowTablesStmt),
ShowColumns(ShowColumnsStmt),
}
}
Scalar Function Evaluator (eval/)
The expression evaluator now lives under crates/axiomdb-sql/src/eval/, rooted
at eval/mod.rs. The facade keeps the same exported surface (eval,
eval_with, eval_in_session, eval_with_in_session, is_truthy,
like_match, CollationGuard, SubqueryRunner), but the implementation is
split by responsibility:
- context.rs — thread-local session collation, CollationGuard, and SubqueryRunner
- core.rs — recursive Expr evaluation, CASE dispatch, and subquery-aware paths
- ops.rs — boolean logic, comparisons, IN, LIKE, and truthiness helpers
- functions/ — built-ins grouped by family (system, nulls, numeric, string, datetime, binary, uuid)
Built-in function dispatch still happens by lowercased name inside
functions/mod.rs. The registry remains a single match arm: no hash map and
no dynamic dispatch.
Date / Time Functions (4.19d)
Four internal helpers drive the MySQL-compatible date functions:
#![allow(unused)]
fn main() {
// Converts Value::Timestamp(micros_since_epoch) to NaiveDateTime.
// Uses Euclidean division for correct sub-second handling of pre-epoch timestamps.
fn micros_to_ndt(micros: i64) -> NaiveDateTime
// Converts Value::Date(days_since_epoch) to NaiveDate.
fn days_to_ndate(days: i32) -> NaiveDate
// Formats NaiveDateTime using MySQL-style format specifiers.
// Maps specifiers manually — NOT via chrono's format strings — to guarantee
// exact MySQL semantics (e.g. chrono's %m has different behavior).
fn date_format_str(ndt: NaiveDateTime, fmt: &str) -> String
// Parses a string into NaiveDateTime + a has_time flag.
// Returns None on any failure (caller maps to Value::Null).
fn str_to_date_inner(s: &str, fmt: &str) -> Option<(NaiveDateTime, bool)>
}
DATE_FORMAT arm — evaluates both args, dispatches ts on type:
ts: Timestamp(micros) → micros_to_ndt → NaiveDateTime
ts: Date(days) → days_to_ndate → NaiveDate.and_time(MIN) → NaiveDateTime
ts: Text(s) → try "%Y-%m-%d %H:%i:%s" then "%Y-%m-%d" via str_to_date_inner
ts: NULL → return NULL immediately
STR_TO_DATE arm — calls str_to_date_inner and converts back to a Value:
has_time = true → Value::Timestamp((ndt - epoch).num_microseconds())
has_time = false → Value::Date((ndt.date() - epoch).num_days() as i32)
failure → Value::Null
The epoch used for both conversions is always NaiveDate(1970-01-01) 00:00:00
constructed with from_ymd_opt(1970,1,1).unwrap().and_hms_opt(0,0,0).unwrap().
This avoids any DateTime<Utc> and is stable across all chrono 0.4.x versions.
str_to_date_inner processes the format string character by character:
- Literal characters: must match verbatim in the input (returns None on mismatch).
- %Y: consume exactly 4 digits.
- %y: consume 1–2 digits; apply the MySQL 2-digit rule (< 70 → +2000, else +1900).
- %m, %c, %d, %e, %H, %h, %i, %s/%S: consume 1–2 digits.
- Unknown specifier: skip one character in the input string.
- After parsing: validate with NaiveDate::from_ymd_opt + NaiveTime::from_hms_opt (catches invalid dates such as Feb 30).
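The two-digit year rule can be isolated as a standalone helper (hypothetical function name, implementing exactly the rule stated above):

```rust
// MySQL %y rule: years < 70 map to 20xx, years >= 70 map to 19xx.
fn expand_two_digit_year(yy: u32) -> u32 {
    if yy < 70 { 2000 + yy } else { 1900 + yy }
}

fn main() {
    assert_eq!(expand_two_digit_year(24), 2024);
    assert_eq!(expand_two_digit_year(69), 2069); // last year on the 2000 side
    assert_eq!(expand_two_digit_year(70), 1970); // first year on the 1900 side
    assert_eq!(expand_two_digit_year(99), 1999);
}
```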
take_digits(s, max) — helper used by the parser:
#![allow(unused)]
fn main() {
fn take_digits(s: &str, max: usize) -> Option<(u32, &str)> {
let n = s.bytes().take(max).take_while(|b| b.is_ascii_digit()).count();
if n == 0 { return None; }
let val: u32 = s[..n].parse().ok()?;
Some((val, &s[n..]))
}
}
Uses byte positions (safe for all ASCII date strings) and avoids allocations.
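The helper can be exercised standalone to confirm the max-digit clamp and the remainder slice it hands back to the next specifier:

```rust
// take_digits as shown above, with a small driver for the edge cases.
fn take_digits(s: &str, max: usize) -> Option<(u32, &str)> {
    let n = s.bytes().take(max).take_while(|b| b.is_ascii_digit()).count();
    if n == 0 { return None; }
    let val: u32 = s[..n].parse().ok()?;
    Some((val, &s[n..]))
}

fn main() {
    // %Y: up to 4 digits, remainder returned for the next specifier.
    assert_eq!(take_digits("2024-01-15", 4), Some((2024, "-01-15")));
    // %m with a single digit: stops at the first non-digit.
    assert_eq!(take_digits("7-04", 2), Some((7, "-04")));
    // No leading digit at all → None (caller maps this to parse failure).
    assert_eq!(take_digits("-04", 2), None);
}
```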
GROUP_CONCAT Parsing
GROUP_CONCAT cannot be represented as a plain Expr::Function { args: Vec<Expr> } because
its interior grammar — [DISTINCT] expr [ORDER BY ...] [SEPARATOR 'str'] — is not a
standard argument list. It gets its own AST variant and a dedicated parser branch.
The Expr::GroupConcat Variant
#![allow(unused)]
fn main() {
pub enum Expr {
// ...
GroupConcat {
expr: Box<Expr>,
distinct: bool,
order_by: Vec<(Expr, SortOrder)>,
separator: String, // defaults to ","
},
}
}
The variant stores the sub-expression to concatenate, the deduplication flag, an ordered
list of (sort_key_expr, direction) pairs, and the separator string.
Token::Separator — Disambiguating the Keyword
SEPARATOR is not a reserved word in standard SQL, so the lexer could produce either
Token::Ident("SEPARATOR") or a dedicated Token::Separator. AxiomDB uses the
dedicated token so that the ORDER BY loop inside parse_group_concat can stop cleanly:
#![allow(unused)]
fn main() {
// In the ORDER BY loop — stop if we see SEPARATOR or closing paren
if matches!(p.peek(), Token::Separator | Token::RParen) {
break;
}
}
Without the dedicated token, the parser would need to look ahead through a comma and an identifier to decide whether the comma ends the ORDER BY clause or separates two sort keys.
parse_group_concat — The Parser Branch
Invoked when parse_ident_or_call encounters group_concat (case-insensitive):
parse_group_concat:
consume '('
if DISTINCT: set distinct=true, advance
parse_expr() → sub-expression
if ORDER BY:
loop:
parse_expr() → sort key
optional ASC|DESC → direction
if peek == SEPARATOR or RParen: break
else: consume ','
if SEPARATOR:
consume SEPARATOR
consume StringLit(s) → separator string
consume ')'
return Expr::GroupConcat { expr, distinct, order_by, separator }
string_agg — PostgreSQL Alias
string_agg(expr, separator_literal) is parsed in the same branch with simplified
logic: two arguments separated by a comma, the second being a string literal that
becomes the separator field. distinct is false and order_by is empty.
-- These are equivalent:
SELECT GROUP_CONCAT(name SEPARATOR ', ') FROM t;
SELECT string_agg(name, ', ') FROM t;
Aggregate Execution in the Executor
At execution time, Expr::GroupConcat is handled by an AggAccumulator::GroupConcat
variant. Each row accumulates (value_string, sort_key_values). At finalize:
- Sort by the order_by key vector using compare_values_null_last — a type-aware comparator that sorts integers numerically and text lexicographically.
- If DISTINCT: deduplicate by value string.
- Join with the separator; truncate at 1 MB.
- Return Value::Null if no non-NULL values were accumulated.
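The finalize sequence can be sketched in simplified form. This uses a single i64 sort key instead of a key vector and a naive linear-scan dedupe — the names and shapes are assumptions, not the real accumulator:

```rust
// Simplified finalize: sort by key, optional dedupe, join, cap, NULL on empty.
fn finalize_group_concat(
    mut rows: Vec<(String, i64)>, // (value_string, sort_key)
    distinct: bool,
    separator: &str,
) -> Option<String> {
    if rows.is_empty() {
        return None; // maps to Value::Null
    }
    rows.sort_by_key(|&(_, k)| k); // step 1: sort by the ORDER BY key
    let mut out: Vec<String> = Vec::new();
    for (v, _) in rows {
        if !distinct || !out.contains(&v) { // step 2: optional dedupe
            out.push(v);
        }
    }
    let mut s = out.join(separator); // step 3: join with the separator
    if s.len() > 1 << 20 {
        s.truncate(1 << 20); // byte cap; real code must respect char boundaries
    }
    Some(s)
}

fn main() {
    let rows = vec![("b".to_string(), 2), ("a".to_string(), 1), ("b".to_string(), 3)];
    assert_eq!(finalize_group_concat(rows.clone(), false, ","), Some("a,b,b".into()));
    assert_eq!(finalize_group_concat(rows, true, ","), Some("a,b".into()));
    assert_eq!(finalize_group_concat(vec![], false, ","), None);
}
```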
GROUP_CONCAT syntax is structurally different from a regular function call:
it embeds its own ORDER BY and uses a keyword (SEPARATOR) as a
positional argument delimiter. Forcing it into Expr::Function { args } would
require post-parse AST surgery to extract the separator and ORDER BY. A dedicated variant
keeps parsing and execution logic clean and makes semantic analysis and partial-index rejection straightforward.
Error Reporting
ParseError — structured position field
Parse errors carry a dedicated position field (0-based byte offset of the unexpected token):
#![allow(unused)]
fn main() {
DbError::ParseError {
message: "SQL syntax error: unexpected token 'FORM'".to_string(),
position: Some(9), // byte 9 in "SELECT * FORM t"
}
}
The position field is populated from SpannedToken::span.start at every error site in the parser.
Non-parser code that constructs ParseError (e.g. codec validation, catalog checks) sets position: None.
Visual snippet in MySQL ERR packets
When the MySQL handler sends an ERR packet for a parse error, it builds a 2-line visual snippet:
You have an error in your SQL syntax: unexpected token 'FORM'
SELECT * FORM t
^
The snippet is generated by build_error_snippet(sql, pos) in mysql/error.rs:
- Find the line containing pos (line_start = last \n before pos, line_end = next \n).
- Clamp the line to 120 characters to avoid overwhelming terminal output.
- Compute col = pos - line_start and emit " ".repeat(col) + "^" on the second line.
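Those three steps can be sketched as follows. The function name comes from the text, but the body is an assumption (and this version clamps by characters rather than bytes):

```rust
// Sketch of the snippet builder following the three steps above.
fn build_error_snippet(sql: &str, pos: usize) -> String {
    // Step 1: find the line containing `pos`.
    let line_start = sql[..pos].rfind('\n').map_or(0, |i| i + 1);
    let line_end = sql[pos..].find('\n').map_or(sql.len(), |i| pos + i);
    // Step 2: clamp the line to 120 characters.
    let line: String = sql[line_start..line_end].chars().take(120).collect();
    // Step 3: caret under the offending column.
    let col = pos - line_start;
    format!("{}\n{}^", line, " ".repeat(col))
}

fn main() {
    // Byte 9 of "SELECT * FORM t" is the 'F' of the typo.
    let snippet = build_error_snippet("SELECT * FORM t", 9);
    assert_eq!(snippet, format!("SELECT * FORM t\n{}^", " ".repeat(9)));
}
```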
The snippet is appended only when sql is available (COM_QUERY path). Prepared statement
execution errors (COM_STMT_EXECUTE) receive only the plain message.
JSON error format
When error_format = 'json' is active on the connection, the MySQL ERR packet message is
replaced with a JSON string carrying the full ErrorResponse:
{"code":1064,"sqlstate":"42601","severity":"ERROR","message":"SQL syntax error: unexpected token 'FORM'","position":9}
The JSON is built by build_json_error(e, sql) in mysql/json_error.rs. It uses the
ErrorResponse::from_error(e) struct for clean, snippet-free fields (the visual snippet is
text-protocol-only). The JsonErrorPayload struct lives in axiomdb-network to avoid
adding serde as a dependency to axiomdb-core.
axiomdb-core defines DbError and ErrorResponse with no
serde dependency. The JSON payload is assembled in axiomdb-network using
a private #[derive(Serialize)] JsonErrorPayload struct. This keeps the core crate
free of serialization complexity and means error types never accidentally get serialized
somewhere they shouldn't.
Lexer errors (invalid characters, unterminated string literals) include the byte span
of the problematic token via the same position field.
Performance Numbers
Measured on Apple M2 Pro, single-threaded, 1 million iterations each:
| Query | Throughput (logos lexer + parser) |
|---|---|
| SELECT * FROM t | 492 ns / query → 2.0M queries/s |
| SELECT a, b, c FROM t WHERE id = 1 | 890 ns / query → 1.1M queries/s |
| Complex SELECT (3 JOINs, subquery) | 2.7 µs / query → 370K queries/s |
| CREATE TABLE (10 columns) | 1.1 µs / query → 910K queries/s |
| INSERT ... VALUES (...) (5 values) | 680 ns / query → 1.5M queries/s |
These numbers represent parse throughput only — before semantic analysis or execution. At 2 million simple queries per second, parsing is never the bottleneck for OLTP workloads at realistic connection concurrency.
Semantic Analyzer
The semantic analyzer is the stage between parsing and execution. The parser produces
an AST where every column reference has col_idx = 0 as a placeholder. The analyzer:
- Validates all table and column names against the catalog.
- Resolves each col_idx to the correct position in the combined row produced by the FROM and JOIN clauses.
- Reports structured errors for unknown tables, unknown columns, and ambiguous unqualified column names.
- Applies the current database + schema defaults before unqualified table resolution.
The public compatibility entry point is:
#![allow(unused)]
fn main() {
analyze(stmt, storage, snapshot) -> Result<Stmt, DbError>
}
Internally, the multi-database-aware entry point is:
#![allow(unused)]
fn main() {
analyze_with_defaults(stmt, storage, snapshot, default_database, default_schema)
}
The compatibility wrapper currently uses ("axiomdb", "public").
BindContext — Resolution State
BindContext is built from the FROM and JOIN clauses of a SELECT before any column
reference is resolved.
#![allow(unused)]
fn main() {
struct BindContext {
tables: Vec<BoundTable>,
}
struct BoundTable {
alias: Option<String>, // FROM users AS u → alias = Some("u")
name: String, // real table name in the catalog
columns: Vec<ColumnDef>, // columns in declaration order (from CatalogReader)
col_offset: usize, // start position in the combined row
}
}
Building the BindContext
Each table in the FROM clause is added in left-to-right order. The col_offset
of each table is the sum of the column counts of all tables added before it.
FROM users u JOIN orders o ON u.id = o.user_id
Table 1: users (4 columns: id, name, age, email) → col_offset = 0
Table 2: orders (4 columns: id, user_id, total, status) → col_offset = 4
Combined row layout:
col 0 u.id
col 1 u.name
col 2 u.age
col 3 u.email
col 4 o.id
col 5 o.user_id
col 6 o.total
col 7 o.status
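The left-to-right offset computation can be sketched as follows (a simplified BoundTable without alias or catalog wiring; an illustration, not the real builder):

```rust
// Each table's col_offset is the sum of the column counts before it.
struct BoundTable {
    name: String,
    columns: Vec<String>,
    col_offset: usize, // start position in the combined row
}

fn build_context(tables: &[(&str, &[&str])]) -> Vec<BoundTable> {
    let mut out = Vec::new();
    let mut offset = 0;
    for (name, cols) in tables {
        out.push(BoundTable {
            name: name.to_string(),
            columns: cols.iter().map(|c| c.to_string()).collect(),
            col_offset: offset,
        });
        offset += cols.len(); // next table starts after this one's columns
    }
    out
}

fn main() {
    let users: &[&str] = &["id", "name", "age", "email"];
    let orders: &[&str] = &["id", "user_id", "total", "status"];
    let ctx = build_context(&[("users", users), ("orders", orders)]);
    assert_eq!(ctx[0].col_offset, 0);
    assert_eq!(ctx[1].col_offset, 4); // so o.total resolves to 4 + 2 = 6
    assert_eq!(ctx[1].name, "orders");
    assert_eq!(ctx[1].columns.len(), 4);
}
```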
Database-Scoped Resolution
Table lookup is now keyed by:
(database, schema, table)
The analyzer threads default_database into every catalog lookup and recursive
subquery analysis. For session-driven execution, that default comes from
SessionContext::effective_database().
Legacy compatibility rule:
if a table has no explicit database binding:
it belongs to axiomdb
So identical SQL text can resolve differently depending on the selected database:
USE analytics;
SELECT * FROM users;
USE axiomdb;
SELECT * FROM users;
MySQL allows DATABASE() to be NULL before any explicit
selection, but AxiomDB still had to keep legacy unqualified table names working.
The analyzer therefore resolves against an effective database with fallback
axiomdb, while the session separately tracks whether the user explicitly
selected a database.
Column Resolution Algorithm
Given a column reference (qualifier, name) from the AST:
Qualified Reference (u.email)
- Find the BoundTable whose alias or name matches qualifier. If no table matches: DbError::TableNotFound { name: qualifier }.
- Within that table’s columns, find the column whose name matches name. If not found: DbError::ColumnNotFound { table: qualifier, column: name }.
- Return col_offset + column_position_within_table.
u.email → users.col_offset (0) + position of "email" in users (3) = 3
o.total → orders.col_offset (4) + position of "total" in orders (2) = 6
Unqualified Reference (name only)
- Search all tables in BindContext for a column named name.
- Collect all matches across all tables.
- If 0 matches: DbError::ColumnNotFound.
- If 1 match: return the resolved col_idx.
- If 2+ matches: DbError::AmbiguousColumn { column: name, candidates: [...] }.
-- Unambiguous: only users has 'name'
SELECT name FROM users JOIN orders ON ...
-- Ambiguous: both users and orders have 'id'
SELECT id FROM users JOIN orders ON ...
-- ERROR 42702: column reference "id" is ambiguous
-- (appears in: users.id, orders.id)
-- Fix: qualify the reference
SELECT users.id FROM users JOIN orders ON ...
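The unqualified-resolution algorithm above can be sketched in Rust (simplified table tuples stand in for the real BindContext; names are assumptions):

```rust
// Resolve an unqualified column: 0 hits → not found, 1 hit → col_idx,
// 2+ hits → ambiguity error with qualified candidates.
enum Resolve {
    NotFound,
    Ok(usize),              // resolved col_idx
    Ambiguous(Vec<String>), // qualified candidates for the error message
}

fn resolve_unqualified(tables: &[(&str, &[&str], usize)], name: &str) -> Resolve {
    let mut hits = Vec::new();
    for (tname, cols, offset) in tables {
        if let Some(i) = cols.iter().position(|c| *c == name) {
            hits.push((format!("{tname}.{name}"), offset + i));
        }
    }
    match hits.len() {
        0 => Resolve::NotFound,
        1 => Resolve::Ok(hits[0].1),
        _ => Resolve::Ambiguous(hits.into_iter().map(|(q, _)| q).collect()),
    }
}

fn main() {
    let users: &[&str] = &["id", "name"];
    let orders: &[&str] = &["id", "total"];
    let ctx = [("users", users, 0), ("orders", orders, 2)];
    assert!(matches!(resolve_unqualified(&ctx, "total"), Resolve::Ok(3)));
    assert!(matches!(resolve_unqualified(&ctx, "id"), Resolve::Ambiguous(_)));
    assert!(matches!(resolve_unqualified(&ctx, "email"), Resolve::NotFound));
}
```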
Subqueries in FROM
Subqueries in the FROM clause (derived tables) are analyzed recursively:
SELECT outer.total
FROM (
SELECT user_id, SUM(total) AS total
FROM orders
WHERE status = 'paid'
GROUP BY user_id
) AS outer
WHERE outer.total > 1000
The inner SELECT is analyzed first, producing a virtual BoundTable whose columns
are the output columns of the subquery (user_id, total). The outer BindContext
then treats this virtual table exactly like a real catalog table.
What the Analyzer Validates per Statement Type
SELECT
- FROM clause: every table reference exists in the catalog (or is a valid subquery).
- JOIN conditions: every column in ON expr resolves correctly against the BindContext.
- WHERE clause: every column reference resolves.
- GROUP BY: every expression resolves.
- HAVING: every column reference resolves (must be either in GROUP BY or aggregate).
- ORDER BY: every expression resolves.
INSERT
- Target table exists in the catalog.
- Each named column in the column list exists in the table.
- If INSERT ... SELECT, the inner SELECT is analyzed.
- Column count in VALUES must match the column list (or all non-DEFAULT columns if no column list is given).
UPDATE
- Target table exists in the catalog.
- Every column in SET assignments exists in the table.
- WHERE clause column references resolve against the target table.
DELETE
- Target table exists in the catalog.
- WHERE clause column references resolve against the target table.
CREATE TABLE
- No table with the same name exists (unless IF NOT EXISTS).
- Each REFERENCES table(col) in a foreign key references a table that exists and a column that exists in that table and is a primary key or unique column.
- CHECK expressions are parsed and type-checked (must evaluate to boolean).
DROP TABLE
- Target table exists (unless IF EXISTS).
- No other table has a foreign key pointing to the target (unless CASCADE).
CREATE INDEX
- Target table exists in the catalog.
- Every indexed column exists in the table.
- No index with the same name already exists (unless IF NOT EXISTS).
CREATE DATABASE / DROP DATABASE / USE / SHOW DATABASES
These statements are mostly pass-through at the analyzer layer:
- CREATE DATABASE and DROP DATABASE carry names but no column bindings
- USE is validated against the database catalog at execution/wire time
- SHOW DATABASES produces a computed rowset and needs no name resolution
Error Types
| Error | SQLSTATE | When it occurs |
|---|---|---|
TableNotFound | 42P01 | FROM, JOIN, or REFERENCES points to unknown table |
ColumnNotFound | 42703 | Column name not in any in-scope table |
AmbiguousColumn | 42702 | Unqualified column matches in multiple tables |
DuplicateTable | 42P07 | CREATE TABLE for an existing table |
TypeMismatch | 42804 | Expression type incompatible with column type |
Snapshot Isolation in the Analyzer
The analyzer calls CatalogReader::list_tables and CatalogReader::list_columns
with the caller’s TransactionSnapshot. This means the analyzer sees the schema as
it appeared at the start of the current transaction, not the latest committed schema.
This ensures that:
- A concurrent DDL (CREATE TABLE) that commits after the current transaction began is invisible to the current transaction’s analyzer.
- Schema changes within the same transaction are visible to subsequent statements in that same transaction.
Post-Analysis AST
After analysis, every Expr::Column in the AST has its col_idx set to the correct
position in the combined row. The executor uses col_idx to index directly into the
row array — no name lookup occurs at execution time.
#![allow(unused)]
fn main() {
// Before analysis (from parser):
Expr::Column { name: "total".to_string(), table: Some("o".to_string()), col_idx: 0 }
// After analysis (from analyzer):
Expr::Column { name: "total".to_string(), table: Some("o".to_string()), col_idx: 6 }
// col_idx = orders.col_offset (4) + position of "total" in orders (2)
}
This separation of concerns means the executor is a pure interpreter over the analyzed AST — it never touches the catalog and never performs name resolution. All validation errors are caught before any I/O begins.
SQL Executor
The executor is the component that interprets an analyzed Stmt (all column
references resolved to col_idx by the semantic analyzer) and drives it to
completion, returning a QueryResult. It is the highest-level component in the
query pipeline.
Since subphase 5.19a, the executor no longer lives in a single source file.
It is organized under crates/axiomdb-sql/src/executor/ with mod.rs as the
stable facade and responsibility-based source files behind it.
Source Layout
| File | Responsibility |
|---|---|
| `executor/mod.rs` | public facade, statement dispatch, thread-local last-insert-id |
| `executor/shared.rs` | helpers shared across multiple statement families |
| `executor/select.rs` | SELECT entrypoints, projection, ORDER BY/LIMIT wiring |
| `executor/joins.rs` | nested-loop join execution and join-specific metadata |
| `executor/aggregate.rs` | GROUP BY, aggregates, DISTINCT/group-key helpers |
| `executor/insert.rs` | INSERT and INSERT … SELECT paths |
| `executor/update.rs` | UPDATE execution |
| `executor/delete.rs` | DELETE execution and candidate collection |
| `executor/bulk_empty.rs` | shared bulk-empty helpers for DELETE/TRUNCATE |
| `executor/ddl.rs` | DDL, SHOW, ANALYZE, TRUNCATE |
| `executor/staging.rs` | transactional INSERT staging flushes and barrier handling |
Integration Test Layout
The executor integration coverage no longer sits in one giant test binary. The
current axiomdb-sql/tests/ layout is responsibility-based, mirroring the
module split in src/executor/.
| Binary | Main responsibility |
|---|---|
| `integration_executor` | CRUD base and simple transaction behavior |
| `integration_executor_joins` | JOINs and aggregate execution |
| `integration_executor_query` | ORDER BY, LIMIT, DISTINCT, CASE, INSERT ... SELECT, AUTO_INCREMENT |
| `integration_executor_ddl` | SHOW, DESCRIBE, TRUNCATE, ALTER TABLE |
| `integration_executor_ctx` | base SessionContext execution and strict_mode |
| `integration_executor_ctx_group` | ctx-path sorted group-by |
| `integration_executor_ctx_limit` | ctx-path LIMIT / OFFSET coercion |
| `integration_executor_ctx_on_error` | ctx-path on_error behavior |
| `integration_executor_sql` | broader SQL semantics outside the ctx path |
| `integration_delete_apply` | bulk and indexed DELETE apply paths |
| `integration_insert_staging` | transactional INSERT staging |
| `integration_namespacing` | database catalog behavior: CREATE/DROP DATABASE, USE, SHOW DATABASES |
| `integration_namespacing_cross_db` | explicit database.schema.table resolution and cross-db DML/DDL |
| `integration_namespacing_schema` | schema namespacing, search_path, and schema-aware SHOW TABLES |
Shared helpers live in crates/axiomdb-sql/tests/common/mod.rs.
The day-to-day workflow is intentionally narrow:
- start with the smallest binary that matches the code path you changed
- add directly related binaries only when the change touches shared helpers or a nearby execution path
- use `cargo test -p axiomdb-sql --tests` as the crate-level confidence gate, not as the default inner-loop command
- if a new behavior belongs to an existing themed binary, add the test there instead of creating a new binary immediately
cargo test -p axiomdb-sql --test integration_executor_query
cargo test -p axiomdb-sql --test integration_executor_query test_insert_select_aggregation -- --exact
UPDATE Apply Fast Path (6.20)
6.17 fixed indexed UPDATE discovery, but the default update_range benchmark
was still paying most of its cost after rows had already been found. 6.20
removes that apply-side overhead in four steps:
- `IndexLookup`/`IndexRange` candidates are decoded through `TableEngine::read_rows_batch(...)`, which groups `RecordId`s by `page_id` and restores the original RID order after each page is read once.
- UPDATE evaluates the new row image before touching the heap and drops rows whose `new_values == old_values`.
- Stable-RID rewrites accumulate `(key, old_tuple_image, new_tuple_image, page_id, slot_id)` and emit their normal `UpdateInPlace` WAL records through one `record_update_in_place_batch(...)` call.
- If any index really is affected, UPDATE now does one grouped delete pass, one grouped insert pass, and one final root persistence write per index.
The coarse executor bailout is statement-level:
if all physically changed rows keep the same RID
and no SET column overlaps any index key column
and no SET column overlaps any partial-index predicate dependency:
skip index maintenance for the statement
This is the common PK-only `UPDATE score WHERE id BETWEEN ...` case in
`local_bench.py`.
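The bailout condition above can be written as a small pure predicate. This is an illustrative sketch: the name and signature are invented for clarity, not the executor's actual API.

```rust
/// Illustrative sketch of the statement-level index-maintenance bailout.
/// `set_cols`: column indexes assigned by SET clauses.
/// `index_key_cols` / `partial_pred_cols`: per-index key columns and
/// partial-index predicate dependencies, as resolved col_idx lists.
fn can_skip_index_maintenance(
    all_changed_rids_stable: bool,
    set_cols: &[usize],
    index_key_cols: &[&[usize]],
    partial_pred_cols: &[&[usize]],
) -> bool {
    let overlaps = |cols: &&[usize]| cols.iter().any(|c| set_cols.contains(c));
    all_changed_rids_stable
        && !index_key_cols.iter().any(overlaps)
        && !partial_pred_cols.iter().any(overlaps)
}

fn main() {
    // PK-only UPDATE of a non-indexed column: maintenance skipped.
    assert!(can_skip_index_maintenance(true, &[2], &[&[0]], &[]));
    // SET touches an index key column: maintenance required.
    assert!(!can_skip_index_maintenance(true, &[0], &[&[0]], &[]));
}
```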
PostgreSQL's HOT (heap-only tuple) update check, `indexColumnIsBeingUpdated()`-style predicates in other engines, and MariaDB's clustered-vs-secondary UPDATE split all ask the same question: did any index-relevant attribute actually change? `6.20` adapts that rule directly, without adding HOT chains or change buffering.
Entry Point
#![allow(unused)]
fn main() {
pub fn execute(
stmt: Stmt,
storage: &mut dyn StorageEngine,
txn: &mut TxnManager,
) -> Result<QueryResult, DbError>
}
When no transaction is active, execute wraps the statement in an implicit
BEGIN / COMMIT (autocommit mode). Transaction control statements (BEGIN,
COMMIT, ROLLBACK) bypass autocommit and operate on TxnManager directly.
All reads use txn.active_snapshot()? — a snapshot fixed at BEGIN — so that
writes made earlier in the same transaction are visible (read-your-own-writes).
Transactional INSERT staging (Phase 5.21)
5.21 adds a statement-boundary staging path for consecutive
INSERT ... VALUES statements inside one explicit transaction.
Data structure
SessionContext now owns:
#![allow(unused)]
fn main() {
PendingInsertBatch {
table_id: u32,
table_def: TableDef,
columns: Vec<ColumnDef>,
indexes: Vec<IndexDef>,
compiled_preds: Vec<Option<Expr>>,
rows: Vec<Vec<Value>>,
unique_seen: HashMap<u32, HashSet<Vec<u8>>>,
}
}
The buffer exists only while the connection is inside an explicit transaction. Autocommit-wrapped single statements do not use it.
Enqueue path
For every eligible INSERT row, executor/insert.rs does all logical work up
front:
- evaluate expressions
- expand omitted columns
- assign AUTO_INCREMENT if needed
- run CHECK constraints
- run FK child validation
- reject duplicate UNIQUE / PK keys against:
  - committed index state
  - `unique_seen` inside the current batch
- append the fully materialized row to `PendingInsertBatch.rows`
No heap write or WAL append happens yet.
Flush barriers
The batch is flushed before:
- `SELECT`
- `UPDATE`
- `DELETE`
- DDL
- `COMMIT`
- table switch to another `INSERT` target
- any ineligible INSERT shape
ROLLBACK discards the batch without heap or WAL writes.
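A minimal sketch of the barrier decision, assuming a simplified statement enum (the real executor inspects the analyzed `Stmt`; names here are invented):

```rust
/// Simplified model of the statement shapes the staging barrier cares about.
#[allow(dead_code)]
enum NextStmt<'a> {
    ValuesInsert { table: &'a str, eligible: bool },
    Select,
    Update,
    Delete,
    Ddl,
    Commit,
}

/// Flush the staged batch unless the next statement continues it:
/// an eligible VALUES INSERT into the same table.
fn must_flush_before(next: &NextStmt, staged_table: &str) -> bool {
    match next {
        NextStmt::ValuesInsert { table, eligible: true } if *table == staged_table => false,
        _ => true,
    }
}

fn main() {
    // Same table, eligible shape: the batch keeps accumulating.
    assert!(!must_flush_before(&NextStmt::ValuesInsert { table: "t", eligible: true }, "t"));
    // Table switch forces a flush.
    assert!(must_flush_before(&NextStmt::ValuesInsert { table: "u", eligible: true }, "t"));
    // Reads are barriers too: staged rows must be visible to the SELECT.
    assert!(must_flush_before(&NextStmt::Select, "t"));
}
```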
Savepoint ordering invariant
When a transaction uses statement-level savepoints (rollback_statement,
savepoint, ignore), the executor must flush staged rows before taking
the next statement savepoint if the current statement cannot continue the batch.
Without that ordering, a failing statement after a table switch could roll back rows that logically belonged to earlier successful INSERT statements.
Flush algorithm
executor/staging.rs performs:
- `TableEngine::insert_rows_batch_with_ctx(...)`
- `batch_insert_into_indexes(...)`
- one `CatalogWriter::update_index_root(...)` per changed index
- stats update
The current design still inserts index entries row-by-row inside the flush. That cost is explicit and remains the next insert-side optimization candidate if future profiling shows it dominates after staging.
ClusteredInsertBatch (Phase 40.1)
Phase 40.1 extends the PendingInsertBatch pattern to clustered (primary-key
ordered) tables, eliminating the per-row CoW B-tree overhead that made clustered
inserts 2× slower than heap inserts inside explicit transactions.
Root cause of the pre-40.1 gap
Before 40.1, every clustered INSERT inside an explicit transaction called
apply_clustered_insert_rows immediately, which performs:
- `storage.read_page(root)` — 16 KB page read
- `storage.write_page(new_root, page)` — 16 KB CoW page write
- WAL append
- Secondary index write
For N = 50 000 rows that is 100 000 storage operations just for the base tree.
Data structures
#![allow(unused)]
fn main() {
// session.rs
pub struct StagedClusteredRow {
pub values: Vec<Value>,
pub encoded_row: Vec<u8>,
pub primary_key_values: Vec<Value>,
pub primary_key_bytes: Vec<u8>,
}
pub struct ClusteredInsertBatch {
pub table_id: u32,
pub table_def: TableDef,
pub primary_idx: IndexDef,
pub secondary_indexes: Vec<IndexDef>,
pub secondary_layouts: Vec<ClusteredSecondaryLayout>,
pub compiled_preds: Vec<Option<Expr>>,
pub rows: Vec<StagedClusteredRow>,
pub staged_pks: HashSet<Vec<u8>>, // O(1) intra-batch PK dedup
}
}
StagedClusteredRow is structurally identical to PreparedClusteredInsertRow
(defined in clustered_table.rs) but lives in session.rs to avoid a circular
dependency: clustered_table.rs imports SessionContext.
Enqueue path (enqueue_clustered_insert_ctx)
For each row in the VALUES list:
- Evaluate expressions, expand columns, assign AUTO_INCREMENT.
- Validate CHECK constraints and FK child references.
- Encode via
prepare_row_with_ctx(coerce + PK extract + row codec). - Check
staged_pks— returnUniqueViolationand discard batch on intra-batch PK duplicate. - Push
StagedClusteredRowand insert PK bytes intostaged_pks.
Committed-data PK duplicates are caught at flush time by lookup_physical inside
apply_clustered_insert_rows (same as the pre-40.1 single-statement path).
Flush path (flush_clustered_insert_batch)
1. Sort staged rows ascending by pk_bytes
→ enables append-biased detection in apply_clustered_insert_rows
2. Convert StagedClusteredRow → PreparedClusteredInsertRow (field move)
3. Call apply_clustered_insert_rows (existing function):
a. detect append-biased pattern (all PKs increasing)
b. loop: try_insert_rightmost_leaf_batch → fast O(leaf_capacity) write
fallback: single-row clustered_tree::insert
c. WAL record_clustered_insert per row
d. maintain_clustered_secondary_inserts per row
e. persist changed roots
4. ctx.stats.on_rows_changed + ctx.invalidate_all
Sorting the staged rows by PK bytes in step 1 is what allows
try_insert_rightmost_leaf_batch to fill each leaf page once.
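The append-biased detection in step 3a can be sketched as follows. This is a hypothetical helper: whether the batch must also sort after the tree's current rightmost key, modeled here as `tree_max_pk`, is an assumption of this sketch.

```rust
/// Sketch: a staged batch is "append-biased" when its PK byte strings are
/// strictly increasing and all sort after the tree's current rightmost key.
/// (`tree_max_pk` handling is an assumption of this sketch.)
fn is_append_biased(sorted_pks: &[&[u8]], tree_max_pk: Option<&[u8]>) -> bool {
    let strictly_increasing = sorted_pks.windows(2).all(|w| w[0] < w[1]);
    let after_existing = match (sorted_pks.first(), tree_max_pk) {
        (Some(first), Some(max)) => *first > max,
        _ => true, // empty batch or empty tree
    };
    strictly_increasing && after_existing
}

fn main() {
    let batch: &[&[u8]] = &[b"0001", b"0002"];
    assert!(is_append_biased(batch, Some(b"0000")));
    // Keys that interleave with existing data are not rightmost appends.
    assert!(!is_append_biased(batch, Some(b"0005")));
}
```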
Barrier detection
should_flush_clustered_batch_before_stmt returns false only when the next
statement is a VALUES INSERT into the same clustered table (batch continues).
For all other statements, the batch is flushed before dispatch. This mirrors the
existing should_flush_pending_inserts_before_stmt logic for heap tables.
ROLLBACK discards the batch via discard_clustered_insert_batch() — no storage
writes, no WAL entries, no undo needed.
CREATE INDEX on clustered tables (Phase 40.1b)
execute_create_index (ddl.rs) now handles both heap and clustered tables with a
single function. The dispatch happens after the B-Tree root page is allocated:
if table_def.is_clustered() {
primary_idx ← CatalogReader::list_indexes → find(is_primary)
preview_def ← IndexDef { columns, is_unique, fillfactor, root_page_id, … }
layout ← ClusteredSecondaryLayout::derive(&preview_def, &primary_idx)
rows ← scan_clustered_table(storage, &table_def, &col_defs, snap)
for row in rows:
if partial predicate → skip non-matching rows
entry = layout.entry_from_row(row) → physical_key for bloom
layout.insert_row(storage, &root_pid, row) → uniqueness + B-Tree insert
} else {
rows ← scan_table(…)
for row in rows: encode_index_key + BTree::insert_in
}
// step 8: stats bootstrap uses same `rows` Vec — no extra I/O
The ClusteredSecondaryLayout encodes the physical key as
secondary_cols ++ suffix_primary_cols — exactly the format used by runtime
INSERT/UPDATE/DELETE, so a clustered secondary index built by CREATE INDEX
is byte-for-byte compatible with those written by the DML executors.
entry_from_row is called once per row to collect the physical key for the bloom
filter, and insert_row calls it again internally for the B-Tree write. This is
acceptable overhead during a DDL operation (O(n) with constant factor ≈2).
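The concatenated physical-key layout can be sketched with the column codec abstracted away as pre-encoded byte strings (the real code goes through `encode_index_key` and `ClusteredSecondaryLayout`):

```rust
/// Sketch of the clustered-secondary physical key:
/// encoded secondary columns followed by the encoded primary-key suffix.
fn physical_secondary_key(encoded_secondary: &[u8], encoded_pk_suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(encoded_secondary.len() + encoded_pk_suffix.len());
    key.extend_from_slice(encoded_secondary); // secondary_cols
    key.extend_from_slice(encoded_pk_suffix); // suffix_primary_cols (tiebreaker)
    key
}

fn main() {
    // Two rows with the same secondary value stay globally unique via the PK suffix.
    let a = physical_secondary_key(b"alice", b"\x00\x01");
    let b = physical_secondary_key(b"alice", b"\x00\x02");
    assert_ne!(a, b);
    // PK order breaks ties within equal secondary values.
    assert!(a < b);
}
```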
NULL handling
ClusteredSecondaryLayout::entry_from_row returns None when any secondary column
is NULL. Both the bloom key collection and the B-Tree insert are skipped in that case,
consistent with the runtime INSERT path and SQL standard NULL semantics for indexes.
Uniqueness enforcement
insert_row delegates uniqueness to ensure_unique_logical_key_absent, the same
function used at runtime. If an existing row already carries that logical key, the
build fails with DbError::UniqueViolation before the catalog entry is written.
Query Pipeline
SQL string
→ tokenize() logos DFA, ~85 tokens, zero-copy &str
→ parse() recursive descent, produces Stmt with col_idx = 0
→ analyze() BindContext resolves every col_idx
→ execute() dispatches to per-statement handler
├── scan_table HeapChain::scan_visible + decode_row
├── filter eval(WHERE, &row) + is_truthy
├── join nested-loop, apply_join
├── aggregate hash-based GroupState
├── sort apply_order_by, compare_sort_values
├── deduplicate apply_distinct, value_to_key_bytes
├── project project_row / project_grouped_row
└── paginate apply_limit_offset
→ QueryResult::Rows / Affected / Empty
JOIN — Nested Loop
Phase 4 implements nested-loop joins. All tables are pre-scanned once before any loop begins — scanning inside the inner loop would re-read the same data O(n) times and could see partially-inserted rows.
Algorithm
scanned[0] = scan(FROM table)
scanned[1] = scan(JOIN[0] table)
...
combined_rows = scanned[0]
for each JoinClause in stmt.joins:
combined_rows = apply_join(combined_rows, scanned[i+1], join_type, ON/USING)
apply_join per type
| Join type | Behavior |
|---|---|
| INNER / CROSS | Emit combined row for each pair where ON is truthy |
| LEFT | Emit all left rows; unmatched left → right side padded with NULL |
| RIGHT | Emit all right rows; unmatched right → left side padded with NULL; uses a `matched_right: Vec<bool>` bitset |
| FULL | NotImplemented — Phase 4.8+ |
USING condition
USING(col_name) is resolved at execution time using left_schema: Vec<(name, col_idx)>,
accumulated across all join stages. The condition combined[left_idx] == combined[right_idx]
uses SQL equality — NULL = NULL returns UNKNOWN (false), so NULLs never match in USING.
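The UNKNOWN-never-matches rule can be modeled with `Option<bool>` as three-valued logic. This is illustrative only; the engine evaluates the condition through its `Value` evaluator.

```rust
/// Three-valued SQL equality: None models UNKNOWN.
fn sql_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None, // any NULL operand yields UNKNOWN
    }
}

/// USING keeps a pair only when equality is definitely TRUE;
/// UNKNOWN counts as a non-match.
fn using_matches(a: Option<i64>, b: Option<i64>) -> bool {
    sql_eq(a, b) == Some(true)
}

fn main() {
    assert!(using_matches(Some(1), Some(1)));
    assert!(!using_matches(Some(1), Some(2)));
    // NULL = NULL is UNKNOWN, so NULLs never join in USING.
    assert!(!using_matches(None, None));
}
```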
GROUP BY — Strategy Selection (Phase 4.9b)
The executor selects between two GROUP BY execution strategies at runtime:
| Strategy | When selected | Behavior |
|---|---|---|
| Hash | Default; JOINs; derived tables; plain scans | HashMap per group key; O(k) memory |
| Sorted { presorted: true } | Single-table ctx path + compatible B-Tree index | Stream adjacent equal groups; O(1) memory |
#![allow(unused)]
fn main() {
enum GroupByStrategy {
Hash,
Sorted { presorted: bool },
}
}
Strategy selection (choose_group_by_strategy_ctx) is only active on the
single-table ctx path (execute_with_ctx). All JOIN, derived-table, and
non-ctx paths use Hash.
Prefix Match Rule
The sorted strategy is selected when all four conditions hold:
- Access method is `IndexLookup`, `IndexRange`, or `IndexOnlyScan`.
- Every `GROUP BY` expression is a plain `Expr::Column` (no function calls, no aliases).
- The column references match the leading key prefix of the chosen index in the same order.
- The prefix length ≤ number of index columns.
Examples (index (region, dept)):
| GROUP BY | Result |
|---|---|
| region, dept | ✅ Sorted |
| region | ✅ Sorted (prefix) |
| dept, region | ❌ Hash (wrong order) |
| LOWER(region) | ❌ Hash (computed expression) |
This is correct because BTree::range_in guarantees rows arrive in key order,
and equal leading prefixes are contiguous even with extra suffix columns or RID
suffixes on non-unique indexes.
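Conditions 3 and 4 reduce to a prefix check over resolved `col_idx` lists. A sketch (name invented; it does not model the plain-`Expr::Column` condition):

```rust
/// Sketch of the prefix-match check: GROUP BY columns must equal the
/// leading key columns of the chosen index, in order.
fn sorted_strategy_eligible(group_by_cols: &[usize], index_cols: &[usize]) -> bool {
    group_by_cols.len() <= index_cols.len()
        && group_by_cols.iter().zip(index_cols.iter()).all(|(g, i)| g == i)
}

fn main() {
    let index = [0, 1]; // index (region, dept) as resolved col_idx values
    assert!(sorted_strategy_eligible(&[0, 1], &index)); // region, dept
    assert!(sorted_strategy_eligible(&[0], &index));    // region (prefix)
    assert!(!sorted_strategy_eligible(&[1, 0], &index)); // dept, region: wrong order
    assert!(!sorted_strategy_eligible(&[0, 1, 2], &index)); // longer than the index
}
```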
PostgreSQL implements GroupAggregate (sorted) and HashAggregate as separate strategies selected at planning time (pathnodes.h). DuckDB selects the aggregation strategy at physical plan time based on input guarantees. AxiomDB borrows the two-strategy concept but selects at execution time using the already-chosen access method — no separate planner pass needed.
GROUP BY — Hash Aggregation
GROUP BY uses a single-pass hash aggregation strategy: one scan through the filtered rows, accumulating aggregate state per group key.
Specialized Hash Tables (subphase 39.21)
Two hash table types avoid generic dispatch overhead:
- `GroupTablePrimitive` — single-column GROUP BY on integer-like values (`INT`, `BIGINT`, `DOUBLE`, `Bool`). Maps `i64` → `GroupEntry` via `hashbrown::HashMap<i64, usize>`. No key serialization needed; comparison is a single integer equality check.
- `GroupTableGeneric` — multi-column GROUP BY, TEXT columns, mixed types, and the global no-GROUP-BY case. Serializes group keys into a `Vec<u8>` reused across rows (zero allocation when capacity fits), maps `&[u8]` → `GroupEntry` via `hashbrown::HashMap<Box<[u8]>, usize>`.
Both tables store entries in a Vec<GroupEntry> and use the hash maps as index
structures. This keeps entries contiguous in memory and avoids pointer chasing
during the accumulation loop.
hashbrown (the same table backing Rust's std::HashMap) uses SIMD-accelerated quadratic probing (SSE2/NEON). For a 62-group workload over 50K rows, this cuts probe overhead by ~30% vs a naïve open-addressing table. The specialized GroupTablePrimitive path avoids serialization entirely, reducing per-row work to one integer hash + one equality check.
Group Key Serialization
Value contains f64 which does not implement Hash in Rust. AxiomDB uses a
custom self-describing byte serialization instead of the row codec:
value_to_key_bytes(Value::Null) → [0x00]
value_to_key_bytes(Value::Int(n)) → [0x02, n as 4 LE bytes]
value_to_key_bytes(Value::Text(s)) → [0x06, len as 4 LE bytes, UTF-8 bytes]
...
Two NULL values produce identical bytes [0x00] → they form one group.
This matches SQL GROUP BY semantics: NULLs are considered equal for grouping
(unlike NULL = NULL in comparisons, which is UNKNOWN).
The group key for a multi-column GROUP BY is the concatenation of all column
serializations. The key_buf: Vec<u8> is allocated once before the scan loop
and reused (with clear() + extend_from_slice) for every row, so multi-column
GROUP BY does not allocate per row for the probe step.
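A minimal sketch of the serialization, using the tag bytes shown above. The real `Value` enum has more variants; this models only the three tags the example lists.

```rust
/// Reduced Value enum for the sketch: 0x00 NULL, 0x02 INT, 0x06 TEXT.
enum Value {
    Null,
    Int(i32),
    Text(String),
}

/// Append a self-describing key encoding of `v` to `out`.
fn value_to_key_bytes(v: &Value, out: &mut Vec<u8>) {
    match v {
        Value::Null => out.push(0x00),
        Value::Int(n) => {
            out.push(0x02);
            out.extend_from_slice(&n.to_le_bytes()); // 4 LE bytes
        }
        Value::Text(s) => {
            out.push(0x06);
            out.extend_from_slice(&(s.len() as u32).to_le_bytes()); // len as 4 LE bytes
            out.extend_from_slice(s.as_bytes());
        }
    }
}

fn main() {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    value_to_key_bytes(&Value::Null, &mut a);
    value_to_key_bytes(&Value::Null, &mut b);
    assert_eq!(a, b); // two NULLs serialize identically, so they form one group
    assert_eq!(a, vec![0x00]);

    let mut i = Vec::new();
    value_to_key_bytes(&Value::Int(7), &mut i);
    assert_eq!(i, vec![0x02, 7, 0, 0, 0]);

    let mut t = Vec::new();
    value_to_key_bytes(&Value::Text("hi".into()), &mut t);
    assert_eq!(t, vec![0x06, 2, 0, 0, 0, b'h', b'i']);
}
```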
GroupEntry
Each unique group key maps to a GroupEntry:
#![allow(unused)]
fn main() {
struct GroupEntry {
key_values: Vec<Value>, // GROUP BY expression results (for output)
non_agg_col_values: Vec<Value>, // non-aggregate SELECT cols (for HAVING/output)
accumulators: Vec<AggAccumulator>,
}
}
non_agg_col_values is a sparse slice: only columns referenced by non-aggregate
SELECT items or HAVING expressions are stored. Their indices are pre-computed once
(compute_non_agg_col_indices) before the scan loop and reused for every group.
An earlier design stored representative_row: Row — the full first source row per group — to resolve HAVING column references. This costs one full Vec<Value> clone per group, regardless of how many columns HAVING actually needs. non_agg_col_values instead stores only the columns referenced by non-aggregate SELECT items and HAVING, computed once before the scan loop. For a 6-column table where HAVING references 1 column, this reduces per-group memory by ~83%.
Aggregate Accumulators
| Aggregate | Accumulator | NULL behavior |
|---|---|---|
| COUNT(*) | u64 counter | Increments for every row |
| COUNT(col) | u64 counter | Skips rows where col is NULL |
| SUM(col) | Option<Value> | Skips NULL; None if all rows are NULL |
| MIN(col) | Option<Value> | Skips NULL; tracks running minimum |
| MAX(col) | Option<Value> | Skips NULL; tracks running maximum |
| AVG(col) | (sum: Value, count: u64) | Skips NULL; final = sum / count as Real |
AVG always returns Real (SQL standard), even for integer columns. This
avoids integer truncation (MySQL-style AVG(INT) returns DECIMAL but truncates
in many contexts). AVG of all-NULL rows returns NULL.
Fast-path arithmetic (value_agg_add): For SUM, MIN, MAX, and COUNT,
the accumulator is updated via direct arithmetic on Value variants, bypassing
eval(). This eliminates the expression evaluator overhead for the innermost
loop of the aggregate scan.
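The AVG row of the table can be sketched as a standalone accumulator, simplified to `f64` (the engine accumulates `Value`s and divides through its numeric promotion rules):

```rust
/// Sketch of the AVG accumulator contract: skip NULLs, return Real,
/// NULL for all-NULL (or empty) input.
#[derive(Default)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    fn update(&mut self, v: Option<f64>) {
        if let Some(x) = v {
            self.sum += x;
            self.count += 1;
        }
        // NULL inputs are skipped entirely.
    }

    fn finalize(&self) -> Option<f64> {
        if self.count == 0 {
            None // AVG over all-NULL rows is NULL
        } else {
            Some(self.sum / self.count as f64) // always Real, no integer truncation
        }
    }
}

fn main() {
    let mut acc = AvgAccumulator::default();
    for v in [Some(1.0), None, Some(2.0)] {
        acc.update(v);
    }
    assert_eq!(acc.finalize(), Some(1.5)); // NULL skipped: (1 + 2) / 2
    assert_eq!(AvgAccumulator::default().finalize(), None);
}
```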
Ungrouped Aggregates
SELECT COUNT(*) FROM t (no GROUP BY) is handled as a single-group query with
an empty key. Even on an empty table, the executor emits exactly one output
row — (0) for COUNT(*), NULL for SUM/MIN/MAX/AVG. This matches the
SQL standard and every major database.
Column Decode Mask
Before scanning, collect_expr_columns walks all expressions in SELECT items,
WHERE, GROUP BY, HAVING, and ORDER BY to build a Vec<bool> mask indexed by
column position. Only columns with mask[i] == true are decoded from the row
bytes. For a SELECT age, AVG(score) FROM users GROUP BY age query on a
6-column table, this skips decoding name and email (TEXT fields) entirely.
The mask is forwarded to scan_clustered_table_masked as Option<&[bool]> and
passed into decode_row_masked at the codec level, which skips variable-length
fields that are not needed.
GROUP BY — Sorted Streaming Executor (Phase 4.9b)
The sorted executor replaces the hash table with a single linear pass over pre-ordered rows, accumulating state for the current group and emitting it when the key changes.
Algorithm
rows_with_keys = [(row, eval(group_by exprs, row)) for row in combined_rows]
if !presorted:
stable_sort rows_with_keys by compare_group_key_lists
current_key = rows_with_keys[0].key_values
current_accumulators = AggAccumulator::new() for each aggregate
update accumulators with rows_with_keys[0].row
for next in rows_with_keys[1..]:
if group_keys_equal(current_key, next.key_values):
update accumulators with next.row
else:
finalize → apply HAVING → emit output row
reset: current_key = next.key_values, new accumulators, update
finalize last group
Key Comparison
#![allow(unused)]
fn main() {
fn compare_group_key_lists(a: &[Value], b: &[Value]) -> Ordering
fn group_keys_equal(a: &[Value], b: &[Value]) -> bool
}
Uses compare_values_null_last so NULL == NULL for grouping (consistent with
the hash path’s serialization). Comparison is left-to-right: returns the first
non-Equal ordering.
Shared Aggregate Machinery
Both hash and sorted executors reuse the same:
- `AggAccumulator` (state, update, finalize)
- `eval_with_aggs` (HAVING evaluation)
- `project_grouped_row` (output projection)
- `build_grouped_column_meta` (column metadata)
- GROUP_CONCAT handling
- Post-group DISTINCT / ORDER BY / LIMIT
The payoff: O(1) accumulator memory (one group at a time) instead of O(k) where k = distinct groups. For a high-cardinality grouping column with many distinct values, this eliminates the entire hash table allocation.
ORDER BY — Multi-Column Sort
ORDER BY is applied after scan + filter + aggregation but before projection
for non-GROUP BY queries. For GROUP BY queries, it is applied to the projected
output rows after remap_order_by_for_grouped rewrites column references.
ORDER BY in GROUP BY Context — Expression Remapping
Grouped output rows are indexed by SELECT output position: position 0 = first
SELECT item, position 1 = second, etc. ORDER BY expressions, however, are
analyzed against the source schema where Expr::Column { col_idx } refers to
the original table column.
remap_order_by_for_grouped fixes this mismatch before calling apply_order_by:
remap_order_by_for_grouped(order_by, select_items):
for each ORDER BY item:
rewrite expr via remap_expr_for_grouped(expr, select_items)
remap_expr_for_grouped(expr, select_items):
if expr == select_items[pos].expr (structural PartialEq):
return Column { col_idx: pos } // output position
match expr:
BinaryOp → recurse into left, right
UnaryOp → recurse into operand
IsNull → recurse into inner
Between → recurse into expr, low, high
Function → recurse into args
other → return unchanged
This means ORDER BY dept (where dept is Expr::Column{col_idx:1} in the
source) becomes Expr::Column{col_idx:0} when the SELECT is SELECT dept, COUNT(*),
correctly indexing into the projected output row.
Aggregate expressions like ORDER BY COUNT(*) are matched structurally:
if Expr::Function{name:"count", args:[]} appears in the SELECT at position 1,
it is rewritten to Expr::Column{col_idx:1}.
AxiomDB relies on the derived PartialEq on Expr to identify ORDER BY expressions that match SELECT items. This is simpler than PostgreSQL's SortClause/TargetEntry reference system and correct for the common cases (column references, aggregates, compound expressions).
NULL Ordering Defaults (PostgreSQL-compatible)
| Direction | Default | Override |
|---|---|---|
| ASC | NULLs LAST | NULLS FIRST |
| DESC | NULLs FIRST | NULLS LAST |
compare_sort_values(a, b, direction, nulls_override):
nulls_first = explicit_nulls_order OR (DESC && no explicit)
if a = NULL and b = NULL → Equal
if a = NULL → Less if nulls_first, else Greater
if b = NULL → Greater if nulls_first, else Less
otherwise → compare a and b, reverse if DESC
Non-NULL comparison delegates to eval(BinaryOp{Lt}, Literal(a), Literal(b))
via the expression evaluator, reusing all type coercion and promotion logic.
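The comparison pseudocode translates almost directly to Rust. This sketch simplifies values to `Option<i64>` and skips the evaluator delegation for the non-NULL case:

```rust
use std::cmp::Ordering;

/// Sketch of the NULL-ordering rules: ASC defaults to NULLS LAST,
/// DESC to NULLS FIRST, with an explicit override.
fn compare_sort_values(
    a: Option<i64>,
    b: Option<i64>,
    desc: bool,
    nulls_first_override: Option<bool>,
) -> Ordering {
    let nulls_first = nulls_first_override.unwrap_or(desc);
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => if nulls_first { Ordering::Less } else { Ordering::Greater },
        (Some(_), None) => if nulls_first { Ordering::Greater } else { Ordering::Less },
        (Some(x), Some(y)) => {
            let ord = x.cmp(&y);
            if desc { ord.reverse() } else { ord }
        }
    }
}

fn main() {
    // ASC: NULLs sort last by default.
    assert_eq!(compare_sort_values(None, Some(1), false, None), Ordering::Greater);
    // DESC: NULLs sort first by default.
    assert_eq!(compare_sort_values(None, Some(1), true, None), Ordering::Less);
    // Explicit NULLS FIRST overrides the ASC default.
    assert_eq!(compare_sort_values(None, Some(1), false, Some(true)), Ordering::Less);
}
```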
Error Propagation from sort_by
Rust’s sort_by closure cannot return Result. AxiomDB uses the sort_err
pattern: errors are captured in Option<DbError> during the sort and returned
after it completes.
#![allow(unused)]
fn main() {
let mut sort_err: Option<DbError> = None;
rows.sort_by(|a, b| {
match compare_rows_for_sort(a, b, order_items) {
Ok(ord) => ord,
Err(e) => { sort_err = Some(e); Equal }
}
});
if let Some(e) = sort_err { return Err(e); }
}
DISTINCT — Deduplication
SELECT DISTINCT is applied after projection and before LIMIT/OFFSET, using
a HashSet<Vec<u8>> keyed by value_to_key_bytes.
fn apply_distinct(rows: Vec<Row>) -> Vec<Row>:
seen = HashSet::new()
for row in rows:
key = concat(value_to_key_bytes(v) for v in row)
if seen.insert(key): // first occurrence
keep row
Two rows are identical if every column value serializes to the same bytes.
Critically, NULL → [0x00] means two NULLs are considered equal for
deduplication — only one row with a NULL in that position is kept. This is the
SQL standard behavior for DISTINCT, and is different from equality comparison
where NULL = NULL returns UNKNOWN.
LIMIT / OFFSET — Row-Count Coercion (Phase 4.10d)
apply_limit_offset runs after ORDER BY and DISTINCT. It calls
eval_row_count_as_usize for each row-count expression.
Row-count coercion contract
| Evaluated value | Result |
|---|---|
| `Int(n)` where n ≥ 0 | n as usize |
| `BigInt(n)` where n ≥ 0 | `usize::try_from(n)` — errors on overflow |
| `Text(s)` where `s.trim()` parses as an exact base-10 integer ≥ 0 | parsed value as usize |
| negative `Int` or `BigInt` | `DbError::TypeMismatch` |
| non-integral `Text` ("10.1", "1e3", "abc") | `DbError::TypeMismatch` |
| NULL, Bool, Real, Decimal, Date, Timestamp | `DbError::TypeMismatch` |
Text coercion is intentionally narrow: only exact base-10 integers are accepted. Scientific notation, decimal fractions, and time-like strings are all rejected.
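A simplified sketch of the coercion contract. It drops the `BigInt`/`Decimal` variants, uses string error payloads instead of `DbError`, and leans on `u64` parsing to reject fractions, scientific notation, and negative strings in one step:

```rust
/// Reduced Value enum for the sketch.
#[allow(dead_code)]
enum Value {
    Null,
    Int(i64),
    Text(String),
    Real(f64),
}

/// Sketch of the 4.10d row-count coercion contract.
fn eval_row_count_as_usize(v: &Value) -> Result<usize, String> {
    match v {
        Value::Int(n) if *n >= 0 => Ok(*n as usize),
        // Exact base-10 integers only: "10.1", "1e3", "abc", "-1" all fail u64 parsing.
        Value::Text(s) => s
            .trim()
            .parse::<u64>()
            .map(|n| n as usize)
            .map_err(|_| format!("TypeMismatch: {:?} is not a row count", s)),
        _ => Err("TypeMismatch: row count must be a non-negative integer".into()),
    }
}

fn main() {
    assert_eq!(eval_row_count_as_usize(&Value::Int(10)), Ok(10));
    assert_eq!(eval_row_count_as_usize(&Value::Text(" 2 ".into())), Ok(2));
    assert!(eval_row_count_as_usize(&Value::Text("10.1".into())).is_err());
    assert!(eval_row_count_as_usize(&Value::Int(-1)).is_err());
    assert!(eval_row_count_as_usize(&Value::Null).is_err());
}
```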
Why Text is accepted
The prepared-statement SQL-string substitution path serializes a Value::Text("2")
parameter as LIMIT '2' in the generated SQL. Without Text coercion, the fallback
path would always fail for string-bound LIMIT parameters — which is the binding
type used by some MariaDB clients. Accepting exact integer Text keeps the
cached-AST prepared path and the SQL-string fallback path on identical semantics.
Why not reuse the general coerce() function here?
coerce() uses assignment-coercion semantics and would change the
error class to InvalidCoercion, masking the semantic error.
eval_row_count_as_usize implements the narrower 4.10d contract
directly in the executor, keeping the error class and message family consistent
for both prepared paths.
INSERT … SELECT — MVCC Isolation
INSERT INTO target SELECT ... FROM source executes the SELECT phase under
the same snapshot as any other read in the transaction — fixed at BEGIN.
This prevents the “Halloween problem”: rows inserted by this INSERT have
txn_id_created = current_txn_id. The snapshot was taken before any insert
occurred, so snapshot_id ≤ current_txn_id. The MVCC visibility rule
(txn_id_created < snapshot_id) causes newly inserted rows to be invisible to
the SELECT scan. The result:
- If `source = target` (inserting from a table into itself): the SELECT sees exactly the rows that existed at `BEGIN`. The inserted copies are not re-scanned. No infinite loop.
- If another transaction inserts rows into `source` after this transaction’s `BEGIN`: those rows are also invisible (consistent snapshot).
Before BEGIN: source = {row1, row2}
After BEGIN: snapshot_id = 3 (max_committed = 2)
INSERT INTO source SELECT * FROM source:
SELECT sees: {row1 (xmin=1), row2 (xmin=2)} — both have xmin < snapshot_id ✅
Inserts: row3 (xmin=3), row4 (xmin=3) — xmin = current_txn_id = 3
SELECT does NOT see row3 or row4 (xmin ≮ snapshot_id) ✅
After COMMIT: source = {row1, row2, row3, row4} ← exactly 2 new rows, not infinite
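The walkthrough's visibility decision reduces to one comparison. This sketch ignores aborted transactions, deletes (xmax), and same-transaction visibility rules, modeling only the rule the example uses:

```rust
/// Visibility rule from the walkthrough: a row version is visible when its
/// creating transaction falls before the snapshot.
fn row_visible(xmin: u64, snapshot_id: u64) -> bool {
    xmin < snapshot_id
}

fn main() {
    let snapshot_id = 3; // fixed at BEGIN
    let current_txn = 3;
    // Pre-existing rows are visible to the INSERT ... SELECT scan.
    assert!(row_visible(1, snapshot_id));
    assert!(row_visible(2, snapshot_id));
    // Rows inserted by the statement itself (xmin = 3) are not re-scanned.
    assert!(!row_visible(current_txn, snapshot_id));
}
```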
Subquery Execution
Subquery execution is integrated into the expression evaluator via the
SubqueryRunner trait. This design lets the compiler eliminate all subquery
dispatch overhead from the non-subquery path.
SubqueryRunner Trait
#![allow(unused)]
fn main() {
pub trait SubqueryRunner {
fn eval_scalar(&mut self, subquery: &SelectStmt) -> Result<Value, DbError>;
fn eval_in(&mut self, subquery: &SelectStmt, needle: &Value) -> Result<Value, DbError>;
fn eval_exists(&mut self, subquery: &SelectStmt) -> Result<bool, DbError>;
}
}
All expression evaluation is dispatched through eval_with<R: SubqueryRunner>:
#![allow(unused)]
fn main() {
pub fn eval_with<R: SubqueryRunner>(
expr: &Expr,
row: &Row,
runner: &mut R,
) -> Result<Value, DbError>
}
Two concrete implementations exist:
| Implementation | Purpose |
|---|---|
| `NoSubquery` | Used for simple expressions with no subqueries. All three SubqueryRunner methods are `unreachable!()`. Monomorphization guarantees they are dead code. |
| `ExecSubqueryRunner<'a>` | Used when the query contains at least one subquery. Holds mutable references to storage, the transaction manager, and the outer row for correlated access. |
Taking SubqueryRunner as a generic trait parameter — rather than a runtime Option<&mut dyn FnMut> or a boolean flag — allows the compiler to generate two separate code paths: eval_with::<NoSubquery> and eval_with::<ExecSubqueryRunner>. In the NoSubquery path, every subquery branch is dead code and is eliminated by LLVM. A runtime option would add a pointer-width check plus a potential indirect call on every expression node evaluation, even for the 99% of expressions that have no subqueries.
Scalar Subquery Evaluation
ExecSubqueryRunner::eval_scalar executes the inner SelectStmt fully using
the existing execute_select path, then inspects the result:
eval_scalar(subquery):
result = execute_select(subquery, storage, txn)
match result.rows.len():
0 → Value::Null
1 → result.rows[0][0] // single column, single row
n > 1 → Err(CardinalityViolation { returned: n })
The inner SELECT is always run with a fresh output context. It inherits the outer transaction snapshot so it sees the same consistent view as the outer query.
IN Subquery Evaluation
eval_in materializes the subquery result into a HashSet<Value>, then applies
three-valued logic:
eval_in(subquery, needle):
rows = execute_select(subquery)
values: HashSet<Value> = rows.map(|r| r[0]).collect()
if values.contains(needle):
return Value::Bool(true)
if values.contains(Value::Null):
return Value::Null // unknown — could match
return Value::Bool(false)
For NOT IN, the calling code wraps the result: TRUE → FALSE, FALSE → TRUE,
NULL → NULL (NULL propagates unchanged).
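The three-valued IN logic, modeled with `Option<i64>` set members (None = NULL) and a non-NULL needle (NULL-needle handling is not covered by the pseudocode above):

```rust
/// Three-valued IN evaluation, following the pseudocode above.
fn eval_in(set: &[Option<i64>], needle: i64) -> Option<bool> {
    if set.contains(&Some(needle)) {
        return Some(true);
    }
    if set.contains(&None) {
        return None; // UNKNOWN: a NULL in the set could have matched
    }
    Some(false)
}

/// NOT IN wraps the result; NULL (None) propagates unchanged.
fn eval_not_in(set: &[Option<i64>], needle: i64) -> Option<bool> {
    eval_in(set, needle).map(|b| !b)
}

fn main() {
    assert_eq!(eval_in(&[Some(1), Some(2)], 2), Some(true));
    assert_eq!(eval_in(&[Some(1), None], 2), None); // unknown, not false
    assert_eq!(eval_in(&[Some(1)], 2), Some(false));
    assert_eq!(eval_not_in(&[Some(1), None], 2), None); // NULL propagates
}
```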
EXISTS Evaluation
eval_exists executes the subquery and checks whether the result set is non-empty.
No rows are materialized beyond the first:
eval_exists(subquery):
rows = execute_select(subquery)
return !rows.is_empty() // always bool, never null
Correlated Subqueries — substitute_outer
Before executing a correlated subquery, ExecSubqueryRunner walks the subquery
AST and replaces every Expr::OuterColumn { col_idx, depth: 1 } with a concrete
Expr::Literal(value) from the current outer row. This operation is called
substitute_outer:
substitute_outer(expr_tree, outer_row):
for each node in expr_tree:
if node = OuterColumn { col_idx, depth: 1 }:
replace with Literal(outer_row[col_idx])
if node = OuterColumn { col_idx, depth: d > 1 }:
decrement depth by 1 // pass through for deeper nesting
After substitution, the subquery is a fully self-contained statement with no
outer references, and it is executed by the standard execute_select path.
Re-execution happens once per outer row: for a correlated EXISTS in a query
that produces 10,000 outer rows, the inner query is executed 10,000 times.
For large datasets, rewriting as a JOIN is recommended.
Derived Table Execution
A derived table (FROM (SELECT ...) AS alias) is materialized once at the
start of query execution, before any scan or filter of the outer query begins:
execute_select(stmt):
for each TableRef::Derived { subquery, alias } in stmt.from:
materialized[alias] = execute_select(subquery) // fully materialized in memory
// outer query scans materialized[alias] as if it were a base table
The materialized result is an in-memory Vec<Row> wrapped in a
MaterializedTable. The outer query uses the derived table’s output schema
(column names from the inner SELECT list) for column resolution.
Derived tables are not correlated — they cannot reference columns from the outer
query. Lateral joins (which allow correlation in FROM) are not yet supported.
Foreign Key Enforcement
FK constraints are validated during DML operations by crates/axiomdb-sql/src/fk_enforcement.rs.
Catalog Storage
Each FK is stored as a FkDef row in the axiom_foreign_keys heap (5th system table,
root page at meta offset 84). Fields:
fk_id, child_table_id, child_col_idx, parent_table_id, parent_col_idx,
on_delete: FkAction, on_update: FkAction, fk_index_id: u32, name: String
FkAction encoding: 0=NoAction, 1=Restrict, 2=Cascade, 3=SetNull, 4=SetDefault.
fk_index_id != 0 → FK auto-index exists (composite key, Phase 6.9).
fk_index_id = 0 → no auto-index; enforcement falls back to full table scan.
FK auto-index — composite key (fk_val | RecordId) (Phase 6.9)
Each FK constraint auto-creates a B-Tree index on the child FK column using a composite key format that makes every entry globally unique:
key = encode_index_key(&[fk_val]) ++ encode_rid(rid) (10-byte RecordId suffix)
This follows InnoDB’s approach of appending the primary key as a tiebreaker
(row0row.cc). Every entry is unique even when many rows share the same FK value.
Range scan for all children with a given parent key:
lo = encode_index_key(&[parent_key]) ++ [0x00; 10]  // smallest RecordId
hi = encode_index_key(&[parent_key]) ++ [0xFF; 10]  // largest RecordId
children = BTree::range_in(fk_index_root, lo, hi)   // O(log n + k)
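Building those bounds is a plain byte-level operation, sketched here with an assumed helper name (`child_range_bounds`): the already-encoded parent key bytes get a 10-byte suffix padded to the extremes of the RecordId space.

```rust
/// Given the index-key encoding of a parent key, produce the [lo, hi]
/// composite-key bounds that cover every possible 10-byte RecordId suffix,
/// so one range scan returns all child rows referencing that parent.
fn child_range_bounds(parent_key_bytes: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut lo = parent_key_bytes.to_vec();
    lo.extend_from_slice(&[0x00; 10]); // smallest possible RecordId
    let mut hi = parent_key_bytes.to_vec();
    hi.extend_from_slice(&[0xFF; 10]); // largest possible RecordId
    (lo, hi)
}
```

Because the suffix length is fixed, `lo` and `hi` sort strictly around every real composite key for that parent value and never collide with a neighboring parent key's range.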
INSERT / UPDATE child — check_fk_child_insert
For each FK on the child table:
1. FK column is NULL → skip (MATCH SIMPLE)
2. Encode FK value as B-Tree key
3. Find parent's PK or UNIQUE index covering parent_col_idx
4. Bloom shortcut: if filter says absent → ForeignKeyViolation immediately
5. BTree::lookup_in(parent_index_root, key) — O(log n)
6. No match → ForeignKeyViolation (SQLSTATE 23503)
PK indexes are populated on every INSERT since Phase 6.9 (removed !is_primary
filter in insert_into_indexes). All index types now use B-Tree + Bloom lookup.
DELETE parent — enforce_fk_on_parent_delete
Called before the parent rows are deleted. For each FK referencing this table:
| Action | Behavior |
|---|---|
| RESTRICT / NO ACTION | BTree::range_in(fk_index) — O(log n); error if any child found |
| CASCADE | Range scan finds all children; recursive delete (depth limit = 10) |
| SET NULL | Range scan finds all children; updates FK column to NULL |
Cascade recursion uses depth parameter — exceeding 10 levels returns
ForeignKeyCascadeDepth (SQLSTATE 23503).
Query Planner Cost Gate (Phase 6.10)
Before returning IndexLookup or IndexRange, plan_select applies a cost gate
using per-column statistics to decide if the index scan is worth the overhead.
Algorithm
ndv = stats.ndv > 0 ? stats.ndv : DEFAULT_NUM_DISTINCT (= 200)
selectivity = 1.0 / ndv // equality predicate: 1/ndv rows match
if selectivity > 0.20:
return Scan // too many rows — full scan is cheaper
if stats.row_count < 1000:
return Scan // tiny table — index overhead not worth it
return IndexLookup / IndexRange // selective enough — use index
Constants derived from PostgreSQL:
- INDEX_SELECTIVITY_THRESHOLD = 0.20 (PG default: seq/random_page_cost = 0.25; AxiomDB is slightly more conservative for embedded storage)
- DEFAULT_NUM_DISTINCT = 200 (PG DEFAULT_NUM_DISTINCT in selfuncs.c)
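The cost gate is small enough to express directly. This sketch uses the constants from the text; the struct and enum shapes are assumptions, not the planner's real types:

```rust
const INDEX_SELECTIVITY_THRESHOLD: f64 = 0.20;
const DEFAULT_NUM_DISTINCT: u64 = 200;
const MIN_ROWS_FOR_INDEX: u64 = 1_000; // assumed name for the 1000-row floor

struct ColumnStats {
    ndv: u64,       // number of distinct values (0 = unknown)
    row_count: u64, // table cardinality at last ANALYZE
}

#[derive(PartialEq, Debug)]
enum AccessMethod {
    Scan,
    IndexLookup,
}

fn cost_gate(stats: &ColumnStats) -> AccessMethod {
    let ndv = if stats.ndv > 0 { stats.ndv } else { DEFAULT_NUM_DISTINCT };
    let selectivity = 1.0 / ndv as f64; // equality predicate: ~1/ndv rows match
    if selectivity > INDEX_SELECTIVITY_THRESHOLD {
        return AccessMethod::Scan; // too many rows — full scan is cheaper
    }
    if stats.row_count < MIN_ROWS_FOR_INDEX {
        return AccessMethod::Scan; // tiny table — index overhead not worth it
    }
    AccessMethod::IndexLookup
}
```

The two early returns encode the two failure modes separately: a low-NDV column fails on selectivity regardless of table size, and a tiny table fails on size regardless of selectivity.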
Stats are loaded once per SELECT
In execute_select_ctx, before calling plan_select:
let table_stats = CatalogReader::new(storage, snap)?.list_stats(table_id)?;
let access_method = plan_select(where_clause, indexes, columns, table_id,
                                &table_stats, &mut ctx.stats);
If table_stats is empty (pre-6.10 database or ANALYZE never run),
plan_select conservatively uses the index — never wrong, just possibly suboptimal.
Staleness (StaleStatsTracker)
StaleStatsTracker lives in SessionContext and tracks row changes per table:
INSERT / DELETE row → on_row_changed(table_id)
changes > 20% of baseline → mark stale
planner loads stats → set_baseline(table_id, row_count)
ANALYZE TABLE → mark_fresh(table_id)
When stale, the planner uses ndv = DEFAULT_NUM_DISTINCT = 200 regardless of
catalog stats, preventing stale low-NDV estimates from causing full scans on
high-selectivity columns.
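The tracker's bookkeeping can be sketched as two per-table counters. The internals here are assumed (the text specifies only the events and the 20% threshold):

```rust
use std::collections::HashMap;

/// Sketch of StaleStatsTracker: a table's stats are stale once row changes
/// since the last baseline exceed 20% of that baseline.
#[derive(Default)]
struct StaleStatsTracker {
    baseline: HashMap<u32, u64>, // table_id → row_count when stats were loaded
    changes: HashMap<u32, u64>,  // table_id → rows inserted/deleted since then
}

impl StaleStatsTracker {
    /// Called when the planner loads fresh stats for a table.
    fn set_baseline(&mut self, table_id: u32, row_count: u64) {
        self.baseline.insert(table_id, row_count);
        self.changes.insert(table_id, 0);
    }
    /// Called on every INSERT / DELETE of a row.
    fn on_row_changed(&mut self, table_id: u32) {
        *self.changes.entry(table_id).or_insert(0) += 1;
    }
    /// Called by ANALYZE TABLE.
    fn mark_fresh(&mut self, table_id: u32) {
        self.changes.insert(table_id, 0);
    }
    fn is_stale(&self, table_id: u32) -> bool {
        let base = *self.baseline.get(&table_id).unwrap_or(&0);
        let changed = *self.changes.get(&table_id).unwrap_or(&0);
        changed * 5 > base // changes > 20% of baseline, in integer arithmetic
    }
}
```

Using `changed * 5 > base` keeps the 20% comparison in integer arithmetic and treats an unknown baseline (0) as immediately stale after the first change.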
Bloom Filter — Index Lookup Shortcut
The executor holds a BloomRegistry (one per database connection) that maps
index_id → Bloom<Vec<u8>>. Before performing any B-Tree lookup for an index
equality predicate, the executor consults the filter:
// In execute_select_ctx — IndexLookup path
if !bloom.might_exist(index_def.index_id, &encoded_key) {
    // Definite absence: skip B-Tree entirely.
    return Ok(vec![]);
}
// False positive or true positive: proceed with B-Tree.
BTree::lookup_in(storage, index_def.root_page_id, &encoded_key)?
BloomRegistry API
pub struct BloomRegistry { /* per-index filters */ }
impl BloomRegistry {
    pub fn create(&mut self, index_id: u32, expected_items: usize);
    pub fn add(&mut self, index_id: u32, key: &[u8]);
    pub fn might_exist(&self, index_id: u32, key: &[u8]) -> bool;
    pub fn mark_dirty(&mut self, index_id: u32);
    pub fn remove(&mut self, index_id: u32);
}
might_exist returns true (conservative) for unknown index_ids — correct
behavior for indexes that existed before the current server session (no filter
was populated for them at startup).
DML Integration
Every DML handler in the execute_with_ctx path updates the registry:
| Handler | Bloom action |
|---|---|
| execute_insert_ctx | bloom.add(index_id, &key) after each B-Tree insert |
| execute_update_ctx | mark_dirty() for delete side (batch); add() for insert side |
| execute_delete_ctx | mark_dirty(index_id) once per index batch (5.19) |
| execute_create_index | create(index_id, n) then add() for every existing key |
| execute_drop_index | remove(index_id) |
Memory Budget
Each filter is sized at max(2 × expected_items, 1000) with a 1% FPR target
(~9.6 bits/key, 7 hash functions). For a 1M-row table with one secondary index:
2M × 9.6 bits ≈ 2.4 MB.
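The sizing arithmetic above can be checked with a small helper (an illustration of the stated rule, not an engine function):

```rust
/// Bytes needed for one filter under the stated sizing rule:
/// max(2 × expected_items, 1000) keys at ~9.6 bits per key (1% FPR target).
fn bloom_size_bytes(expected_items: usize) -> usize {
    let sized_items = (2 * expected_items).max(1000);
    let bits = (sized_items as f64 * 9.6).ceil() as usize;
    (bits + 7) / 8 // round bits up to whole bytes
}
```

For the 1M-row example: 2M keys × 9.6 bits = 19.2M bits ≈ 2.4 MB, matching the figure in the text.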
Dirty filters are rebuilt by ANALYZE TABLE (Phase 6.12),
mirroring PostgreSQL's lazy statistics-rebuild model.
IndexOnlyScan — Heap-Free Execution
When plan_select returns AccessMethod::IndexOnlyScan, the executor reads
all result values directly from the B-Tree key bytes, with only a lightweight
MVCC visibility check against the heap slot header.
This section applies to the heap executor path. Since 39.15, clustered tables
do not execute this path directly even if the planner initially detects a
covering opportunity. Clustered covering plans are normalized back to
clustered-aware lookup/range access, because clustered visibility lives in the
inline row header and clustered secondary indexes carry PK bookmarks instead of
stable heap RecordIds.
Clustered UPDATE (39.16)
Clustered tables no longer fall back to heap-era UPDATE logic. The executor now
routes explicit-PRIMARY KEY tables through clustered candidate discovery and
clustered rewrite primitives:
- discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
- capture the exact old clustered row image (RowHeader + full logical row bytes) before any mutation
- choose one of three clustered write paths:
  - same-key in-place rewrite via clustered_tree::update_in_place(...)
  - same-key relocation via clustered_tree::update_with_relocation(...)
  - PK change via delete_mark(old_pk) + insert(new_pk, ...)
- rewrite clustered secondary bookmark entries and register both index-insert and index-delete undo records so rollback can restore the old bookmark state
Clustered DELETE (39.17)
Clustered tables no longer fall back to heap-era DELETE logic either. The
executor now routes explicit-PRIMARY KEY tables through clustered candidate
discovery and clustered delete-mark primitives:
- discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
- decode the exact current clustered row image before any mutation
- enforce parent-side foreign-key restrictions before the first delete-mark
- call clustered_tree::delete_mark(...) for each matched primary key
- record EntryType::ClusteredDeleteMark with the exact old and new row images so rollback/savepoints restore the original header and payload bytes
- leave clustered secondary bookmark entries in place for deferred cleanup during later clustered VACUUM work
Clustered VACUUM (39.18)
Clustered tables now have their own executor-visible maintenance path too.
VACUUM table_name dispatches by table storage layout:
- compute oldest_safe_txn = max_committed + 1
- descend once to the leftmost clustered leaf
- walk the next_leaf chain and remove cells whose txn_id_deleted is safe
- free any overflow chain owned by each purged cell
- defragment the leaf when freeblock waste exceeds the page-local threshold
- scan each clustered secondary index, decode the PK bookmark from the physical secondary key, and keep only entries whose clustered row still exists physically after the leaf purge
- persist any secondary root rotation caused by bulk delete back into the catalog
Execution Path
IndexOnlyScan { index_def, lo, hi, n_key_cols, needed_key_positions }:
for (rid, key_bytes) in BTree::range_in(storage, index_def.root_page_id, lo, hi):
page_id = rid.page_id
slot_id = rid.slot_id
// MVCC: read only the 24-byte RowHeader — no full row decode.
visible = HeapChain::is_slot_visible(storage, page_id, slot_id, snap)
if !visible:
continue
// Extract column values from B-Tree key bytes (no heap page needed).
(decoded_cols, _) = decode_index_key(&key_bytes, n_key_cols)
// Project only the columns the query requested.
row = needed_key_positions.iter().map(|&p| decoded_cols[p].clone()).collect()
emit row
The 24-byte RowHeader contains txn_id_created, txn_id_deleted, and a
sequence number — enough for full MVCC visibility evaluation without loading
the row payload.
decode_index_key — Self-Delimiting Key Decoder
decode_index_key lives in key_encoding.rs and is the exact inverse of
encode_index_key. It uses type tags embedded in the key bytes to self-delimit
each value without needing an external schema:
| Tag byte | Type | Encoding |
|---|---|---|
| 0x00 | NULL | tag only, 0 payload bytes |
| 0x01 | Bool | tag + 1 byte (0 = false, 1 = true) |
| 0x02 | Int (positive, 1 B) | tag + 1 LE byte |
| 0x03 | Int (positive, 2 B) | tag + 2 LE bytes |
| 0x04 | Int (positive, 4 B) | tag + 4 LE bytes |
| 0x05 | Int (negative, 4 B) | tag + 4 LE bytes (i32) |
| 0x06 | BigInt (positive, 1 B) | tag + 1 byte |
| 0x07 | BigInt (positive, 4 B) | tag + 4 LE bytes |
| 0x08 | BigInt (positive, 8 B) | tag + 8 LE bytes |
| 0x09 | BigInt (negative, 8 B) | tag + 8 LE bytes (i64) |
| 0x0A | Real | tag + 8 LE bytes (f64 bits) |
| 0x0B | Text | tag + NUL-terminated UTF-8 (NUL = end marker) |
| 0x0C | Bytes | tag + NUL-escaped bytes (0x00 → [0x00, 0xFF], NUL terminator = [0x00, 0x00]) |
// Signature
pub fn decode_index_key(key: &[u8], n_cols: usize) -> (Vec<Value>, usize)
// Returns: (decoded column values, total bytes consumed)
The self-delimiting format means decode_index_key requires no column type
metadata — the tag bytes carry all necessary type information. This is the
same approach used by SQLite’s record format and RocksDB’s comparator-encoded
keys.
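A sketch of the tag-driven decode loop, covering only a subset of the tags from the table above (NULL, Bool, 4-byte Int, Text) with a simplified `Value` enum — enough to show why no external schema is needed:

```rust
/// Simplified value type for the sketch.
#[derive(PartialEq, Debug)]
enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Text(String),
}

/// Each tag byte tells the decoder exactly how many payload bytes follow,
/// so columns can be peeled off the key left to right with no metadata.
fn decode_index_key(key: &[u8], n_cols: usize) -> (Vec<Value>, usize) {
    let mut out = Vec::with_capacity(n_cols);
    let mut pos = 0;
    while out.len() < n_cols {
        let tag = key[pos];
        pos += 1;
        match tag {
            0x00 => out.push(Value::Null), // tag only, no payload
            0x01 => {
                out.push(Value::Bool(key[pos] == 1));
                pos += 1;
            }
            0x04 => {
                // positive Int, 4 LE bytes
                let mut buf = [0u8; 4];
                buf.copy_from_slice(&key[pos..pos + 4]);
                out.push(Value::Int(u32::from_le_bytes(buf) as i64));
                pos += 4;
            }
            0x0B => {
                // Text: NUL-terminated UTF-8
                let end = pos + key[pos..].iter().position(|&b| b == 0).unwrap();
                out.push(Value::Text(String::from_utf8(key[pos..end].to_vec()).unwrap()));
                pos = end + 1; // skip the NUL terminator
            }
            _ => unimplemented!("tag not covered in this sketch"),
        }
    }
    (out, pos)
}
```

The returned byte count is what lets callers strip the key columns and treat any remaining bytes (e.g. a RecordId suffix) separately.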
Full-Width Row Layout in IndexOnlyScan Output
IndexOnlyScan emits full-width rows — the same width as a heap row — with
index key column values placed at their table col_idx positions and NULL
everywhere else. This is required because downstream operators (WHERE
re-evaluation, projection, expression evaluator) all address columns by their
original table column index, not by SELECT output position.
table: (id INT [0], name TEXT [1], age INT [2], dept TEXT [3])
index: ON (age, dept) ← covers col_idx 2 and 3
IndexOnlyScan emits: [NULL, NULL, <age_val>, <dept_val>]
col0 col1 col2 col3
If the executor placed decoded values at positions 0, 1, ... instead, a
WHERE age > 25 re-evaluation would read col_idx=2 from a 2-element row and
panic with ColumnIndexOutOfBounds. The full-width layout eliminates this class
of error entirely.
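The scatter into a full-width row is a one-pass placement, sketched here with an assumed helper name (`scatter_full_width`) and a simplified value type:

```rust
#[derive(Clone, PartialEq, Debug)]
enum Value {
    Null,
    Int(i64),
}

/// Place each decoded index-key value at its table col_idx position;
/// every column the index does not cover stays NULL.
fn scatter_full_width(
    decoded: &[Value],
    key_col_idxs: &[usize], // table col_idx of each index key column, in key order
    table_width: usize,     // total column count of the base table
) -> Vec<Value> {
    let mut row = vec![Value::Null; table_width];
    for (v, &col_idx) in decoded.iter().zip(key_col_idxs) {
        row[col_idx] = v.clone();
    }
    row
}
```

With the document's example — an index on (age, dept) covering col_idx 2 and 3 of a 4-column table — the output row is `[NULL, NULL, age, dept]`, so a downstream `WHERE age > 25` re-evaluation reads col_idx 2 safely.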
execute_with_ctx — Required for IndexOnlyScan Selection
The planner selects IndexOnlyScan only when select_col_idxs (the set of
columns touched by the query) is a subset of the index’s key columns. The
select_col_idxs argument is supplied by execute_with_ctx; the simpler
execute entry-point passes an empty slice, so IndexOnlyScan is never selected
through it.
Test coverage for this path lives in
crates/axiomdb-sql/tests/integration_index_only.rs — functions prefixed
test_ctx_ use execute_with_ctx with real select_col_idxs and are the
only tests that exercise the IndexOnlyScan access method end-to-end.
Non-Unique Secondary Index Key Format
Non-unique secondary indexes append a 10-byte RecordId suffix to every
B-Tree key to guarantee uniqueness across all entries:
key = encode_index_key(col_vals) || encode_rid(rid)
^^^^^^^^^^^^^^
page_id (4 B) + slot_id (2 B) + seq (4 B) = 10 bytes
This prevents DuplicateKey errors when two rows share the same indexed value,
because the RecordId suffix always makes the full key distinct.
Lookup Bounds for Non-Unique Indexes
To find all rows matching a specific indexed value, the executor performs a
range scan using synthetic [lo, hi] bounds that span all possible RecordId
suffixes:
lo = encode_index_key(&[val]) ++ [0x00; 10]  // smallest RecordId
hi = encode_index_key(&[val]) ++ [0xFF; 10]  // largest RecordId
BTree::range_in(root, lo, hi)                // returns all entries for val
This mirrors InnoDB's non-unique secondary index format, which appends the
primary key as a tiebreaker (row0row.cc). AxiomDB uses the
RecordId (page_id + slot_id + sequence) instead of a separate
primary key column, keeping the suffix at a fixed 10 bytes regardless of the
table's key type — simpler to encode and guaranteed to be globally unique within
the storage engine's address space.
Performance Characteristics
| Operation | Time complexity | Notes |
|---|---|---|
| Table scan | O(n) | HeapChain linear traversal |
| Nested loop JOIN | O(n × m) | Both sides materialized before loop |
| Hash GROUP BY | O(n) | One pass; O(k) memory where k = distinct groups |
| Sorted GROUP BY | O(n) | One pass; O(1) accumulator memory per group |
| Sort ORDER BY | O(n log n) | sort_by (stable, in-memory) |
| DISTINCT | O(n) | One HashSet pass |
| LIMIT/OFFSET | O(1) after sort | skip(offset).take(limit) |
All operations are in-memory for Phase 4. External sort and hash spill for large datasets are planned for Phase 14 (vectorized execution).
AUTO_INCREMENT Execution
Per-Table Sequence State
Each table that has an AUTO_INCREMENT column maintains a sequence counter.
The counter is stored as a thread-local HashMap<String, i64> keyed by table
name, lazily initialized on the first INSERT:
auto_increment_next(table_name):
if table_name not in thread_local_map:
max_existing = MAX(id) from HeapChain scan, or 0 if table is empty
thread_local_map[table_name] = max_existing + 1
value = thread_local_map[table_name]
thread_local_map[table_name] += 1
return value
The MAX+1 lazy-init strategy means the sequence is always consistent with
existing data, even after rows are inserted by a previous session or after
a crash recovery.
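The lazy-init counter can be sketched directly with a thread-local map. The `max_existing` closure stands in for the HeapChain MAX(id) scan described above; the function names are the pseudocode's, not necessarily the engine's:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // table name → next AUTO_INCREMENT value, lazily initialized
    static AUTO_INC: RefCell<HashMap<String, i64>> = RefCell::new(HashMap::new());
}

/// First call for a table seeds the counter from MAX(id) + 1;
/// subsequent calls just advance it.
fn auto_increment_next(table: &str, max_existing: impl FnOnce() -> i64) -> i64 {
    AUTO_INC.with(|m| {
        let mut m = m.borrow_mut();
        let next = m
            .entry(table.to_string())
            .or_insert_with(|| max_existing() + 1);
        let value = *next;
        *next += 1;
        value
    })
}

/// TRUNCATE-style reset: dropping the entry forces MAX+1 re-init on next INSERT.
fn reset_sequence(table: &str) {
    AUTO_INC.with(|m| {
        m.borrow_mut().remove(table);
    });
}
```

Note that after the first call the `max_existing` closure is never invoked again — the cached counter is authoritative until the entry is removed.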
MySQL itself has used two counter strategies — InnoDB before 8.0 recomputed the
counter from MAX(id) at startup, while 8.0 persists it across restarts — and lazy MAX+1
initialization is compatible with either approach.
Explicit Value Bypass
When the INSERT column list includes the AUTO_INCREMENT column with a non-NULL value, the explicit value is used directly and the sequence counter is not advanced:
for each row to insert:
if auto_increment_col in provided_columns:
value = provided value // bypass — no counter update
else:
value = auto_increment_next(table_name)
session.last_insert_id = value // update only for generated IDs
LAST_INSERT_ID() is updated only when a value is auto-generated. Inserting
an explicit ID does not change the session’s last_insert_id.
Multi-Row INSERT
For INSERT INTO t VALUES (...), (...), ..., the executor calls
auto_increment_next once per row. last_insert_id is set to the value
generated for the first row before iterating through the rest:
ids = [auto_increment_next(t) for _ in rows]
session.last_insert_id = ids[0] // MySQL semantics
insert all rows with their respective ids
TRUNCATE — Sequence Reset
TRUNCATE TABLE t deletes all rows by scanning the HeapChain and marking
every visible row as deleted (same algorithm as DELETE FROM t without a
WHERE clause). After clearing the rows, it resets the sequence:
execute_truncate(table_name):
for row in HeapChain::scan_visible(table_name, snapshot):
storage.delete_row(row.record_id, txn_id)
thread_local_map.remove(table_name) // next insert re-initializes from MAX+1 = 1
return QueryResult::Affected { count: 0 }
Removing the entry from the map forces a MAX+1 re-initialization on the next
INSERT. Because the table is now empty, MAX = 0, so next = 1.
SHOW TABLES / SHOW COLUMNS
SHOW TABLES
SHOW TABLES [FROM schema] reads the catalog’s table registry and returns one
row per table. The output column is named Tables_in_<schema>:
execute_show_tables(schema):
tables = catalog.list_tables(schema)
column_name = "Tables_in_" + schema
return QueryResult::Rows { columns: [column_name], rows: [[t] for t in tables] }
SHOW COLUMNS / DESCRIBE
SHOW COLUMNS FROM t, DESCRIBE t, and DESC t are all dispatched to the
same handler. The executor reads the column definitions from the catalog and
constructs a fixed six-column result set:
execute_show_columns(table_name):
cols = catalog.get_table(table_name).columns
for col in cols:
Field = col.name
Type = col.data_type.to_sql_string()
Null = if col.nullable { "YES" } else { "NO" }
Key = if col.is_primary_key { "PRI" } else { "" }
Default = "NULL" // stub
Extra = if col.auto_increment { "auto_increment" } else { "" }
return six-column result set
The Key and Default fields are stubs: Key only reflects primary key
membership; composite keys, unique constraints, and foreign keys are not yet
surfaced. Default always shows "NULL" regardless of the declared default
expression. Full metadata exposure is planned for a later catalog enhancement.
ALTER TABLE Execution
ALTER TABLE dispatches to one of five handlers depending on the operation. Three of them (ADD COLUMN, DROP COLUMN, and MODIFY COLUMN) require rewriting every row in the table. The other two (RENAME COLUMN and RENAME TO) touch only the catalog.
Why Row Rewriting Is Needed
AxiomDB rows are stored as positional binary blobs. The null bitmap at the
start of each row has exactly ceil(column_count / 8) bytes — one bit per
column, in column-index order. Packed values follow immediately, with offsets
derived from the column types declared at write time.
Row layout (schema: id BIGINT, name TEXT, age INT):
null_bitmap (1 byte) [b0=id_null, b1=name_null, b2=age_null, ...]
id (8 bytes, LE i64) [only present if b0=0]
name (4-byte len + UTF-8 bytes) [only present if b1=0]
age (4 bytes, LE i32) [only present if b2=0]
When the column count changes, the null bitmap size changes and all subsequent offsets shift. A row written under the old schema cannot be decoded against the new schema — the null bitmap has the wrong number of bits, and value positions no longer align. Every row must therefore be rewritten to match the new layout.
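The layout for the example schema can be sketched as a concrete encoder (an illustration of the described format, with an assumed helper name, not the engine's `encode_row`):

```rust
/// Positional encoding for (id BIGINT, name TEXT, age INT):
/// a 1-byte null bitmap (bit i = column i is NULL), then each
/// non-NULL value packed in column order.
fn encode_row(id: Option<i64>, name: Option<&str>, age: Option<i32>) -> Vec<u8> {
    let mut bitmap = 0u8;
    if id.is_none() { bitmap |= 1 << 0; }
    if name.is_none() { bitmap |= 1 << 1; }
    if age.is_none() { bitmap |= 1 << 2; }
    let mut out = vec![bitmap];
    if let Some(v) = id {
        out.extend_from_slice(&v.to_le_bytes()); // 8 bytes, LE i64
    }
    if let Some(s) = name {
        out.extend_from_slice(&(s.len() as u32).to_le_bytes()); // 4-byte length prefix
        out.extend_from_slice(s.as_bytes());
    }
    if let Some(v) = age {
        out.extend_from_slice(&v.to_le_bytes()); // 4 bytes, LE i32
    }
    out
}
```

The encoder makes the ALTER TABLE problem concrete: a decoder for this 3-column layout hard-codes the bitmap width and the value order, so adding or dropping a column changes both and invalidates every previously written row.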
RENAME COLUMN does not change column positions or types — only the name entry
in the catalog changes. RENAME TO changes only the table name in the catalog.
Neither operation touches row data.
rewrite_rows Helper
ADD COLUMN, DROP COLUMN, and MODIFY COLUMN all use a shared rewrite_rows
dispatch. The implementation branches on storage format:
Heap tables:
rewrite_rows(table_name, old_schema, new_schema, transform_fn):
snapshot = txn.active_snapshot()
old_rows = HeapChain::scan_visible(table_name, snapshot)
for (record_id, old_row) in old_rows:
new_row = transform_fn(old_row)? // apply per-operation transformation
storage.delete_row(record_id, txn_id)
storage.insert_row(table_name, encode_row(new_row, new_schema), txn_id)
Clustered tables (rewrite_rows_clustered):
Clustered tables cannot use heap delete+reinsert because clustered_tree::insert
rejects duplicate primary keys even when the previous row is delete-marked. Instead,
each row is rewritten in place using update_with_relocation:
rewrite_rows_clustered(table_id, old_schema, new_schema, transform_fn):
snapshot = txn.active_snapshot()
rows = clustered_tree::range(table_id, Unbounded, Unbounded, snapshot)
for ClusteredRow { key, row_header, row_data } in rows:
old_row = decode_row(row_data, old_schema)
new_row = transform_fn(old_row)?
new_data = encode_row(new_row, new_schema)
txn.record_clustered_update(table_id, key, row_header+row_data, new_data)
new_root = clustered_tree::update_with_relocation(key, new_data)
if let Some(new_root_pid) = new_root {
catalog.set_root_page(table_id, new_root_pid)
}
update_with_relocation tries an in-place rewrite of the leaf slot. If the new
row is larger and the leaf page is full, it falls back to physical delete + reinsert
at the correct leaf position (no duplicate-key issue because the old entry is
physically removed before the new one is inserted).
The transform_fn is operation-specific and returns Result<Row, DbError> so
coercion failures abort the entire statement:
| Operation | transform_fn |
|---|---|
| ADD COLUMN | Append DEFAULT value (or NULL if no default) to the end of the row |
| DROP COLUMN | Remove the value at col_idx from the row vector |
| MODIFY COLUMN | Replace value at col_idx with coerce(value, new_type, Strict)? |
Ordering Constraint — Catalog Before vs. After Rewrite
The ordering of the catalog update relative to the row rewrite is not arbitrary. It is chosen so that a failure mid-rewrite leaves the database in a recoverable state:
ADD COLUMN — catalog update FIRST, then rewrite rows:
1. catalog.add_column(table_name, new_column_def)
2. rewrite_rows(old_schema → new_schema, append DEFAULT)
If the process crashes after step 1 but before step 2 completes, the catalog already reflects the new schema. The partially-rewritten rows are discarded by crash recovery (their transactions are uncommitted). On restart, the table is consistent: the new column exists in the catalog, and all rows either have been fully rewritten (if the transaction committed) or none have been (if it was rolled back).
DROP COLUMN — rewrite rows FIRST, then update catalog:
1. rewrite_rows(old_schema → new_schema, remove col at col_idx)
2. catalog.remove_column(table_name, col_idx)
If the process crashes after step 1 but before step 2, the rows have already been written in the new (narrower) layout but the catalog still shows the old schema. Recovery rolls back the uncommitted row rewrites and the catalog is never touched — the table is fully consistent under the old schema.
MODIFY COLUMN — rewrite rows FIRST (with strict coercion), then update catalog:
1. Guard: column not in secondary index (type change would break key encoding)
2. Guard: PK column cannot become nullable on clustered table
3. rewrite_rows(old_schema → new_schema, coerce(val, new_type, Strict)?)
4. catalog.delete_column(table_id, col_idx)
5. catalog.create_column(new_ColumnDef) // same col_idx, new type/nullable
If coercion fails for any row (e.g. TEXT → INT on a non-numeric value), the
error is returned immediately and no rows are changed. The statement is atomic:
either all rows are coerced successfully or none are.
The invariant is: the catalog always describes rows that can be decoded. Swapping the order for either operation would create a window where the catalog describes a schema that does not match the on-disk rows.
Session Cache Invalidation
The session holds a SchemaCache that maps table names to their column
definitions at the time the last query was prepared. After any ALTER TABLE
operation completes, the cache entry for the affected table is invalidated:
execute_alter_table(stmt):
// ... perform operation (catalog update + optional row rewrite) ...
session.schema_cache.invalidate(table_name)
This ensures that the next query against the altered table re-reads the catalog and sees the updated column list, rather than operating on a stale schema that may reference columns that no longer exist or omit newly added ones.
Index root invalidation on B+tree split
The SchemaCache also stores IndexDef.root_page_id for each index. When an
INSERT causes the B+tree root to split, insert_in allocates a new root page
and frees the old one. After this, the cached root_page_id points to a freed
page. If the cache is not invalidated, the next execute_insert_ctx call reads
IndexDef.root_page_id from the cache and passes it to BTree::lookup_in
(uniqueness check), causing a stale-pointer read on a freed page.
The fix: call ctx.invalidate_all() whenever any index root changes during
INSERT or DELETE index maintenance. This forces re-resolution from the catalog
(which always has the current root_page_id) on the next DML statement.
Since 5.19, DELETE and the old-key half of UPDATE no longer mutate indexes in
a per-row loop. They collect exact encoded keys per index, sort them, and call
delete_many_in(...) once per affected tree. The cache-invalidation rule still
matters, but the synchronization point moved:
- batch-delete old keys per index
- persist the final root once for that index
- update the in-memory current_indexes slice
- invalidate the session cache once after the statement
For UPDATE there is a second root-sync point: after the batch delete phase, the reinsertion half must start from the post-delete root, not from the stale root captured before the batch. Otherwise reinserting new keys after a root collapse would descend from a freed page.
// DELETE / UPDATE old-key batch
let updated = delete_many_from_indexes(...)?;
for (index_id, new_root) in updated {
    catalog.update_index_root(index_id, new_root)?;
    current_indexes[i].root_page_id = new_root;
}
// UPDATE new-key insert phase
let ins_updated = insert_into_indexes(&current_indexes, ...)?;
Stable-RID UPDATE Fast Path (5.20)
5.19 removed the old-key delete bottleneck, but UPDATE still paid the full
heap delete+insert path even when the new row could fit in the existing slot.
5.20 adds a second branch:
for each matched row:
old_row = ...
new_row = apply_set_assignments(old_row)
if encoded(new_row) fits in old slot:
rewrite tuple in place
rid stays identical
only maintain indexes whose logical key/predicate membership changed
else:
fallback to delete + insert
rid changes
treat affected indexes as before
The heap rewrite path is page-grouped. Rows that share a heap page are batched so
the executor reads the page once, rewrites all eligible slots, then writes the page
once. WAL records this branch as EntryType::UpdateInPlace, storing the old and new
tuple images for the same (page_id, slot_id).
This does not implement PostgreSQL HOT chains or forwarding pointers. The Phase 5 rule is narrower and cheaper to reason about: same-slot rewrite only, otherwise fall back to the existing delete+insert path.
Clustered UPDATE In-Place Zero-Alloc Fast Path (Phase 39.22)
fused_clustered_scan_patch in executor/update.rs implements a zero-allocation
UPDATE fast path for clustered tables when all SET columns are fixed-size.
Allocation audit
| Allocation | Before 39.22 | After 39.22 |
|---|---|---|
| cell.row_data.to_vec() (phase-1 offset scan) | 1× per matched row | ❌ eliminated |
| patched_data = ...clone() (phase-2 mutation) | 1× per matched row | ❌ eliminated |
| encode_cell_image() in overflow path | 1× per matched row | ✅ overflow-only |
| FieldDelta.old_bytes: Vec<u8> | 1× per changed field | ❌ → [u8;8] inline |
| FieldDelta.new_bytes: Vec<u8> | 1× per changed field | ❌ → [u8;8] inline |
For 25K rows with 1 changed column each: ~125K heap allocations → 0.
Two-phase borrow pattern
The Rust borrow checker requires releasing the immutable page borrow before taking a mutable one. The fast path uses a split-phase approach:
Read phase (immutable borrow on page):
1. cell_row_data_abs_off(&page, idx) → (row_data_abs_off, key_len)
2. compute_field_location_runtime(row_slice, bitmap) → FieldLocation
3. MAYBE_NOP: if page_bytes[field_abs..][..size] == new_encoded[..size] { skip }
4. Capture old_buf: [u8;8] and new_encoded: [u8;8] on the stack
Write phase (mutable borrow after immutable dropped):
5. patch_field_in_place(&mut page, field_abs, new_buf[..size])
6. update_row_header_in_place(&mut page, idx, &new_header)
MAYBE_NOP (byte-identity check)
If the new encoded bytes are byte-identical to the existing page bytes
(e.g., SET score = score * 1 after integer multiplication), the field is
skipped entirely — no WAL delta, no header bump, no page write for that field.
This is an O(size) byte comparison (~4–8 bytes) before any mutation.
Overflow fallback
Cells with overflow_first_page.is_some() are rare (<1% of typical workloads)
and fall back to the existing rewrite_cell_same_key_with_overflow path
(full cell re-encode). The fast path only applies to inline cells.
InnoDB's equivalent in-place update path (btr_cur_upd_rec_in_place) still allocates an undo record per row for ROLLBACK support. AxiomDB's UndoClusteredFieldPatch stores undo data as inline [u8;8] arrays in the undo log entry — zero heap allocation per row even with ROLLBACK support. For a 25K-row UPDATE t SET score = score + 1, this reduces allocator pressure from ~125K allocs to zero.
Strict Mode and Warning 1265
SessionContext.strict_mode is a bool flag (default true) that controls
how INSERT and UPDATE column coercion failures are handled.
Coercion paths
INSERT / UPDATE column value assignment:
if ctx.strict_mode:
coerce(value, target_type, CoercionMode::Strict)
→ Ok(v) : use v
→ Err(e) : return Err immediately (SQLSTATE 22018)
else:
coerce(value, target_type, CoercionMode::Strict)
→ Ok(v) : use v (no warning — strict succeeded)
→ Err(_) : try CoercionMode::Permissive
→ Ok(v) : use v, emit ctx.warn(1265, "Data truncated for column '<col>' at row <n>")
→ Err(e): return Err (both paths failed)
CoercionMode::Permissive performs best-effort conversion: '42abc' → 42,
'abc' → 0, overflowing integers clamped to the type bounds.
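The permissive conversions listed above can be sketched for a single target type. This is an illustration of the described behavior (assumed helper name, i32 target), not the engine's coercion module:

```rust
/// Best-effort string → i32: keep the leading signed-digit prefix
/// ('42abc' → 42), fall back to 0 when there are no usable digits
/// ('abc' → 0), and clamp overflow to the target type's bounds.
fn permissive_to_i32(s: &str) -> i32 {
    let trimmed = s.trim();
    // Byte length of the leading [sign]digits prefix, if any.
    let digit_end = trimmed
        .char_indices()
        .take_while(|&(i, c)| c.is_ascii_digit() || (i == 0 && (c == '-' || c == '+')))
        .map(|(i, c)| i + c.len_utf8())
        .last()
        .unwrap_or(0);
    match trimmed[..digit_end].parse::<i64>() {
        Ok(v) => v.clamp(i32::MIN as i64, i32::MAX as i64) as i32,
        Err(_) => 0, // no usable leading digits (or bare sign) → 0
    }
}
```

In the non-strict flow above, a value that reaches this path has already failed strict coercion, so any result it produces is accompanied by warning 1265.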
Row numbering
insert_row_with_ctx and insert_rows_batch_with_ctx accept an explicit
row_num: usize (1-based). The VALUES loop in execute_insert_ctx passes
row_idx + 1 from enumerate():
for (row_idx, value_exprs) in rows.into_iter().enumerate() {
    let values = eval_value_exprs(value_exprs, ...)?;
    engine.insert_row_with_ctx(&mut ctx, values, row_idx + 1)?;
}
This makes warning 1265 messages meaningful for multi-row inserts:
"Data truncated for column 'stock' at row 2".
SET strict_mode / SET sql_mode
The executor intercepts SET strict_mode and SET sql_mode in execute_set_ctx
(called from dispatch_ctx). It delegates to helpers from session.rs:
"strict_mode" => {
    let b = parse_boolish_setting(&raw)?;
    ctx.strict_mode = b;
}
"sql_mode" => {
    let normalized = normalize_sql_mode(&raw);
    ctx.strict_mode = sql_mode_is_strict(&normalized);
}
The wire layer (handler.rs) syncs the wire-visible @@sql_mode and
@@strict_mode variables with the session bool after every intercepted SET.
Both variables are surfaced in SHOW VARIABLES.
Lossless coercions (e.g. '42' → 42) never generate a warning in either mode —
matching MySQL 8's behavior, where warning 1265 is reserved for actual data
loss, not clean widening.
Roadmap and Phases
AxiomDB is developed in phases, each of which adds a coherent vertical slice of functionality. The design is organized in three blocks:
- Block 1 (Phases 1–7): Core engine — storage, indexing, WAL, transactions, SQL parsing, and concurrent MVCC.
- Block 2 (Phases 8–14): SQL completeness — full query planner, optimizer, advanced SQL features, and MySQL wire protocol.
- Block 3 (Phases 15–34): Production hardening — replication, backups, distributed execution, column store, and AI/ML integration.
Current Status
Last completed subphase: 40.1b CREATE INDEX on clustered tables — removed ensure_heap_runtime guard; CREATE INDEX / CREATE UNIQUE INDEX now work on clustered (PRIMARY KEY) tables using ClusteredSecondaryLayout-based index build with partial index, NULL-skipping, and uniqueness enforcement at build time.
Active development: Phase 40 — Clustered engine performance optimizations (40.1 ClusteredInsertBatch done; 40.1b CREATE INDEX on clustered tables done; statement plan cache, transaction write set, vectorized scan next)
Next milestone: 40.2 — Statement plan cache (per-session CachedPlanSource with OID-based invalidation)
Concurrency note: the current server already supports concurrent read-only
queries, but mutating statements are still serialized through a database-wide
Arc<RwLock<Database>> write guard. The next concurrency milestone is
Phase 13.7 row-level locking, followed by deadlock detection and explicit
locking clauses.
Phase Progress
Block 1 — Core Engine
| Phase | Name | Status | Key deliverables |
|---|---|---|---|
| 1.1 | Workspace setup | ✅ | Cargo workspace, crate structure |
| 1.2 | Page format | ✅ | 16 KB pages, header, CRC32c checksum |
| 1.3 | MmapStorage | ✅ | mmap-backed storage engine |
| 1.4 | MemoryStorage | ✅ | In-memory storage for tests |
| 1.5 | FreeList | ✅ | Bitmap page allocator |
| 1.6 | StorageEngine trait | ✅ | Unified interface + heap pages |
| 2.1 | B+ Tree insert/split | ✅ | CoW insert with recursive splits |
| 2.2 | B+ Tree delete | ✅ | Rebalance, redistribute, merge |
| 2.3 | B+ Tree range scan | ✅ | RangeIter with tree traversal |
| 2.4 | Prefix compression | ✅ | CompressedNode for internal keys |
| 3.1 | WAL entry format | ✅ | Binary format, CRC32c, backward scan |
| 3.2 | WAL writer | ✅ | WalWriter with file header |
| 3.3 | WAL reader | ✅ | Forward and backward iterators |
| 3.4 | TxnManager | ✅ | BEGIN/COMMIT/ROLLBACK, snapshot |
| 3.5 | Checkpoint | ✅ | 5-step checkpoint protocol |
| 3.6 | Crash recovery | ✅ | CRASHED→RECOVERING→REPLAYING→VERIFYING→READY |
| 3.7 | Durability tests | ✅ | 9 crash scenarios |
| 3.8 | Post-recovery checker | ✅ | Heap structural + MVCC invariants |
| 3.9 | Catalog bootstrap | ✅ | axiom_tables, axiom_columns, axiom_indexes |
| 3.10 | Catalog reader | ✅ | MVCC-aware schema lookup |
| 3.17 | WAL batch append | ✅ | record_insert_batch(): O(1) write_all for N entries via reserve_lsns+write_batch |
| 3.18 | WAL PageWrite | ✅ | EntryType::PageWrite=9: 1 WAL entry/page vs N/row; 238× fewer for 10K-row insert |
| 3.19 | WAL Group Commit | ✅ | CommitCoordinator: batches fsyncs across connections; up to 16× concurrent throughput |
| 4.1 | SQL AST | ✅ | All statement types |
| 4.2 | SQL lexer | ✅ | logos DFA, ~85 tokens, zero-copy |
| 4.3 | DDL parser | ✅ | CREATE/DROP/ALTER TABLE, CREATE/DROP INDEX |
| 4.4 | DML parser | ✅ | SELECT (all clauses), INSERT, UPDATE, DELETE |
| 4.17 | Expression evaluator | ✅ | Three-valued NULL logic, all operators |
| 4.18 | Semantic analyzer | ✅ | BindContext, col_idx resolution |
| 4.18b | Type coercion matrix | ✅ | coerce(), coerce_for_op(), CoercionMode strict/permissive |
| 4.23 | QueryResult type | ✅ | Row, ColumnMeta, QueryResult (Rows/Affected/Empty) |
| 4.5b | Table engine | ✅ | TableEngine scan/insert/delete/update over heap; later generalized by Phase 39 table-root metadata |
| 4.5 + 4.5a | Basic executor | ✅ | SELECT/INSERT/UPDATE/DELETE, DDL, txn control, SELECT without FROM |
| 4.25 + 4.7 | Error handling framework | ✅ | Complete SQLSTATE mapping; ErrorResponse{sqlstate,message,detail,hint} |
| 4.8 | JOIN (nested loop) | ✅ | INNER/LEFT/RIGHT/CROSS; USING; multi-table; FULL→NotImplemented |
| 4.9a+4.9c+4.9d | GROUP BY + Aggregates + HAVING | ✅ | COUNT/SUM/MIN/MAX/AVG; hash-based; HAVING; NULL grouping |
| 4.10+4.10b+4.10c | ORDER BY + LIMIT/OFFSET | ✅ | Multi-column; NULLS FIRST/LAST; LIMIT/OFFSET pagination |
| 4.12 | DISTINCT | ✅ | HashSet dedup on output rows; NULL=NULL; pre-LIMIT |
| 4.24 | CASE WHEN | ✅ | Searched + simple form; NULL semantics; all contexts |
| 4.6 | INSERT … SELECT | ✅ | Reuses execute_select; MVCC prevents self-reads |
| 6.1–6.3 | Secondary indexes + planner | ✅ | CREATE INDEX, index maintenance, B-Tree point/range lookup |
| 6.4 | Bloom filter per index | ✅ | BloomRegistry; zero B-Tree reads for definite-absent keys (1% FPR) |
| 6.5/6.6 | Foreign key constraints | ✅ | REFERENCES, ALTER TABLE FK; INSERT/DELETE/CASCADE/SET NULL enforcement |
| 6.7 | Partial UNIQUE index | ✅ | CREATE INDEX … WHERE predicate; soft-delete uniqueness pattern |
| 6.8 | Fill factor | ✅ | WITH (fillfactor=N) on CREATE INDEX; B-Tree leaf split at ⌈FF×ORDER_LEAF/100⌉ |
| 6.9 | FK + Index improvements | ✅ | PK B-Tree population; FK composite key index; composite index planner |
| 6.10–6.12 | Index statistics + ANALYZE | ✅ | Per-column NDV/row_count; planner cost gate (sel > 20% → Scan); ANALYZE command; staleness tracking |
| 6.16 | PK SELECT planner parity | ✅ | PRIMARY KEY equality/range now participate in single-table SELECT planning; PK equality bypasses the scan-biased cost gate |
| 6.17 | Indexed UPDATE candidate path | ✅ | UPDATE now discovers PK / indexed candidates through B-Tree access before entering the 5.20 write path |
| 6.18 | Indexed multi-row INSERT batch path | ✅ | Immediate multi-row VALUES statements now reuse grouped heap/index apply on indexed tables while preserving strict same-statement UNIQUE semantics |
| 6.19 | WAL fsync pipeline | 🔄 | Server commits now use an always-on leader-based fsync pipeline and the old timer-based CommitCoordinator path is gone, but the single-connection insert_autocommit benchmark still misses target throughput |
| 6.20 | UPDATE apply fast path | ✅ | PK-range UPDATE now batches candidate heap reads, skips no-op rows, batches UpdateInPlace WAL writes, and groups per-index delete+insert/root persistence |
| 5 | Executor (advanced) | ⚠️ Planned | JOIN, GROUP BY, ORDER BY, index lookup, aggregate |
| 6.8+ | Index statistics, FK improvements | ⚠️ Planned | Fill factor, composite FKs, ON UPDATE CASCADE, ANALYZE, index-only scans |
| 7 | Full MVCC | ⚠️ Planned | SSI, write-write conflicts, epoch reclamation |
Block 2 — SQL Completeness
| Phase | Name | Status | Key deliverables |
|---|---|---|---|
| 8 | Advanced SQL | ⚠️ Planned | Window functions, CTEs, recursive queries |
| 9 | VACUUM / GC | ⚠️ Planned | Dead row cleanup, freelist compaction |
| 10 | MySQL wire protocol | ⚠️ Planned | COM_QUERY, result set packets, handshake |
| 11 | TOAST | ⚠️ Planned | Out-of-line storage for large values |
| 12 | Full-text search | ⚠️ Planned | Inverted index, BM25 ranking |
| 13 | Foreign key checks | ⚠️ Planned | Constraint validation on insert/delete |
| 14 | Vectorized execution | ⚠️ Planned | SIMD scans, morsel-driven pipeline |
Block 3 — Production Hardening
| Phase | Name | Status |
|---|---|---|
| 15 | Connection pooling | ⚠️ Planned |
| 16 | Replication (primary-replica) | ⚠️ Planned |
| 17 | Point-in-time recovery (PITR) | ⚠️ Planned |
| 18 | Online DDL | ⚠️ Planned |
| 19 | Partitioning | ⚠️ Planned |
| 20 | Column store (HTAP) | ⚠️ Planned |
| 21 | VECTOR index (ANN) | ⚠️ Planned |
| 22–34 | Distributed, cloud-native, AI/ML | ⚠️ Future |
Block 4 — Platform Surfaces and Storage Evolution
| Phase | Name | Status | Key deliverables |
|---|---|---|---|
| 35 | Deployment and DevEx | ⚠️ Planned | Docker, config tooling, release UX |
| 36 | AxiomQL Core | ⚠️ Planned | Alternative read query language over the same AST/executor |
| 37 | AxiomQL Write + DDL + Control | ⚠️ Planned | AxiomQL DML, DDL, control flow, maintenance |
| 38 | AxiomDB-Wasm | ⚠️ Planned | Browser runtime, OPFS backend, sync, live queries |
| 39 | Clustered index storage engine | 🔄 In progress | Inline PK rows, clustered internal/leaf pages, PK bookmarks in secondary indexes, logical clustered WAL/rollback, clustered crash recovery, clustered-aware CREATE TABLE |
Completed Phases — Summary
Phase 1 — Storage Engine
A generic storage layer with two implementations: MmapStorage for production disk
use and MemoryStorage for tests. Every higher-level component uses only the
StorageEngine trait — storage is pluggable. Pages are 16 KB with a 64-byte header
(magic, page type, CRC32c checksum, page_id, LSN, free pointers). Heap pages use a
slotted format: slots grow from the start, tuples grow from the end toward the center.
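The slotted layout can be sketched as follows. The page and header sizes match the description above, but the struct, slot width, and method names are illustrative rather than the real on-disk format:

```rust
// Minimal slotted heap page: fixed header, slot entries growing from the
// front, tuple bytes growing from the back toward the center.

const PAGE_SIZE: usize = 16 * 1024;
const HEADER_SIZE: usize = 64;
const SLOT_SIZE: usize = 4; // u16 offset + u16 length (illustrative)

struct HeapPage {
    buf: [u8; PAGE_SIZE],
    slot_count: u16,
    free_end: usize, // tuples grow downward from here
}

impl HeapPage {
    fn new() -> Self {
        Self { buf: [0; PAGE_SIZE], slot_count: 0, free_end: PAGE_SIZE }
    }

    fn free_space(&self) -> usize {
        self.free_end - (HEADER_SIZE + self.slot_count as usize * SLOT_SIZE)
    }

    /// Append a tuple at the back, record its (offset, len) in the next slot.
    fn insert(&mut self, tuple: &[u8]) -> Option<u16> {
        if self.free_space() < tuple.len() + SLOT_SIZE {
            return None; // page full
        }
        self.free_end -= tuple.len();
        self.buf[self.free_end..self.free_end + tuple.len()].copy_from_slice(tuple);
        let off = HEADER_SIZE + self.slot_count as usize * SLOT_SIZE;
        self.buf[off..off + 2].copy_from_slice(&(self.free_end as u16).to_le_bytes());
        self.buf[off + 2..off + 4].copy_from_slice(&(tuple.len() as u16).to_le_bytes());
        self.slot_count += 1;
        Some(self.slot_count - 1)
    }

    fn get(&self, slot: u16) -> &[u8] {
        let off = HEADER_SIZE + slot as usize * SLOT_SIZE;
        let pos = u16::from_le_bytes([self.buf[off], self.buf[off + 1]]) as usize;
        let len = u16::from_le_bytes([self.buf[off + 2], self.buf[off + 3]]) as usize;
        &self.buf[pos..pos + len]
    }
}

fn main() {
    let mut page = HeapPage::new();
    let s0 = page.insert(b"alice").unwrap();
    let s1 = page.insert(b"bob").unwrap();
    assert_eq!(page.get(s0), b"alice");
    assert_eq!(page.get(s1), b"bob");
}
```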
Phase 2 — B+ Tree CoW
A persistent, Copy-on-Write B+ Tree over StorageEngine. Keys up to 64 bytes;
ORDER_INTERNAL = 223, ORDER_LEAF = 217 (derived to fill exactly one 16 KB page).
Root is an AtomicU64 — readers are lock-free by design. Supports insert (with
recursive split), delete (with rebalance/redistribute/merge), and range scan via
RangeIter. Prefix compression for internal nodes in memory.
Phase 3 — WAL and Transactions ✅ 100% complete
Append-only Write-Ahead Log with binary entries, CRC32c checksums, and forward/backward
scan iterators. TxnManager coordinates BEGIN/COMMIT/ROLLBACK with snapshot assignment.
Five-step checkpoint protocol. Crash recovery state machine (five states). Catalog
bootstrap creates the three system tables on first open. CatalogReader provides
MVCC-consistent schema reads. Nine crash scenario tests with a post-recovery integrity
checker.
Phase 3 late additions (3.17–3.19):
- 3.17 WAL batch append — `record_insert_batch()` uses `WalWriter::reserve_lsns(N)` + `write_batch()` to write N Insert WAL entries in a single `write_all` call. Reduces BufWriter overhead from O(N rows) to O(1) for bulk inserts.
- 3.18 WAL PageWrite — `EntryType::PageWrite = 9`. One WAL entry per affected heap page instead of one per row. `new_value` holds the post-modification page bytes (16 KB) plus embedded slot IDs for crash recovery undo. For a 10K-row bulk insert: 42 WAL entries instead of 10,000 — 238× fewer serializations and a 30% smaller WAL file.
- 3.19 WAL Group Commit — `CommitCoordinator` batches DML commits from concurrent connections. DML commits write to the WAL BufWriter, register with the coordinator, and release the Database lock before awaiting fsync confirmation. A background Tokio task performs one `flush` + `fsync` per batch window (`group_commit_interval_ms`), then notifies all waiting connections. Enables near-linear concurrent write scaling.
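Stripped of the async machinery, the group-commit idea can be sketched as below. The real `CommitCoordinator` coordinates Tokio tasks across connections; this toy single-threaded version just shows that one fsync durably covers every commit LSN in the batch window:

```rust
// Sketch: commits register their LSN; one flush+fsync per batch window
// makes every registered LSN durable at once. Names are illustrative.

struct Coordinator {
    durable_lsn: u64,
    pending: Vec<u64>, // commit LSNs waiting for fsync
    fsync_count: u64,
}

impl Coordinator {
    fn new() -> Self {
        Self { durable_lsn: 0, pending: Vec::new(), fsync_count: 0 }
    }

    /// A connection registers its commit LSN, then would await notification.
    fn register(&mut self, commit_lsn: u64) {
        self.pending.push(commit_lsn);
    }

    /// Background task: one fsync covers the whole batch window.
    fn flush_batch(&mut self) {
        if let Some(&max) = self.pending.iter().max() {
            self.fsync_count += 1; // one flush + fdatasync regardless of batch size
            self.durable_lsn = self.durable_lsn.max(max);
            self.pending.clear(); // all waiters notified: their LSNs are durable
        }
    }

    fn is_durable(&self, lsn: u64) -> bool {
        lsn <= self.durable_lsn
    }
}

fn main() {
    let mut gc = Coordinator::new();
    for lsn in [101, 102, 103, 104] {
        gc.register(lsn); // four concurrent connections commit in one window
    }
    gc.flush_batch();
    assert_eq!(gc.fsync_count, 1); // 4 commits, 1 fsync
    assert!(gc.is_durable(104));
}
```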
Phase 4 — SQL Processing
SQL AST covering all DML (SELECT, INSERT, UPDATE, DELETE) and DDL (CREATE/DROP/ALTER
TABLE, CREATE/DROP INDEX). logos-based lexer with ~85 tokens, case-insensitive keywords,
zero-copy identifiers. Recursive descent parser with full expression precedence. Expression
evaluator with three-valued NULL logic (AND, OR, NOT, IS NULL, BETWEEN, LIKE, IN).
Semantic analyzer with BindContext, qualified/unqualified column resolution, ambiguity
detection, and subquery support. Row codec with null bitmap, u24 string lengths, and
O(n) encoded_len().
Near-Term Priorities
Phase 13 — Row-Level Writer Concurrency
The current implementation uses Arc<tokio::sync::RwLock<Database>>: reads can
overlap, but mutating statements are still serialized at whole-database scope.
Phase 13.7 removes that bottleneck with row-level locking. Phase 13.8 adds
deadlock detection, and 13.8b adds SELECT ... FOR UPDATE, NOWAIT, and
SKIP LOCKED.
Phase 5
Phase 5 is now complete. The last close was:
- 5.15 DSN parsing — AxiomDB-owned surfaces now accept typed DSNs: `AXIOMDB_URL` for server bootstrap plus `Db::open_dsn`, `AsyncDb::open_dsn`, and `axiomdb_open_dsn` for embedded mode. `mysql://` and `postgres://` are parse aliases only; the server still speaks MySQL wire only, and embedded mode still accepts only local-path DSNs.
Phase 5 also closed the recent runtime/perf subphases:
- 5.11c Explicit connection state machine — the MySQL server now has an explicit `CONNECTED → AUTH → IDLE → EXECUTING → CLOSING` transport lifecycle with a fixed auth timeout, `wait_timeout` vs `interactive_timeout` behavior, `net_write_timeout` for packet writes, and socket keepalive configured separately from SQL session state.
- 5.19a Executor decomposition — the SQL executor now lives in a responsibility-based `executor/` module tree instead of one monolithic file, which lowers the cost of later DML and planner work.
- 5.19 B+Tree batch delete — DELETE WHERE and the old-key half of UPDATE now stage exact encoded keys per index and remove them with one ordered `delete_many_in(...)` pass per tree instead of one `delete_in(...)` traversal per row.
- 5.19b Eval decomposition — the expression evaluator now lives under a responsibility-based `eval/` module tree with the same public API, which lowers the cost of future built-in and collation work without changing SQL behavior.
- 5.20 Stable-RID UPDATE fast path — UPDATE can now rewrite rows in the same heap slot when the new encoded row fits, preserve the `RecordId`, and skip unnecessary index maintenance for indexes whose logical key membership is unchanged.
- 5.21 Transactional INSERT staging — explicit transactions now buffer consecutive `INSERT ... VALUES` statements per table and flush them together on `COMMIT` or the next barrier statement, preserving savepoint semantics by flushing before the next statement savepoint whenever the batch cannot continue.
Phase 6 closing note — Integrity and recovery
Phase 6 now closes with startup index integrity verification:
- every catalog-visible index is compared against heap-visible rows after WAL recovery
- readable divergence is repaired automatically from heap contents
- unreadable index trees fail open with `IndexIntegrityFailure`
SQL REINDEX remains deferred to the later diagnostics / administration phases.
Phase 6 closing note — Indexed multi-row INSERT on indexed tables
Phase 6 also closes the remaining immediate multi-row VALUES debt on indexed tables:
- shared batch-apply helpers are now reused by both the `5.21` staging flushes and the immediate `INSERT ... VALUES (...), (... )` path
- PRIMARY KEY and secondary indexes no longer force a per-row fallback for multi-row VALUES statements
- same-statement UNIQUE detection remains strict because the immediate path does not reuse the staged `committed_empty` shortcut
- Index range scan — range predicate via `RangeIter`.
- Projection — evaluate SELECT expressions over rows from the scan.
- Filter — apply WHERE expression using the evaluator from Phase 4.17.
- Nested loop join — INNER JOIN, LEFT JOIN.
- Sort — ORDER BY with NULLS FIRST/LAST.
- Limit/Offset — LIMIT n OFFSET m.
- Hash aggregate — GROUP BY with COUNT, SUM, AVG, MIN, MAX.
- INSERT / UPDATE / DELETE — write path with WAL integration.
The executor will be a simple volcano-model interpreter in Phase 5. Vectorized execution (morsel-driven, SIMD) is planned for Phase 14.
AxiomQL — Alternative Query Language (Phases 36-37)
AxiomDB will support two query languages sharing one AST and executor:
SQL stays as the primary language with full wire protocol compatibility. Every ORM, client, and tool works without changes.
AxiomQL is an optional method-chain alternative designed to be learned in
minutes by any developer who already uses .filter().sort().take() in JavaScript,
Python, Rust, or C#:
users
.filter(active, age > 18)
.join(orders)
.group(country, total: count())
.sort(total.desc)
.take(10)
Both languages compile to the same Stmt AST — zero executor overhead, every SQL
feature automatically available in AxiomQL. Planned after Phase 8 (wire protocol).
| Phase | Scope |
|---|---|
| 36 | AxiomQL parser: SELECT, filter, join, group, subqueries, let bindings |
| 37 | AxiomQL write + DDL: insert, update, delete, create, transaction, proc |
Benchmarks
All benchmarks run on Apple M2 Pro (12 cores), 32 GB RAM, NVMe SSD, single-threaded, warm data (all pages in OS page cache unless noted). Criterion.rs is used for all micro-benchmarks; each measurement is the mean of at least 100 samples.
Reference values for MySQL 8 and PostgreSQL 15 are measured in-process (no network), without WAL for pure codec/parser operations. Operations that include WAL (INSERT, UPDATE) are directly comparable.
SQL Parser
| Benchmark | AxiomDB | sqlparser-rs | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|---|
| Simple SELECT (1 table) | 492 ns | 4.8 µs | ~500 ns | ~450 ns | ✅ parity with PG |
| Complex SELECT (multi-JOIN) | 2.7 µs | 46 µs | ~4.0 µs | ~3.5 µs | ✅ 1.3× faster than PG |
| CREATE TABLE | 1.1 µs | 14.5 µs | ~2.5 µs | ~2.0 µs | ✅ 1.8× faster than PG |
| Batch (100 statements) | 47 µs | — | ~90 µs | ~75 µs | ✅ 1.6× faster than PG |
vs sqlparser-rs: 9.8× faster on simple SELECT, 17× faster on complex SELECT.
The speed advantage comes from two decisions:
- logos DFA lexer — compiles token patterns to a Deterministic Finite Automaton at build time. Scanning runs in O(n) time with 1–3 CPU instructions per byte.
- Zero-copy tokens — `Ident` tokens are `&'src str` slices into the original input. No heap allocation occurs during lexing or AST construction.
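A hand-rolled illustration of the zero-copy idea — this is a stand-in for the logos-generated DFA, not the real lexer:

```rust
// Identifier tokens borrow &str slices from the input instead of allocating.

#[derive(Debug, PartialEq)]
enum Token<'src> {
    Select,
    Ident(&'src str), // borrowed slice — no heap allocation
}

fn lex(input: &str) -> Vec<Token<'_>> {
    input
        .split_whitespace()
        .map(|word| {
            if word.eq_ignore_ascii_case("SELECT") {
                Token::Select // case-insensitive keyword
            } else {
                Token::Ident(word)
            }
        })
        .collect()
}

fn main() {
    let sql = "select user_id";
    let tokens = lex(sql);
    assert_eq!(tokens, vec![Token::Select, Token::Ident("user_id")]);
}
```

The real lexer scans byte-by-byte with a build-time DFA rather than splitting on whitespace, but the ownership story is the same: token lifetimes are tied to the input string, so lexing allocates nothing.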
B+ Tree Index
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| Point lookup (1M rows) | 1.2M ops/s | ~830K ops/s | ~1.1M ops/s | 800K ops/s | 600K ops/s | ✅ |
| Range scan 10K rows | 0.61 ms | ~8 ms | ~5 ms | 45 ms | 60 ms | ✅ |
| Insert (sequential keys) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |
| Sequential scan 1M rows | 0.72 s | ~0.8 s | ~0.5 s | 0.8 s | 1.2 s | ✅ |
| Concurrent reads ×16 | linear | ~2× degradation | ~1.5× degradation | linear | <2× degradation | ✅ |
Why point lookup is fast: the CoW B+ Tree root is an AtomicU64. Readers load it
with Acquire and traverse 3–4 levels of 16 KB pages that are already in the OS page
cache. No mutex, no RWLock.
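A minimal sketch of that reader pattern (not the real tree code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// The root page id lives in an AtomicU64. Readers Acquire-load it and
// traverse an immutable tree version; a CoW writer installs a new root
// with a Release store after all new pages are written.

struct Tree {
    root: AtomicU64,
}

impl Tree {
    fn read_root(&self) -> u64 {
        // No mutex, no RwLock: one atomic load pins a consistent tree version.
        self.root.load(Ordering::Acquire)
    }

    fn publish_new_root(&self, new_root_page: u64) {
        // Writer side: CoW pages are durable first, then the root flips.
        self.root.store(new_root_page, Ordering::Release)
    }
}

fn main() {
    let tree = Arc::new(Tree { root: AtomicU64::new(7) });
    let reader = Arc::clone(&tree);
    assert_eq!(reader.read_root(), 7);
    tree.publish_new_root(42); // a CoW split produced a new root page
    assert_eq!(reader.read_root(), 42);
}
```

The Release store pairs with the Acquire load: any reader that sees the new root id is guaranteed to see the page contents written before it was published.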
Why range scan is very fast: RangeIter re-traverses from the root to locate
each successive leaf after exhausting the current one. With CoW, next_leaf pointers
cannot be maintained consistently (a split copies the leaf, leaving the previous leaf’s
pointer stale). Tree retraversal costs O(log n) per leaf boundary crossing — at 3–4
levels deep this is 3–5 page reads, all already in the OS page cache for sequential
workloads. The deferred next_leaf fast path (Phase 7) will reduce this to O(1) per
boundary once epoch-based reclamation is available.
SELECT ... WHERE pk = literal after 6.16
Phase 6.16 fixes the planner gap that still prevented single-table SELECT
from using the PRIMARY KEY B+Tree. The executor already supported IndexLookup
and IndexRange; the missing piece was planner eligibility plus a forced path
for PK equality.
Measured with:
python3 benches/comparison/local_bench.py --scenario select_pk --rows 5000 --table
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| `SELECT * FROM bench_users WHERE id = literal` | 12.7K lookups/s | 13.4K lookups/s | 11.1K lookups/s |
This closes the old “full scan on PK lookup” debt. The remaining gap is no longer planner-side; it is now in SQL/wire overhead after the PK B+Tree path is already active.
Row Codec
| Benchmark | Throughput | Notes |
|---|---|---|
| `encode_row` | 33M rows/s | 5-column mixed-type row |
| `decode_row` | 28M rows/s | Same layout |
| `encoded_len` | O(n), no alloc | Size computation without buffer allocation |
The codec encodes a null bitmap (1 bit per column, packed into bytes) followed by the column payloads in declaration order. Variable-length types use a 3-byte (u24) length prefix. Fixed-size types (integers, floats, DATE, TIMESTAMP, UUID) have no length prefix.
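A simplified sketch of this layout, reduced to string-or-null columns (the real codec covers all column types and the fixed-size no-prefix cases):

```rust
// Null bitmap (1 bit per column, packed), then payloads in declaration
// order; variable-length values get a 3-byte (u24) little-endian prefix.

fn encode_row(cols: &[Option<&str>]) -> Vec<u8> {
    let mut out = Vec::new();
    // Null bitmap: bit i set ⇒ column i is NULL.
    let mut bitmap = vec![0u8; (cols.len() + 7) / 8];
    for (i, c) in cols.iter().enumerate() {
        if c.is_none() {
            bitmap[i / 8] |= 1 << (i % 8);
        }
    }
    out.extend_from_slice(&bitmap);
    // Payloads: u24 length prefix + bytes for each non-null column.
    for c in cols.iter().flatten() {
        let len = c.len() as u32;
        out.extend_from_slice(&len.to_le_bytes()[..3]); // u24 prefix
        out.extend_from_slice(c.as_bytes());
    }
    out
}

/// Size without building the buffer — the encoded_len() idea.
fn encoded_len(cols: &[Option<&str>]) -> usize {
    (cols.len() + 7) / 8 + cols.iter().flatten().map(|c| 3 + c.len()).sum::<usize>()
}

fn main() {
    let row = [Some("alice"), None, Some("admin")];
    let bytes = encode_row(&row);
    assert_eq!(bytes.len(), encoded_len(&row));
    assert_eq!(bytes[0], 0b0000_0010); // only column 1 is NULL
}
```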
Expression Evaluator
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|
| Expr eval over 1K rows | 14.8M rows/s | ~8M rows/s | ~6M rows/s | ✅ 1.9× faster than MySQL |
The evaluator is a recursive interpreter over the Expr enum. Speed comes from
inlining the hot path (column reads, arithmetic, comparisons) and from the fact
that col_idx is resolved once by the semantic analyzer — no name lookup at eval time.
Performance Budget
The following thresholds are enforced before any phase is closed. A result below the “Max acceptable” column is a blocker.
| Operation | AxiomDB | Target | Max acceptable | Phase measured |
|---|---|---|---|---|
| Point lookup PK | 1.2M ops/s ✅ | 800K ops/s | 600K ops/s | 2 |
| Range scan 10K rows | 0.61 ms ✅ | 45 ms | 60 ms | 2 |
| B+ Tree INSERT (storage only) | 195K ops/s ✅ | 180K ops/s | 150K ops/s | 3 |
| INSERT end-to-end 10K batch (SchemaCache) | 36K ops/s ⚠️ | 180K ops/s | 150K ops/s | 4.16b |
| SELECT via wire protocol (autocommit) | 185 q/s ✅ | — | — | 5.14 |
| INSERT via wire protocol (autocommit) | 58 q/s | — | — | 5.14 |
| Sequential scan 1M rows | 0.72 s ✅ | 0.8 s | 1.2 s | 2 |
| Concurrent reads ×16 | linear ✅ | linear | <2× degradation | 2 |
| Parser — simple SELECT | 492 ns ✅ | 600 ns | 1 µs | 4 |
| Parser — complex SELECT | 2.7 µs ✅ | 3 µs | 6 µs | 4 |
| Row codec encode | 33M rows/s ✅ | — | — | 4 |
| Expr eval (scan 1K rows) | 14.8M rows/s ✅ | — | — | 4 |
Executor end-to-end (Phase 4.16b, MmapStorage + real WAL, full pipeline)
Measured with cargo bench --bench executor_e2e -p axiomdb-sql (Apple M2 Pro, NVMe,
release build). Pipeline: parse → analyze → execute → WAL → MmapStorage.
| Configuration | AxiomDB | Target (Phase 8) | Notes |
|---|---|---|---|
| INSERT 100 rows / 1 txn (no SchemaCache) | 2.8K ops/s | — | cold path, catalog scan |
| INSERT 1K rows / 1 txn (no SchemaCache) | 18.5K ops/s | — | amortization starts |
| INSERT 1K rows / 1 txn (SchemaCache) | 20.6K ops/s | — | +8% vs no cache |
| INSERT 10K rows / 1 txn (SchemaCache) | 36K ops/s | 180K ops/s | ⚠️ WAL bottleneck |
| INSERT autocommit (1 fsync/row) | 58 q/s | — | 1 fdatasync per statement (wire protocol, Phase 5.14) |
Root cause — WAL record_insert() dominates: each row write costs ~20 µs inside
record_insert() even without fsync. Parse + analyze cost per INSERT is ~1.5 µs total;
SchemaCache eliminates catalog heap scans but only improves throughput by 8% because WAL
overhead is already the dominant term. The 180K ops/s target is a Phase 8 goal: prepared
statements skip parse and analyze entirely, and a batch insert API will write one WAL entry
per batch rather than one per row.
Each inserted row currently produces its own WAL entry via record_insert(). This makes recovery straightforward — each row is an
independent, self-contained undo/redo unit — but costs ~20 µs/row at the WAL layer
regardless of fsync. The 36K ops/s ceiling at 10K batch size is a direct consequence of
this design. PostgreSQL and MySQL both offer bulk-load paths (COPY, LOAD DATA) that bypass
per-row WAL overhead; AxiomDB's equivalent is the Phase 8 batch insert API, which will
coalesce WAL entries and write them in a single sequential append.
B+ Tree storage-only INSERT (no SQL parsing, no WAL):
| Operation | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| B+Tree INSERT (storage only) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |
The storage layer itself exceeds the 180K ops/s target. The gap between 195K (storage only) and 36K (full pipeline) isolates the overhead to the WAL record path, not the B+ Tree or the page allocator.
Run end-to-end benchmarks:
cargo bench --bench executor_e2e -p axiomdb-sql
# MySQL + PostgreSQL comparison (requires Docker):
./benches/comparison/setup.sh
python3 benches/comparison/bench_runner.py --rows 10000
./benches/comparison/teardown.sh
Phase 5.14 — Wire Protocol Throughput
Measured via the MySQL wire protocol (pymysql client, autocommit mode, 1 connection, localhost, Apple M2 Pro, NVMe SSD).
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Notes |
|---|---|---|---|---|
| COM_PING | 24,865/s | ~30K/s | ~25K/s | Pure protocol, no SQL engine |
| SET NAMES (intercepted) | 46,672/s | ~20K/s | — | Handled in protocol layer |
| SELECT 1 (autocommit) | 185 q/s | ~5K–15K q/s* | ~5K–12K q/s* | Full pipeline, read-only |
| INSERT (autocommit, 1 fsync/stmt) | 58 q/s | ~130–200 q/s* | ~100–160 q/s* | Full pipeline + fsync |
*MySQL/PostgreSQL figures are in-process estimates without network latency overhead. AxiomDB throughput measured over localhost with real round-trips; the gap reflects the current single-threaded autocommit path and will improve with Phase 5.13 plan cache and Phase 8 batch API.
Phase 5.14 fix — read-only WAL fsync eliminated:
Prior to Phase 5.14, every autocommit transaction called fdatasync on WAL commit,
including read-only queries such as SELECT. This cost 10–20 ms per SELECT, capping
throughput at ~56 q/s.
The fix: skip fdatasync (and the WAL flush) when the transaction has no DML operations
(undo_ops.is_empty()). Read-only transactions still flush buffered writes to the OS
(BufWriter::flush) so that concurrent readers see committed state, but they do not
wait for the fdatasync round-trip to persistent storage.
Before / after:
| Query | Before (5.13) | After (5.14) | Improvement |
|---|---|---|---|
| SELECT 1 (autocommit) | ~56 q/s | 185 q/s | 3.3× |
| INSERT (autocommit) | ~58 q/s | 58 q/s | no change (fsync required) |
A commit is acknowledged only after fdatasync. For DML transactions
this is correct — data must reach persistent storage before the client receives OK.
For read-only transactions there is nothing to persist: the transaction produced no WAL
records. Skipping fdatasync for undo_ops.is_empty() transactions
is therefore safe: crash recovery cannot lose data that was never written. PostgreSQL applies
the same principle — read-only transactions in PostgreSQL do not touch the WAL at all.
The OS-level flush (BufWriter::flush) is kept so that any WAL bytes written by
a concurrent writer are visible to the OS before the SELECT returns, preserving read-after-write
consistency within the same process.
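The decision reduces to one branch. This sketch mirrors the `undo_ops` check described above, with the actual I/O calls simulated as enum variants:

```rust
// Flush buffered WAL bytes always; pay for fdatasync only when the
// transaction actually produced WAL records.

struct Txn {
    undo_ops: Vec<String>, // empty ⇒ read-only transaction
}

#[derive(Debug, PartialEq)]
enum CommitIo {
    FlushOnly,     // BufWriter::flush — OS-visible, no durability wait
    FlushAndFsync, // flush + fdatasync — durable before the OK packet
}

fn commit_io(txn: &Txn) -> CommitIo {
    if txn.undo_ops.is_empty() {
        // Nothing was written: crash recovery cannot lose what never existed.
        CommitIo::FlushOnly
    } else {
        CommitIo::FlushAndFsync
    }
}

fn main() {
    let select = Txn { undo_ops: vec![] };
    let insert = Txn { undo_ops: vec!["Insert{page: 3, slot: 1}".into()] };
    assert_eq!(commit_io(&select), CommitIo::FlushOnly);
    assert_eq!(commit_io(&insert), CommitIo::FlushAndFsync);
}
```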
Bottleneck analysis:
- SELECT 185 q/s: each `COM_QUERY` runs a full parse + analyze cycle (~1.5 µs) plus one wire protocol round-trip (~40 µs on localhost). The dominant cost is the round-trip. For prepared statements (`COM_STMT_EXECUTE`), the Phase 5.13 plan cache eliminates the parse/analyze step entirely — the cached AST is reused and only a ~1 µs parameter substitution pass runs before execution. The remaining bottleneck for higher throughput is WAL transaction overhead per statement (BEGIN/COMMIT I/O); this will be addressed by Phase 6 indexed reads (eliminating full-table scans) and the Phase 8 batch API.
- INSERT 58 q/s: one `fdatasync` per autocommit statement is required for durability.
Phase 5.21 — Transactional INSERT staging
Measured with python3 benches/comparison/local_bench.py --scenario insert --rows 50000 --table
against a release AxiomDB server and local MariaDB/MySQL instances on the same
machine. Workload: 50,000 separate one-row INSERT statements inside one
explicit transaction.
| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
| `insert` (single-row INSERTs in 1 txn) | 28.0K rows/s | 26.7K rows/s | 23.9K rows/s | one BEGIN, 50K INSERT statements, one COMMIT |
What changed in 5.21:
- the session now buffers consecutive eligible `INSERT ... VALUES` rows for the same table instead of writing heap/WAL immediately
- barriers such as `SELECT`, `UPDATE`, `DELETE`, DDL, `COMMIT`, a table switch, or ineligible INSERT shapes force a flush
- the flush uses `insert_rows_batch_with_ctx(...)` plus grouped post-heap index maintenance, persisting each changed index root once per flush
This resembles PostgreSQL's `heap_multi_insert()` and DuckDB's appender, but keeps SQL semantics intact by flushing before the next statement savepoint whenever the batch cannot continue.
This is deliberately not the same as autocommit group commit. The benchmark
already uses one explicit transaction, so 5.21 attacks per-statement heap/WAL/index
work rather than fsync batching across multiple commits.
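The staging rule can be sketched as a small per-session state machine; the statement shapes and names below are simplified for illustration:

```rust
// Consecutive eligible single-table INSERTs buffer in the session;
// any barrier statement (or a table switch) flushes the batch first.

enum Stmt {
    InsertValues { table: String, row: Vec<i64> },
    Other(String), // SELECT / UPDATE / DDL / COMMIT — all barriers here
}

#[derive(Default)]
struct Session {
    staged_table: Option<String>,
    staged_rows: Vec<Vec<i64>>,
    flushes: Vec<usize>, // rows per batch flush, for inspection
}

impl Session {
    fn flush(&mut self) {
        if !self.staged_rows.is_empty() {
            // Real engine: insert_rows_batch_with_ctx + grouped index maintenance.
            self.flushes.push(self.staged_rows.len());
            self.staged_rows.clear();
            self.staged_table = None;
        }
    }

    fn execute(&mut self, stmt: Stmt) {
        match stmt {
            Stmt::InsertValues { table, row } => {
                if self.staged_table.as_deref() != Some(table.as_str()) {
                    self.flush(); // table switch is a barrier too
                    self.staged_table = Some(table);
                }
                self.staged_rows.push(row);
            }
            Stmt::Other(_) => self.flush(), // barrier: flush before running it
        }
    }
}

fn main() {
    let mut s = Session::default();
    for i in 0..3 {
        s.execute(Stmt::InsertValues { table: "users".into(), row: vec![i] });
    }
    s.execute(Stmt::Other("COMMIT".into()));
    assert_eq!(s.flushes, vec![3]); // three INSERTs became one batched flush
}
```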
Phase 6.19 — WAL fsync pipeline
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_autocommit --rows 1000 --table --engines axiomdb
Workload: one INSERT per transaction over the MySQL wire.
| Benchmark | AxiomDB | Target | Status |
|---|---|---|---|
| `insert_autocommit` | 224 ops/s | >= 5,000 ops/s | ❌ |
What changed in 6.19:
- the old timer-based `CommitCoordinator` and its config knobs were removed
- server DML commits now hand deferred durability to an always-on leader-based `FsyncPipeline`
- queued followers can piggyback on a leader fsync when their `commit_lsn` is already covered
What the benchmark taught us:
- the implementation is correct and wire-visible semantics remain intact
- but the target workload is sequential request/response autocommit
- the handler still waits for durability before it sends `OK`
- therefore the next statement cannot arrive while the current fsync is in flight, so single-connection piggyback never materializes
6.19 is closed as an implementation subphase, but this benchmark remains a
documented performance gap rather than a solved target.
Phase 6.18 — Indexed multi-row INSERT batch path
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_multi_values --rows 5000 --table
Workload: multi-row INSERT ... VALUES (...), (... ) statements against the
benchmark schema with PRIMARY KEY (id).
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| `insert_multi_values` on PK table | 160,581 rows/s | 259,854 rows/s | 321,002 rows/s |
What changed in 6.18:
- the immediate multi-row `VALUES` path no longer checks `secondary_indexes.is_empty()` before using grouped heap writes
- grouped heap/index apply was extracted into shared helpers reused by both:
  - the transactional staging flush from `5.21`
  - the immediate `INSERT ... VALUES (...), (... )` path
- the immediate path keeps strict UNIQUE semantics by not reusing the staged `committed_empty` shortcut, because same-statement duplicate keys must still fail without leaking partial rows
PostgreSQL's `heap_multi_insert()` and DuckDB's appender both separate row staging from physical write. AxiomDB borrows the grouped physical apply idea, but rejects a blind bulk-load shortcut on the immediate path: duplicate keys inside one SQL statement must still be rejected before any partial batch becomes visible.
Phase 6.20 — UPDATE apply fast path
Measured with python3 benches/comparison/local_bench.py --scenario update_range --rows 5000 --table
against a release AxiomDB server and local MariaDB/MySQL instances on the same
machine. Workload: UPDATE bench_users SET score = score + 1 WHERE id BETWEEN ...
on a PK-indexed table.
| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
| `update_range` | 618K rows/s | 291K rows/s | 369.9K rows/s | PK range UPDATE now stays on a batched read/apply path end-to-end |
What changed in 6.20:
- `IndexLookup`/`IndexRange` candidate rows are fetched through `read_rows_batch(...)` instead of one heap read per RID
- no-op UPDATE rows are filtered before heap/index mutation
- stable-RID rows batch the `UpdateInPlace` WAL append with `reserve_lsns(...) + write_batch(...)`
- UPDATE index maintenance now uses grouped delete+insert with one root persistence write per affected index
- both ctx and non-ctx UPDATE paths share a statement-level index bailout
This closes the dominant apply-side debt left behind after 6.17. The benchmark
improves by 4.3× over the 6.17 result (85.2K rows/s) and now beats the
documented local MySQL result on the same workload.
Phase 5.19 / 5.20 — DELETE WHERE and UPDATE Write Paths
Measured with python3 benches/comparison/local_bench.py --scenario all --rows 50000 --table
on the same Apple M2 Pro machine. The benchmark uses the MySQL wire protocol and a
bench_users table with PRIMARY KEY (id).
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 |
|---|---|---|---|---|
| `DELETE WHERE id > 25000` | 652K rows/s | 662K rows/s | 1.13M rows/s | 3.76M rows/s |
| `UPDATE ... WHERE active = TRUE` | 662K rows/s | 404K rows/s | 648K rows/s | 270K rows/s |
5.19 removed the old per-row delete_in(...) loop by batching exact encoded keys
per index through delete_many_in(...). 5.20 finished the UPDATE recovery by
preserving the original RID whenever the rewritten row still fits in the same slot.
For UPDATE, the before/after delta is the important signal:
- Post-`5.19` / pre-`5.20`: 52.9K rows/s
- Post-`5.20`: 648K rows/s
That is a ~12.2× improvement on the same workload.
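The 5.20 stable-RID decision reduces to a single size check per row. The sketch below uses hypothetical names (`choose_update_path`, `slot_capacity`); the real engine's slot layout differs, but the branch is the crux: an in-place rewrite keeps the RID and spares every secondary index.

```rust
/// Where a rewritten row can land. Names are illustrative.
#[derive(Debug, PartialEq)]
enum UpdatePath {
    InPlace,          // same RID: secondary index entries stay valid
    DeleteThenInsert, // new RID: every index entry must be rewritten
}

/// Preserve the original RID whenever the rewritten row still fits
/// in the slot it already occupies.
fn choose_update_path(slot_capacity: usize, new_row_len: usize) -> UpdatePath {
    if new_row_len <= slot_capacity {
        UpdatePath::InPlace
    } else {
        UpdatePath::DeleteThenInsert
    }
}

fn main() {
    // A widened row that still fits keeps its RID and skips index churn.
    assert_eq!(choose_update_path(128, 96), UpdatePath::InPlace);
    // A row that outgrows its slot must move, invalidating its RID.
    assert_eq!(choose_update_path(128, 200), UpdatePath::DeleteThenInsert);
}
```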
Phase 5.13 — Prepared Statement Plan Cache
Phase 5.13 introduces an AST-level plan cache for prepared statements. The full parse +
analyze pipeline runs once at COM_STMT_PREPARE time; each subsequent
COM_STMT_EXECUTE performs only a tree walk to substitute parameter values (~1 µs)
and then calls execute_stmt() directly.
| Path | Parse + Analyze | Param substitution | Total SQL overhead |
|---|---|---|---|
| COM_QUERY (text protocol) | ~1.5 µs per call | — | ~1.5 µs |
| COM_STMT_EXECUTE before 5.13 | ~1.5 µs per call (re-parse) | string replace | ~1.5 µs |
| COM_STMT_EXECUTE after 5.13 | 0 (cached) | ~1 µs AST walk | ~1 µs |
The ~0.5 µs saving per execute is meaningful for high-frequency statement patterns (e.g., ORM-generated queries that re-execute the same SELECT or INSERT with different parameters on every request).
Remaining bottleneck: the dominant cost per COM_STMT_EXECUTE is now the WAL
transaction overhead (BEGIN/COMMIT I/O) rather than parse/analyze. For read-only
prepared statements, Phase 6 indexed reads will eliminate full-table scans, reducing
the per-query execution cost. For write statements, the Phase 8 batch API will coalesce
WAL entries, targeting the 180K ops/s budget.
The cache stores the fully analyzed `Stmt` (AST with resolved column indices)
rather than the original SQL string. This means each execute avoids both lexing and
semantic analysis, not just parsing. The trade-off is that the cached AST must be
cloned before parameter substitution to avoid mutating shared state — a shallow clone
of the expression tree is ~200 ns, well below the ~1.5 µs that parse + analyze would
cost. MySQL and PostgreSQL cache parsed + planned query trees for the same reason.
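A minimal sketch of the clone-then-substitute pattern, assuming toy `Expr`/`Stmt` types — the real AST is far richer, and the ~200 ns clone figure above applies to it, not to this toy:

```rust
use std::collections::HashMap;

/// Toy stand-ins for the analyzed AST. Real AxiomDB types differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Expr {
    Param(u16), // placeholder to be filled at execute time
    Int(i64),   // literal after substitution
    Col(usize), // resolved column index (unused in this tiny demo)
}

#[derive(Clone)]
struct Stmt {
    filter: Expr,
}

/// Cache keyed by statement id, as COM_STMT_PREPARE would assign.
struct PlanCache {
    plans: HashMap<u32, Stmt>,
}

impl PlanCache {
    fn prepare(&mut self, id: u32, stmt: Stmt) {
        self.plans.insert(id, stmt);
    }

    /// Clone the cached AST, then substitute parameters in the clone
    /// so the shared plan is never mutated.
    fn execute(&self, id: u32, params: &[i64]) -> Option<Stmt> {
        let mut stmt = self.plans.get(&id)?.clone();
        substitute(&mut stmt.filter, params);
        Some(stmt)
    }
}

fn substitute(e: &mut Expr, params: &[i64]) {
    if let Expr::Param(i) = *e {
        *e = Expr::Int(params[i as usize]);
    }
}

fn main() {
    let mut cache = PlanCache { plans: HashMap::new() };
    // COM_STMT_PREPARE: analyze once, cache the resolved AST.
    cache.prepare(1, Stmt { filter: Expr::Param(0) });
    // COM_STMT_EXECUTE: clone + substitute, no re-parse.
    let plan = cache.execute(1, &[42]).unwrap();
    assert_eq!(plan.filter, Expr::Int(42));
}
```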
Running Benchmarks Locally
```shell
# B+ Tree
cargo bench --bench btree -p axiomdb-index

# Storage engine
cargo bench --bench storage -p axiomdb-storage

# SQL parser
cargo bench --bench parser -p axiomdb-sql

# All benchmarks
cargo bench --workspace

# Compare before/after a change
cargo bench -- --save-baseline before
# ... make change ...
cargo bench -- --baseline before

# Detailed comparison with critcmp
cargo install critcmp
critcmp before after
```
Benchmarks use Criterion.rs and emit JSON results to target/criterion/. Each
run reports mean, standard deviation, min, max, and throughput (ops/s or bytes/s
depending on the benchmark).
Design Decisions
This page documents the most consequential architectural choices made during AxiomDB’s design. Each entry explains the alternatives considered, the reasoning, and the trade-offs accepted.
Query Languages
SQL + AxiomQL dual-language strategy
| Aspect | Decision |
|---|---|
| Chosen | Two query languages sharing one AST and executor |
| Alternatives | SQL only; AxiomQL only; SQL-to-AxiomQL transpiler |
| Phase | Phase 12+ (post wire protocol) |
SQL is the primary language. Full MySQL/PostgreSQL wire protocol compatibility. All ORMs, clients, and tools work without changes. Nothing breaks for anyone.
AxiomQL is an optional alternative — a method-chain query language for developers who prefer modern, readable syntax. It compiles to the same Stmt AST as SQL, so there is zero executor overhead and every SQL feature is automatically available in AxiomQL.
SQL ──────┐
├──► AST ──► Optimizer ──► Executor
AxiomQL ───┘
AxiomQL syntax reads top-to-bottom in the logical order of execution:
users
.filter(active, age > 18)
.join(orders)
.group(country, total: count())
.sort(total.desc)
.take(10)
This is already familiar to any developer who uses .filter().map().sort() in JavaScript, Python, Rust, or C#. The learning curve is ~10 minutes.
Why not SQL-only: SQL’s evaluation order (SELECT before FROM, HAVING separate from WHERE) is a 50-year-old quirk that confuses new users. AxiomQL removes the confusion without removing SQL.
Why not AxiomQL-only: Breaking compatibility with every MySQL client, ORM, and tool in existence would be unacceptable. SQL stays.
No existing database has this combination: ORMs like ActiveRecord and Eloquent are application-layer libraries, not native DB languages. PRQL compiles to SQL externally. EdgeQL is native but a different syntax family. AxiomQL would be the first native method-chain language that coexists with SQL in the same engine.
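A toy Rust builder shows the shape of "two front-ends, one plan": both a method chain and a SQL parser could lower into the same plan value. All names here are illustrative, not AxiomQL's actual compiler.

```rust
/// The shared logical plan both front-ends would lower into.
#[derive(Debug, Default, PartialEq)]
struct Plan {
    table: String,
    filters: Vec<String>,
    limit: Option<usize>,
}

/// Method-chain front-end: each call records one clause.
struct Query(Plan);

impl Query {
    fn table(name: &str) -> Self {
        Query(Plan { table: name.into(), ..Default::default() })
    }
    fn filter(mut self, pred: &str) -> Self {
        self.0.filters.push(pred.into());
        self
    }
    fn take(mut self, n: usize) -> Self {
        self.0.limit = Some(n);
        self
    }
    fn build(self) -> Plan {
        self.0
    }
}

fn main() {
    // users.filter(age > 18).take(10) and
    // SELECT * FROM users WHERE age > 18 LIMIT 10
    // would both lower to this same Plan.
    let p = Query::table("users").filter("age > 18").take(10).build();
    assert_eq!(p.table, "users");
    assert_eq!(p.limit, Some(10));
}
```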
Storage
mmap over a Custom Buffer Pool
| Aspect | Decision |
|---|---|
| Chosen | memmap2::MmapMut — OS-managed page cache |
| Alternatives | Custom buffer pool (like InnoDB), io_uring direct I/O |
| Phase | Phase 1 (Storage Engine) |
Why mmap:
- The OS page cache provides LRU eviction, readahead prefetching, and dirty page write-back for free. Implementing these correctly in user space takes months of engineering work.
- Pages returned by `read_page()` are `&Page` references directly into the mapped memory — zero copy from kernel to application.
- MySQL InnoDB maintains a separate buffer pool on top of the OS page cache. The same physical page lives in RAM twice (once in the kernel page cache, once in the buffer pool). mmap eliminates the second copy.
- `msync(MS_SYNC)` provides the same durability guarantee as `fsync` for WAL and checkpoint flushes.
Trade-offs accepted:
- No fine-grained control over eviction policy (OS uses LRU; a custom pool could use clock-sweep with hot/cold zones).
- On 32-bit systems, mmap is limited by the address space. Not a concern for a modern 64-bit server database.
- mmap I/O errors manifest as `SIGBUS` rather than `Err(...)`. These are handled with a signal handler that converts `SIGBUS` to `DbError::Io`.
16 KB Page Size
| Aspect | Decision |
|---|---|
| Chosen | 16,384 bytes (16 KB) |
| Alternatives | 4 KB (SQLite), 8 KB (PostgreSQL), 8 KB (original db.md spec) |
| Phase | Phase 1 |
Why 16 KB:
- The B+ Tree ORDER constants (ORDER_INTERNAL = 223, ORDER_LEAF = 217) yield a highly efficient fan-out with 16 KB pages. At 4 KB, the order would be ~54 for internal nodes — requiring 4× more page reads for the same number of keys.
- At 16 KB, a tree covering 1 billion rows has depth 4. At 4 KB, depth 5 (25% more I/O for every lookup).
- OS readahead typically prefetches 128–512 KB, making 16 KB the sweet spot: small enough that random access is not wasteful, large enough for sequential workloads.
- 64-byte header leaves 16,320 bytes for the body — a natural fit for the `bytemuck::Pod` structs that avoid alignment issues.
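The depth claim can be checked with a small capacity calculation, using the leaf and internal orders quoted above (217 and 223):

```rust
/// Minimum B+ Tree depth needed to address `rows` keys, given a leaf
/// holding `leaf_order` rows and internal nodes fanning out by
/// `internal_order`.
fn min_depth(leaf_order: u64, internal_order: u64, rows: u64) -> u32 {
    let mut depth = 1u32;
    let mut capacity = leaf_order as u128;
    while capacity < rows as u128 {
        capacity *= internal_order as u128; // add one internal level
        depth += 1;
    }
    depth
}

fn main() {
    // At 16 KB pages (ORDER_LEAF = 217, ORDER_INTERNAL = 223),
    // a billion-row tree needs only depth 4:
    // 217 * 223^3 ≈ 2.4 billion addressable rows.
    assert_eq!(min_depth(217, 223, 1_000_000_000), 4);
}
```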
Indexing
Copy-on-Write B+ Tree
| Aspect | Decision |
|---|---|
| Chosen | CoW B+ Tree with AtomicU64 root swap |
| Alternatives | Traditional B+ Tree with read-write locks; LSM-tree (like RocksDB); Fractal tree |
| Phase | Phase 2 (B+ Tree) |
Why CoW B+ Tree:
- Readers are completely lock-free. A `SELECT` on a billion-row table never blocks any concurrent `INSERT`, `UPDATE`, or `DELETE`.
- MVCC is “built in” — readers hold a pointer to the old root and see a consistent snapshot of the tree, exactly as MVCC requires.
- No deadlocks are possible during tree traversal (locks are never held during reads).
- Writes amplify by O(log n) page copies, but at depth 4 this is 4 × 16 KB = 64 KB per insert — acceptable for the target workload (OLTP, not write-heavy OLAP).
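The root-swap mechanism is small enough to sketch with std atomics. This is a simplified model — safe reclamation of pages reachable only from old roots is elided:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// The live root page id. Readers load it once and then traverse a
/// frozen snapshot; writers publish a new root with a single store.
struct Tree {
    root: AtomicU64,
}

impl Tree {
    /// Lock-free snapshot: whatever root we load stays valid, because
    /// CoW never mutates pages reachable from an old root.
    fn snapshot_root(&self) -> u64 {
        self.root.load(Ordering::Acquire)
    }

    /// Writer path: build the new (copied) pages first, then make the
    /// whole tree visible atomically.
    fn publish_root(&self, new_root: u64) {
        self.root.store(new_root, Ordering::Release);
    }
}

fn main() {
    let t = Tree { root: AtomicU64::new(1) };
    let snap = t.snapshot_root(); // a reader pins root 1
    t.publish_root(2);            // a writer publishes a new tree
    assert_eq!(snap, 1);          // the reader's snapshot is unaffected
    assert_eq!(t.snapshot_root(), 2); // new readers see the new root
}
```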
Why not LSM:
- LSM-trees have superior write throughput (sequential I/O only) but inferior read performance (must check multiple levels). AxiomDB’s target is OLTP with read-heavy workloads. A B+ Tree point lookup is O(log n) I/Os; an LSM lookup is O(L) compaction levels, each potentially requiring a disk seek.
- Compaction in LSM introduces unpredictable write amplification spikes that are difficult to tune for latency-sensitive OLTP.
next_leaf Not Used in Range Scans
| Aspect | Decision |
|---|---|
| Chosen | Re-traverse from root to find the next leaf on each boundary crossing |
| Alternatives | Keep the next_leaf linked list consistent under CoW |
| Phase | Phase 2 |
Why: Under CoW, next_leaf pointers in old leaf pages point to other old pages
that may have been freed. Maintaining a consistent linked list under CoW requires
copying the previous leaf on every insert near a boundary — but the previous leaf’s
page_id is not known during a top-down write path without additional bookkeeping.
The cost of the adopted solution (O(log n) per leaf boundary) is acceptable: for a 10,000-row range scan across ~47 leaves (217 rows/leaf), there are 46 boundary crossings, each costing 4 page reads = 184 extra page reads. At a measured scan time of 0.61 ms for 10,000 rows, this is within the 45 ms budget by a factor of 73.
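The arithmetic in this paragraph, made executable:

```rust
/// Extra page reads incurred by re-traversing from the root at each
/// leaf boundary during a range scan.
fn extra_reads(rows: u64, rows_per_leaf: u64, depth: u64) -> u64 {
    // Ceiling division: 10,000 rows / 217 rows-per-leaf = 47 leaves.
    let leaves = (rows + rows_per_leaf - 1) / rows_per_leaf;
    // 46 boundary crossings, each a full root-to-leaf descent.
    (leaves - 1) * depth
}

fn main() {
    // 46 crossings × 4 page reads = 184 extra reads, as in the text.
    assert_eq!(extra_reads(10_000, 217, 4), 184);
}
```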
Durability
WAL Without Double-Write Buffer
| Aspect | Decision |
|---|---|
| Chosen | WAL with per-page CRC32c; no double-write buffer |
| Alternatives | Double-write buffer (MySQL InnoDB); full page WAL images (PostgreSQL) |
| Phase | Phase 3 (WAL) |
Why no double-write:
- MySQL writes each page twice: once to the doublewrite buffer and once to the actual position. The doublewrite buffer protects against torn writes (partial page writes due to power failure mid-write).
- AxiomDB protects against torn writes with a CRC32c checksum per page. If a page has an invalid checksum on startup, it is reconstructed from the WAL. This requires the WAL to contain the information needed for reconstruction — which it does (the WAL records the full new_value for each UPDATE/INSERT).
- Eliminating the double-write buffer halves the disk writes for every dirty page flush.
Trade-off: Recovery requires reading more WAL data. If many pages are corrupted (e.g., a full power failure after a long write batch), recovery replays more WAL entries. In practice, with modern UPS and filesystem journaling, full-file corruption is rare. The WAL’s CRC32c catches partial writes reliably.
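A self-contained illustration of the torn-write check: a bitwise software CRC32c (Castagnoli polynomial, reflected form `0x82F63B78`) verifying an intact page and flagging a simulated partial write. A real engine would typically use the hardware-accelerated instruction instead.

```rust
/// Software CRC32c — a slow but correct stand-in for the
/// SSE4.2/ARMv8-accelerated version a storage engine would use.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78 // reflected Castagnoli poly
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

fn main() {
    let mut page = vec![0xABu8; 64]; // toy "page" body
    let stored = crc32c(&page);
    assert_eq!(crc32c(&page), stored); // intact page verifies

    page[10] ^= 0xFF; // simulate a torn / partial write
    // Checksum mismatch on startup => reconstruct the page from the WAL.
    assert_ne!(crc32c(&page), stored);
}
```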
Physical WAL (not Logical WAL)
| Aspect | Decision |
|---|---|
| Chosen | Physical WAL: records (page_id, slot_id, old_bytes, new_bytes) |
| Alternatives | Logical WAL: records SQL-level operations (INSERT INTO t VALUES…) |
| Phase | Phase 3 |
Why physical:
- Recovery is redo-only: replay each committed WAL entry at its exact physical location. No UNDO pass required (uncommitted changes are simply ignored).
- Physical location (page_id, slot_id) allows direct seek to the affected page — O(1) per WAL entry, not O(log n) B+ Tree traversal.
- The WAL key encodes `page_id:8 + slot_id:2` in 10 bytes, making the physical location self-contained in the WAL record.
Trade-off: Physical WAL entries are larger than logical ones (they contain the full encoded row bytes, not a SQL expression). For a row with 100 bytes of data, the WAL entry is ~100 + 43 bytes overhead = ~143 bytes. A logical WAL entry might be smaller for simple inserts. However, the simplicity and speed of redo-only physical recovery outweighs the size difference.
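A sketch of the 10-byte key layout. Only the `page_id:8 + slot_id:2` split comes from the text; the big-endian choice here is an assumption for illustration.

```rust
/// 10-byte physical WAL key: page_id (8 bytes) then slot_id (2 bytes).
/// Big-endian is assumed here so keys sort by page, then slot.
fn encode_key(page_id: u64, slot_id: u16) -> [u8; 10] {
    let mut key = [0u8; 10];
    key[..8].copy_from_slice(&page_id.to_be_bytes());
    key[8..].copy_from_slice(&slot_id.to_be_bytes());
    key
}

fn decode_key(key: &[u8; 10]) -> (u64, u16) {
    let page_id = u64::from_be_bytes(key[..8].try_into().unwrap());
    let slot_id = u16::from_be_bytes(key[8..].try_into().unwrap());
    (page_id, slot_id)
}

fn main() {
    let key = encode_key(42, 7);
    assert_eq!(key.len(), 10); // self-contained physical location
    assert_eq!(decode_key(&key), (42, 7)); // round-trips losslessly
}
```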
SQL Processing
logos for Lexing
| Aspect | Decision |
|---|---|
| Chosen | logos crate — compiled DFA |
| Alternatives | nom combinators; pest PEG; hand-written lexer; lalrpop |
| Phase | Phase 4.2 (SQL Lexer) |
Why logos:
- logos compiles all token patterns (keywords, identifiers, literals) into a single DFA at build time. Runtime cost per character is a table lookup — 1–3 CPU instructions.
- The `ignore(ascii_case)` attribute makes keyword matching case-insensitive with no runtime cost (the DFA is built with both cases folded).
- Zero-copy: `Ident(&'src str)` slices into the input without heap allocation.
- Measured throughput: 9–17× faster than sqlparser-rs for the same inputs.
nom is an excellent choice for context-free parsing with backtracking but is over-engineered for a lexer: a lexer is a regular language (no backtracking needed), and DFA is the optimal algorithm for it.
Zero-Copy Tokens
| Aspect | Decision |
|---|---|
| Chosen | Token::Ident(&'src str) — lifetime-tied reference into the input |
| Alternatives | Token::Ident(String) — owned heap allocation; Token::Ident(Arc<str>) |
| Phase | Phase 4.2 |
Why zero-copy:
- Heap allocation per identifier would cost ~30 ns on modern hardware (involving a `malloc` call). For a query with 20 identifiers, that is 600 ns of allocation overhead.
- At 2M queries/s (the target throughput), 600 ns per query consumes 1.2 s per second of CPU time in allocations — impossible to sustain.
- Zero-copy tokens require the input string to outlive the token stream, which is a natural constraint: the input is always available until the query finishes.
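The lifetime-tied token shape, with a toy whitespace splitter standing in for the logos-generated DFA:

```rust
/// Lifetime-tied tokens: each variant borrows from the input string
/// instead of allocating an owned String.
#[derive(Debug, PartialEq)]
enum Token<'src> {
    Ident(&'src str),
    Number(&'src str),
}

/// Toy lexer: splits on whitespace and classifies each word. The real
/// lexer is a compiled DFA; only the zero-copy output shape matters here.
fn lex(input: &str) -> Vec<Token<'_>> {
    input
        .split_whitespace()
        .map(|w| {
            if w.chars().all(|c| c.is_ascii_digit()) {
                Token::Number(w)
            } else {
                Token::Ident(w)
            }
        })
        .collect()
}

fn main() {
    let sql = "age 18";
    let tokens = lex(sql);
    // `Ident("age")` is a borrowed slice of `sql`, not a heap copy —
    // the token stream cannot outlive the input, by construction.
    assert_eq!(tokens, vec![Token::Ident("age"), Token::Number("18")]);
}
```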
MVCC Implementation
RowHeader in Heap Pages (not Undo Tablespace)
| Aspect | Decision |
|---|---|
| Chosen | MVCC metadata (xmin, xmax, deleted) in each heap row |
| Alternatives | Separate undo tablespace (MySQL InnoDB); version chain in B+ Tree (PostgreSQL MVCC heap) |
| Phase | Phase 3 (TxnManager) |
Why inline RowHeader:
- A historical row version is visible in its original heap location. No additional I/O is needed to read old versions — they are in the same page as the current version.
- MySQL’s undo tablespace (`ibdata1`) requires additional I/O for reads that need old row versions (the reader follows a pointer chain from the clustered index into the undo tablespace).
- Inline metadata is simpler to implement and audit.
Trade-offs:
- Dead rows occupy space in the heap until `VACUUM` (Phase 9) cleans them up.
- The `RowHeader` adds 24 bytes overhead per row. For a table with 50-byte average rows, this is 32% overhead. Acceptable for the generality it provides.
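A reduced visibility check over an inline header. Real MVCC also consults commit state; this sketch assumes every txn id at or below the snapshot has committed.

```rust
/// Inline MVCC metadata, sketched. Field layout is illustrative,
/// not AxiomDB's actual 24-byte header.
struct RowHeader {
    xmin: u64,         // txn that created this version
    xmax: Option<u64>, // txn that deleted it, if any
}

/// A version is visible to a snapshot if it was created at or before
/// the snapshot and not yet deleted as of the snapshot.
fn visible(row: &RowHeader, snapshot_txn: u64) -> bool {
    row.xmin <= snapshot_txn && row.xmax.map_or(true, |x| x > snapshot_txn)
}

fn main() {
    let row = RowHeader { xmin: 5, xmax: Some(9) };
    assert!(visible(&row, 7));   // created before, deleted after: visible
    assert!(!visible(&row, 10)); // already deleted at this snapshot
    assert!(!visible(&row, 4));  // not yet created
}
```

Because the header sits next to the row bytes, this check needs no extra I/O — exactly the property the inline design buys.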
Collation
UCA Root as Default Collation
| Aspect | Decision |
|---|---|
| Chosen | Unicode Collation Algorithm (UCA) root for string comparison |
| Alternatives | ASCII byte order; locale-specific collation; C locale (PostgreSQL default) |
| Phase | Phase 4 (Types) |
Why UCA root:
- ASCII byte order (`strcmp`) gives incorrect ordering for most non-English text: ‘ä’ sorts after ‘z’ in byte order, but should sort near ‘a’.
- UCA root is locale-neutral (deterministic across any server environment) while still correct for most languages.
- MySQL’s default collation (utf8mb4_general_ci) is not standards-compliant.
- UCA root is implemented by the `icu` crate — the same algorithm modern browsers use for `Intl.Collator`.
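The mis-ordering is easy to demonstrate with plain byte comparison. (A UCA collator — e.g. the `icu` crate's, not used here to keep the snippet dependency-free — would instead place ‘ä’ near ‘a’.)

```rust
fn main() {
    // Byte-wise comparison of UTF-8 puts 'ä' (0xC3 0xA4)
    // after 'z' (0x7A), which is wrong for human-facing ordering.
    assert!("ä".as_bytes() > "z".as_bytes());
}
```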
WAL Optimization
Per-Page WAL Entries (PageWrite) vs Per-Row WAL Entries
| Aspect | Decision |
|---|---|
| Chosen | EntryType::PageWrite = 9: one WAL entry per heap page for bulk inserts |
| Alternatives | Per-row Insert entries (original approach); full redo log (PostgreSQL WAL) |
| Phase | Phase 3.18 |
Why per-page:
For bulk inserts (INSERT INTO t VALUES (r1),(r2),...), the per-row approach writes one
WAL entry per row: 10,000 rows = 10,000 serialize_into() calls + 10,000 CRC32c
computations. Per-page replaces these with ~42 entries (one per 16 KB page, holding ~240
rows each) — 238× fewer serializations and a 30% smaller WAL file.
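The entry-count arithmetic from this paragraph, made executable:

```rust
/// WAL entry counts for a bulk insert: one entry per row vs one
/// PageWrite entry per filled heap page.
fn wal_entries(rows: u64, rows_per_page: u64) -> (u64, u64) {
    let per_row = rows;
    // Ceiling division: one PageWrite entry per touched page.
    let per_page = (rows + rows_per_page - 1) / rows_per_page;
    (per_row, per_page)
}

fn main() {
    let (per_row, per_page) = wal_entries(10_000, 240);
    assert_eq!(per_row, 10_000);
    assert_eq!(per_page, 42);            // ~42 PageWrite entries
    assert_eq!(per_row / per_page, 238); // ≈238× fewer serializations
}
```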
The PageWrite entry format stores:
- The full post-modification page bytes (`new_value[0..PAGE_SIZE]`) — available for future REDO-based power-failure recovery (Phase 3.8b).
- The inserted slot IDs (`new_value[PAGE_SIZE+2..]`) — used by crash recovery to undo uncommitted `PageWrite` entries by marking each slot dead, identical in effect to undoing N individual `Insert` entries.
Trade-offs accepted:
- Each `PageWrite` entry is ~16 KB vs ~100 bytes for an `Insert` entry. For sparse inserts (a few rows per page), `PageWrite` is larger. The optimization only applies to `insert_rows_batch()` (multi-row INSERT) — single-row inserts still use `Insert` entries.
- Crash recovery must parse the embedded slot list instead of simply reading a single physical location. The parsing is O(num_slots) per entry — still O(N) total, the same asymptotic cost.
Why not a full redo log (like PostgreSQL WAL):
PostgreSQL writes a physical page image + logical redo records for every page modification.
Our PageWrite is a simplified version: we write only the post-image (for bulk inserts)
and rely on the existing in-memory undo log for rollback. Full redo would require per-page
LSNs and a replay pass on startup — reserved for Phase 3.8b.
An alternative considered was to identify the slots to undo by scanning heap pages for rows where `txn_id_created == crashed_txn_id`. We rejected this because it requires reading the page from storage during the crash-recovery scan — before the undo phase even begins. Embedding the slot IDs in the `PageWrite` entry keeps crash recovery a pure WAL read pass: no storage I/O is needed to determine what to undo.
Content-Addressed BLOB Storage (Planned Phase 6)
| Aspect | Decision |
|---|---|
| Planned | SHA-256 content address as the BLOB key in a dedicated BLOB store |
| Alternatives | Inline BLOB in the heap (PostgreSQL TOAST); external file reference |
| Phase | Phase 6 |
Why content-addressed:
- Two rows storing the same attachment (e.g., a company logo in every invoice) share exactly one copy on disk. Deduplication is automatic and requires no extra schema.
- The BLOB store is append-only with immutable entries — no locking on BLOB reads.
- Deletion is handled by reference counting: when the last row referencing a BLOB is deleted, the BLOB can be garbage collected.
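A sketch of the planned design: content address as key, refcounted deduplication. std's `DefaultHasher` stands in for SHA-256 so the example is self-contained; a real store must use a cryptographic hash to make collisions negligible.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content-addressed store sketch: address -> (bytes, refcount).
struct BlobStore {
    blobs: HashMap<u64, (Vec<u8>, u64)>,
}

impl BlobStore {
    /// Derive the address from the content itself. SHA-256 in the
    /// real design; DefaultHasher here for a dependency-free sketch.
    fn address(data: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        h.finish()
    }

    /// Insert dedupes: identical content bumps a refcount instead of
    /// storing a second physical copy.
    fn insert(&mut self, data: &[u8]) -> u64 {
        let addr = Self::address(data);
        self.blobs
            .entry(addr)
            .and_modify(|(_, rc)| *rc += 1)
            .or_insert_with(|| (data.to_vec(), 1));
        addr
    }

    /// Release garbage-collects the blob once the last reference is gone.
    fn release(&mut self, addr: u64) {
        if let Some((_, rc)) = self.blobs.get_mut(&addr) {
            *rc -= 1;
            if *rc == 0 {
                self.blobs.remove(&addr);
            }
        }
    }
}

fn main() {
    let mut store = BlobStore { blobs: HashMap::new() };
    let a = store.insert(b"company logo bytes");
    let b = store.insert(b"company logo bytes"); // same attachment again
    assert_eq!(a, b);                 // same content => same address
    assert_eq!(store.blobs.len(), 1); // exactly one copy on disk
    store.release(a);
    store.release(b);
    assert!(store.blobs.is_empty()); // GC'd after the last reference
}
```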