AxiomDB
AxiomDB is a database engine written in Rust, designed to be fast, correct, and modern — while remaining compatible with the MySQL wire protocol so existing applications can connect without driver changes.
Goals
| Goal | How |
|---|---|
| Faster than MySQL for read-heavy workloads | Copy-on-Write B+ Tree with lock-free readers |
| Crash-safe without the MySQL double-write buffer overhead | Append-only WAL, no double-write |
| Drop-in compatible with MySQL clients | MySQL wire protocol on port 3306 |
| Embeddable like SQLite | C FFI, no daemon required (Phase 10) |
| Modern SQL out of the box | Unicode-correct collation, strict mode, structured errors |
Two Usage Modes
┌─────────────────────┐ ┌──────────────────────────┐
│ SERVER MODE │ │ EMBEDDED MODE │
│ │ │ │
│ TCP :3306 │ │ Direct function call │
│ MySQL wire proto │ │ C FFI / Rust API │
│ PHP, Python, Node │ │ No network, no daemon │
└─────────────────────┘ └──────────────────────────┘
└─────────────────┬─────────────────┘
│
Same Rust engine
Current Status
AxiomDB is under active development. Phases 1–6 are substantially complete:
- ✅ Storage engine — mmap-based 16 KB pages, freelist, heap pages, CRC32c checksums
- ✅ B+ Tree — Copy-on-Write, lock-free readers, prefix compression, range scan
- ✅ WAL — append-only, crash recovery, Group Commit, PageWrite bulk optimization
- ✅ Catalog — schema management, DDL change notifications, MVCC-consistent reads
- ✅ SQL layer — full DDL + DML parser, expression evaluator, semantic analyzer
- ✅ Executor — SELECT/INSERT/UPDATE/DELETE, JOIN, GROUP BY + aggregates, ORDER BY, subqueries, CASE WHEN, DISTINCT, TRUNCATE, ALTER TABLE
- ✅ Secondary indexes — CREATE INDEX, UNIQUE, query planner (index lookup + range)
- ✅ MySQL wire protocol — port 3306, COM_QUERY, prepared statements, pymysql compatible
Current concurrency model: read-only queries run concurrently, but mutating
statements are still serialized through a database-wide Arc<RwLock<Database>>
write guard. Row-level locking and true concurrent writers are planned for
Phase 13.7+.
Performance highlights
| Operation | AxiomDB | vs competition |
|---|---|---|
| Bulk INSERT (multi-row, 10K rows) | 211K rows/s | 1.5× faster than MariaDB 12.1 |
| Full-table DELETE (10K rows) | 1M rows/s | 3× faster than MariaDB, 40× faster than MySQL 8.0 |
| Full scan SELECT (10K rows) | 212K rows/s | ≈ MySQL 8.0 |
| Simple SELECT parse | 492 ns | parity with MySQL |
| Range scan 10K rows | 0.61 ms | ~74× faster than the 45 ms MySQL target |
What Makes AxiomDB Different
1. No double-write buffer
MySQL InnoDB uses a double-write buffer to protect against partial page writes, adding significant write overhead. AxiomDB uses a WAL-first architecture — pages are protected by the write-ahead log, eliminating this overhead entirely.
2. Lock-free read path
The B+ Tree uses Copy-on-Write semantics with an atomic root pointer, so the
storage layer itself does not need per-page read latches. In the current server
runtime, read-only queries execute concurrently, while mutating statements are
still serialized by a database-wide RwLock write guard. Row-level write
concurrency is the next planned step.
3. Smart collation out of the box
Most databases require explicit COLLATE declarations for correct Unicode sorting. AxiomDB defaults to UCA root collation (language-neutral Unicode ordering) and can be configured to behave like MySQL or PostgreSQL for migrations.
4. Strict mode always on
AxiomDB rejects data truncation, invalid dates (0000-00-00), and silent type coercions that MySQL allows by default. With SET AXIOM_COMPAT = 'mysql', lenient behavior is restored for migration scenarios.
5. Structured error messages
Inspired by the Rust compiler, every error includes: what went wrong, which table/column was involved, the offending value, and a hint for how to fix it.
Parser Performance
AxiomDB’s SQL parser is 9–17× faster than sqlparser-rs (the production standard used by Apache Arrow DataFusion and Delta Lake):
| Query type | AxiomDB | sqlparser-rs | Speedup |
|---|---|---|---|
| Simple SELECT | 492 ns | 4.38 µs | 8.9× |
| Complex SELECT (multi-JOIN) | 2.74 µs | 27.0 µs | 9.8× |
| CREATE TABLE | 824 ns | 14.5 µs | 16.6× |
This is achieved through a zero-copy lexer (identifiers are &str slices into the input — no heap allocations) combined with a hand-written recursive descent parser.
sqlparser-rs is used by Apache Arrow DataFusion, Delta Lake, and InfluxDB — widely considered the production standard for Rust SQL parsing. The 9–17× speedup is measured single-threaded, parse-only. At 2M simple queries/s, parsing is never the bottleneck for any realistic OLTP workload.
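The zero-copy tokenization strategy can be illustrated with a toy lexer. This is a sketch of the idea only, not AxiomDB's implementation — AxiomDB's Rust lexer returns &str slices into the input, whereas Python slicing copies, so here tokens are (kind, start, end) index triples to make the "no allocation during scanning" point explicit:

```python
# Toy single-pass SQL lexer: each token is (kind, start, end), indices into
# the input string. No substring is materialized while scanning -- the same
# idea as returning &str slices in Rust. Illustrative sketch, not real code.

KEYWORDS = {"SELECT", "FROM", "WHERE"}

def lex(sql: str):
    tokens, i, n = [], 0, len(sql)
    while i < n:
        c = sql[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == "_":
            start = i
            while i < n and (sql[i].isalnum() or sql[i] == "_"):
                i += 1
            kind = "KW" if sql[start:i].upper() in KEYWORDS else "IDENT"
            tokens.append((kind, start, i))
        elif c.isdigit():
            start = i
            while i < n and sql[i].isdigit():
                i += 1
            tokens.append(("NUM", start, i))
        else:
            tokens.append(("SYM", i, i + 1))
            i += 1
    return tokens

def text(sql: str, tok) -> str:
    # Materialize a token's text only when actually needed.
    return sql[tok[1]:tok[2]]

sql = "SELECT id, name FROM users WHERE age > 20"
toks = lex(sql)
print([(k, text(sql, (k, s, e))) for k, s, e in toks])
```

A hand-written recursive descent parser would then consume this token stream directly, which is where the single-pass design pays off.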
Getting Started
AxiomDB is a relational database engine written in Rust. It supports standard SQL, ACID transactions, a Write-Ahead Log for crash recovery, and a Copy-on-Write B+ Tree for lock-free concurrent reads. This guide walks you through connecting to AxiomDB, choosing a usage mode, and running your first queries.
Choosing a Usage Mode
AxiomDB operates in two distinct modes that share the exact same engine code.
Server Mode
The engine runs as a standalone daemon that speaks the MySQL wire protocol on TCP port 3306 (configurable). Any MySQL-compatible client connects without installing custom drivers.
Application (PHP / Python / Node.js)
│
│ TCP :3306 (MySQL wire protocol)
▼
axiomdb-server process
│
▼
axiomdb.db axiomdb.wal
When to use server mode:
- Web applications with REST or GraphQL APIs
- Microservices where multiple processes share a database
- Any environment where you would normally use MySQL
Embedded Mode
The engine is compiled into your process as a shared library (.so / .dylib / .dll).
There is no daemon, no network, and no port. Calls go directly to Rust code with
microsecond latency.
Your Application (Rust / C++ / Python / Electron)
│
│ direct function call (C FFI / Rust crate)
▼
AxiomDB engine (in-process)
│
▼
axiomdb.db axiomdb.wal (local files)
When to use embedded mode:
- Desktop applications (Qt, Electron, Tauri)
- CLI tools that need a local database
- Python scripts that need fast local storage without a daemon
- Any context where SQLite would be considered
Mode Comparison
| Feature | Server Mode | Embedded Mode |
|---|---|---|
| Latency | ~0.1 ms (TCP loopback) | ~1 µs (in-process) |
| Multiple processes | Yes | No (one process) |
| Installation | Binary + port | Library only |
| Compatible clients | Any MySQL client | Rust crate / C FFI |
| Ideal for | Web, APIs, microservices | Desktop, CLI, scripts |
Interactive Shell (CLI)
The axiomdb-cli binary connects directly to a database file — no server needed.
It works like sqlite3 or psql:
# Open an existing database (or create a new one)
axiomdb-cli ./mydb.db
# Pipe SQL from a file
axiomdb-cli ./mydb.db < migration.sql
# One-liner
echo "SELECT COUNT(*) FROM users;" | axiomdb-cli ./mydb.db
Inside the shell:
AxiomDB 0.1.0 — interactive shell
Type SQL ending with ; to execute. Type .help for commands.
axiomdb> CREATE TABLE users (id INT, name TEXT);
OK (1ms)
axiomdb> INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
2 rows affected (0ms)
axiomdb> SELECT * FROM users;
+----+-------+
| id | name |
+----+-------+
| 1 | Alice |
| 2 | Bob |
+----+-------+
2 rows (0ms)
axiomdb> .tables
users
axiomdb> .schema users
Table: users
id INT NOT NULL
name TEXT nullable
axiomdb> .quit
Bye.
Dot commands: .help · .tables · .schema [table] · .open <path> · .quit
Keyboard shortcuts (interactive mode): ↑ / ↓ history · Tab SQL completion · Ctrl-R reverse search · Ctrl-C cancel line · Ctrl-D exit. History is saved to ~/.axiomdb_history between sessions.
Server Mode — Connecting
Starting the Server
# Default: stores data in ./data, listens on port 3306
axiomdb-server
# Legacy env vars
AXIOMDB_DATA=/var/lib/axiomdb AXIOMDB_PORT=3307 axiomdb-server
# DSN bootstrap (Phase 5.15)
AXIOMDB_URL='axiomdb://0.0.0.0:3307/axiomdb?data_dir=/var/lib/axiomdb' axiomdb-server
The server is ready when you see:
INFO axiomdb_server: listening on 0.0.0.0:3306
AXIOMDB_URL is normalized in shared core code first; the server then accepts only the fields it actually supports in Phase 5.15, rather than silently inventing meanings for extra options.
In Phase 5.15, AXIOMDB_URL supports axiomdb://, mysql://,
postgres://, and postgresql:// URI syntax. The alias schemes are parse
aliases only: axiomdb-server still speaks the MySQL wire protocol only.
Supported server DSN fields:
- host and port from the URI authority
- data_dir from the query string
Unsupported query params are rejected explicitly instead of being ignored.
Connecting with the mysql CLI
mysql -h 127.0.0.1 -P 3306 -u root
No password is required in Phase 5. Any username from the allowlist (root, axiomdb,
admin) is accepted. See the Authentication section below for details.
Connecting with Python (PyMySQL)
import pymysql
conn = pymysql.connect(
host='127.0.0.1',
port=3306,
user='root',
db='axiomdb',
charset='utf8mb4',
)
with conn.cursor() as cursor:
# CREATE TABLE with AUTO_INCREMENT
cursor.execute("""
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
email TEXT NOT NULL
)
""")
# INSERT — last_insert_id is returned in the OK packet
cursor.execute("INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com')")
print("inserted id:", cursor.lastrowid)
# SELECT
cursor.execute("SELECT id, name FROM users")
for row in cursor.fetchall():
print(row)
conn.close()
If you issue many consecutive INSERT statements, wrap them in
an explicit BEGIN ... COMMIT. Phase 5.21 stages consecutive
INSERT ... VALUES statements in one transaction and flushes them together,
which is much faster than committing each row independently.
Parameterized Queries and ORMs (Prepared Statements)
When you pass parameters to cursor.execute(), PyMySQL (and any MySQL-compatible
driver) automatically uses COM_STMT_PREPARE / COM_STMT_EXECUTE — the MySQL
binary prepared statement protocol. AxiomDB supports this natively from Phase 5.10.
import pymysql
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', db='axiomdb')
with conn.cursor() as cursor:
cursor.execute("""
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
price DOUBLE NOT NULL,
active BOOL NOT NULL DEFAULT TRUE
)
""")
conn.commit()
# Parameterized INSERT — uses COM_STMT_PREPARE/EXECUTE automatically
cursor.execute(
"INSERT INTO products (name, price, active) VALUES (%s, %s, %s)",
('Wireless Keyboard', 49.99, True),
)
# NULL parameters work transparently
cursor.execute(
"INSERT INTO products (name, price, active) VALUES (%s, %s, %s)",
('USB-C Hub', 29.99, None),
)
# Parameterized SELECT
cursor.execute("SELECT id, name, price FROM products WHERE price < %s", (50.0,))
for row in cursor.fetchall():
print(row)
# Boolean column comparison works with integer literals (MySQL-compatible)
cursor.execute("SELECT name FROM products WHERE active = %s", (1,))
for row in cursor.fetchall():
print(row)
conn.close()
ORMs such as SQLAlchemy use parameterized queries for all data-bearing operations. Connecting through the MySQL dialect works without any additional configuration:
from sqlalchemy import create_engine, text
engine = create_engine("mysql+pymysql://root@127.0.0.1:3306/axiomdb")
with engine.connect() as conn:
result = conn.execute(
text("SELECT id, name FROM products WHERE price < :max_price"),
{"max_price": 40.0},
)
for row in result:
print(row)
Under the hood, cursor.execute(sql, params) sends a COM_STMT_PREPARE
to parse the SQL and register a statement ID, followed by COM_STMT_EXECUTE
with the binary-encoded parameters. The statement is cached per connection in AxiomDB
and released with COM_STMT_CLOSE when the cursor closes. This matches the
behavior expected by PyMySQL, mysqlclient, and SQLAlchemy's MySQL dialect.
Connecting with PHP (PDO)
<?php
$pdo = new PDO(
'mysql:host=127.0.0.1;port=3306;dbname=axiomdb',
'root',
'',
[PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);
$stmt = $pdo->query('SELECT id, name FROM users LIMIT 5');
foreach ($stmt as $row) {
echo $row['id'] . ': ' . $row['name'] . "\n";
}
Connecting with any MySQL GUI
Point MySQL Workbench, DBeaver, or TablePlus to 127.0.0.1:3306. No driver
installation is required — the MySQL wire protocol is fully compatible.
Charset and collation
AxiomDB negotiates charset and collation at the MySQL handshake boundary. The client
sends its preferred collation id in the HandshakeResponse41 packet; the server reads
it and configures the session accordingly.
Supported charsets:
| Charset | Collation ids | Notes |
|---|---|---|
| utf8mb4 | 45 (0900_ai_ci), 46 (0900_as_cs), 255 (0900_ai_ci) | Default for new connections |
| utf8 / utf8mb3 | 33 (general_ci), 83 (bin) | BMP-only; 4-byte code points (emoji) rejected |
| latin1 | 8 (swedish_ci), 47 (bin) | MySQL latin1 = Windows-1252 (0x80 = ‘€’, not ISO-8859-1) |
| binary | 63 | Raw bytes, no transcoding |
You can change the session charset at any time:
SET NAMES utf8mb4; -- sets client + connection + results
SET NAMES latin1 COLLATE latin1_bin; -- with explicit collation
SET character_set_results = utf8mb4; -- results charset only
Recommended: set charset='utf8mb4' in your client connection string. The AxiomDB
engine stores everything as UTF-8; utf8mb4 requires zero transcoding overhead and
supports the full Unicode range including emoji. Latin1 connections are supported
for legacy PHP/MySQL applications.
Authentication
AxiomDB Phase 5 uses permissive authentication: the server accepts any password
for usernames in the allowlist (root, axiomdb, admin, and the empty string).
Both of the most common MySQL authentication plugins are supported with no client-side
configuration required:
| Plugin | Clients | Notes |
|---|---|---|
| mysql_native_password | MySQL 5.x clients, older PyMySQL, mysql2 < 0.5 | 3-packet handshake (greeting → response → OK) |
| caching_sha2_password | MySQL 8.0+ default, PyMySQL >= 1.0, MySQL Connector/Python | 5-packet handshake (greeting → response → fast_auth_success → ack → OK) |
If your client connects with MySQL 8.0+ defaults and you see silent connection drops,
your client is using caching_sha2_password — AxiomDB handles this automatically.
No --default-auth flag or authPlugin option is needed.
Full password enforcement with stored credentials is planned for Phase 13 (Security).
Many MySQL clients and ORMs issue housekeeping queries on connect (SET NAMES,
SELECT @@version, SHOW DATABASES, etc.). AxiomDB intercepts and stubs these
automatically — no configuration needed.
Monitoring with SHOW STATUS
Monitoring tools, proxy servers, and health checks can query live server counters
using the standard MySQL SHOW STATUS syntax:
SHOW STATUS
SHOW GLOBAL STATUS
SHOW SESSION STATUS
SHOW STATUS LIKE 'Threads%'
SHOW GLOBAL STATUS LIKE 'Com_%'
Available variables:
| Variable | Scope | Description |
|---|---|---|
| Uptime | Global | Seconds since server start |
| Threads_connected | Global | Currently authenticated connections |
| Threads_running | Global | Connections actively executing a command |
| Questions | Session + Global | Total statements executed |
| Bytes_received | Session + Global | Bytes received from clients |
| Bytes_sent | Session + Global | Bytes sent to clients |
| Com_select | Session + Global | SELECT statement count |
| Com_insert | Session + Global | INSERT statement count |
| Innodb_buffer_pool_read_requests | Global | Storage read requests (compatibility) |
| Innodb_buffer_pool_reads | Global | Physical page reads (compatibility) |
Session scope (SHOW STATUS, SHOW SESSION STATUS, SHOW LOCAL STATUS) returns
per-connection values. Global scope (SHOW GLOBAL STATUS) returns server-wide totals.
Session counters reset when a connection is closed or COM_RESET_CONNECTION is issued.
Connection Timeout Variables
AxiomDB exposes the same timeout variables that MySQL clients expect at the session level:
SET wait_timeout = 30;
SET interactive_timeout = 300;
SET net_read_timeout = 60;
SET net_write_timeout = 60;
SELECT @@wait_timeout;
SELECT @@interactive_timeout;
SELECT @@net_read_timeout;
SELECT @@net_write_timeout;
Rules:
- wait_timeout applies while a non-interactive connection is idle between commands.
- interactive_timeout applies instead when the client connected with CLIENT_INTERACTIVE.
- net_write_timeout bounds packet writes once a command is already executing.
- net_read_timeout is reserved for future in-flight protocol reads and is already validated/stored as a real session variable.
- COM_RESET_CONNECTION resets all four variables back to their defaults.
Trying to set one of these variables to 0 or to a non-integer value returns an
error:
SET wait_timeout = 0;
-- ERROR ... wait_timeout must be a positive integer, got '0'
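The validation rule above can be mimicked in a few lines. This is a hypothetical helper, not AxiomDB code — the variable names and error wording simply follow the documented behavior:

```python
# Sketch of the session-variable validation rule described above:
# timeout variables must be positive integers; anything else is rejected.
TIMEOUT_VARS = {"wait_timeout", "interactive_timeout",
                "net_read_timeout", "net_write_timeout"}

def set_timeout(session: dict, name: str, value) -> None:
    if name not in TIMEOUT_VARS:
        raise ValueError(f"unknown system variable '{name}'")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got '{value}'")
    session[name] = value

session = {}
set_timeout(session, "wait_timeout", 30)
print(session)  # {'wait_timeout': 30}
```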
Embedded Mode — Rust API
Add AxiomDB to your Cargo.toml:
[dependencies]
axiomdb-embedded = { path = "../axiomdb/crates/axiomdb-embedded" }
Open a Database
use axiomdb_embedded::Db;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut db = Db::open("./axiomdb.db")?;
let mut db2 = Db::open_dsn("file:/tmp/axiomdb.db")?;
let mut db3 = Db::open_dsn("axiomdb:///tmp/axiomdb")?;
db.execute("CREATE TABLE users (id INT, name TEXT, age INT)")?;
db.execute("INSERT INTO users VALUES (1, 'Alice', 30)")?;
db.execute("INSERT INTO users VALUES (2, 'Bob', 25)")?;
let (columns, rows) = db.query_with_columns(
"SELECT id, name, age FROM users WHERE age > 20 ORDER BY name"
)?;
println!("{columns:?}");
for row in rows {
println!("{row:?}");
}
Ok(())
}
Db::open_dsn(...) accepts only local DSNs in Phase 5.15. Remote
wire-endpoint DSNs such as postgres://... parse successfully in the shared
parser but are rejected by the embedded API.
Explicit Transactions
#![allow(unused)]
fn main() {
let mut db = axiomdb_embedded::Db::open("./axiomdb.db")?;
db.begin()?;
db.execute("INSERT INTO accounts VALUES (1, 'Alice', 1000.0)")?;
db.execute("INSERT INTO accounts VALUES (2, 'Bob', 500.0)")?;
db.commit()?;
}
Embedded Mode — C FFI
For C, C++, Qt, or Java (JNI):
#include "axiomdb.h"
int main(void) {
AxiomDb* db = axiomdb_open("./axiomdb.db");
AxiomDb* db2 = axiomdb_open_dsn("file:/tmp/axiomdb.db");
if (!db) { fprintf(stderr, "failed to open\n"); return 1; }
axiomdb_execute(db, "CREATE TABLE users (id INT, name TEXT)");
axiomdb_execute(db, "INSERT INTO users VALUES (1, 'Alice')");
axiomdb_close(db);
axiomdb_close(db2);
return 0;
}
Python via ctypes
import ctypes
lib = ctypes.CDLL("./libaxiomdb.dylib")
lib.axiomdb_open.restype = ctypes.c_void_p
lib.axiomdb_open_dsn.restype = ctypes.c_void_p
lib.axiomdb_close.argtypes = [ctypes.c_void_p]
lib.axiomdb_execute.restype = ctypes.c_longlong
db = lib.axiomdb_open(b"./axiomdb.db")
db2 = lib.axiomdb_open_dsn(b"file:/tmp/axiomdb.db")
lib.axiomdb_execute(db, b"CREATE TABLE t (id INT)")
lib.axiomdb_close(db)
lib.axiomdb_close(db2)
Your First Schema — End to End
The following example creates a minimal e-commerce schema, inserts sample data, and runs a join query — all within embedded mode.
-- Create tables
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
price DECIMAL NOT NULL,
stock INT NOT NULL DEFAULT 0
);
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
product_id BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
quantity INT NOT NULL,
placed_at TIMESTAMP NOT NULL
);
CREATE INDEX idx_orders_product ON orders (product_id);
-- Insert data
INSERT INTO products (name, price, stock) VALUES
('Wireless Keyboard', 49.99, 200),
('USB-C Hub', 29.99, 500),
('Mechanical Mouse', 39.99, 150);
INSERT INTO orders (product_id, quantity, placed_at) VALUES
(1, 2, '2026-03-01 10:00:00'),
(2, 1, '2026-03-02 14:30:00'),
(1, 1, '2026-03-03 09:15:00');
-- Query with JOIN
SELECT
p.name,
o.quantity,
p.price * o.quantity AS line_total,
o.placed_at
FROM orders o
JOIN products p ON p.id = o.product_id
ORDER BY o.placed_at;
Expected output:
| name | quantity | line_total | placed_at |
|---|---|---|---|
| Wireless Keyboard | 2 | 99.98 | 2026-03-01 10:00:00 |
| USB-C Hub | 1 | 29.99 | 2026-03-02 14:30:00 |
| Wireless Keyboard | 1 | 49.99 | 2026-03-03 09:15:00 |
Bulk Insert — Best Practices
The way you issue INSERT statements has a large impact on throughput. AxiomDB is optimized for the multi-row VALUES form — one SQL string with all N rows:
-- Fast: one SQL string, all rows in one VALUES clause (~211K rows/s for 10K rows)
INSERT INTO products (name, price, stock) VALUES
('Widget A', 9.99, 100),
('Widget B', 14.99, 50),
('Widget C', 4.99, 200);
# Python — build one multi-row string, one execute() call
rows = [(f"product_{i}", i * 1.5, i * 10) for i in range(10_000)]
placeholders = ", ".join("(%s, %s, %s)" for _ in rows)
flat_values = [v for row in rows for v in row]
cursor.execute(f"INSERT INTO products (name, price, stock) VALUES {placeholders}",
flat_values)
conn.commit()
Why this matters: issuing N separate INSERT statements each pays its own parse + analyze overhead (~20 µs per string). A single multi-row string pays that cost once for all rows.
| Approach | Throughput |
|---|---|
| Multi-row VALUES (1 string, N rows) | 211K rows/s — recommended |
| N separate INSERT strings (1 txn) | ~35K rows/s — 6× slower |
| N separate autocommit INSERTs | ~58 q/s — 1 fsync per row |
For very large imports, split the rows into batches and wrap each batch in an
explicit BEGIN … COMMIT block. This limits WAL growth per transaction while
keeping throughput high. See Transactions for Group Commit configuration,
which further improves concurrent write throughput.
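The batching advice can be sketched as a generator that turns a row list into multi-row INSERT statements of bounded size. This is an illustrative helper (the table and column names are examples); each (sql, params) pair would be passed to cursor.execute() inside its own transaction:

```python
def batched_inserts(table, columns, rows, batch_size=10_000):
    """Yield (sql, flat_params) pairs, one multi-row INSERT per batch."""
    row_tpl = "(" + ", ".join(["%s"] * len(columns)) + ")"
    cols = ", ".join(columns)
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        sql = (f"INSERT INTO {table} ({cols}) VALUES "
               + ", ".join([row_tpl] * len(chunk)))
        # Flatten the row tuples into one parameter list for the driver.
        yield sql, [v for row in chunk for v in row]

rows = [(f"product_{i}", i * 1.5, i * 10) for i in range(25_000)]
batches = list(batched_inserts("products", ["name", "price", "stock"], rows))
print(len(batches))  # 3 batches: 10K + 10K + 5K rows
```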
Next Steps
- SQL Reference — Data Types — full type system
- SQL Reference — DDL — CREATE TABLE, indexes, constraints
- SQL Reference — DML — SELECT, INSERT, UPDATE, DELETE
- Transactions — BEGIN, COMMIT, ROLLBACK, MVCC
- Performance — benchmark numbers and tuning tips
SQL Reference
This section covers the complete SQL dialect supported by AxiomDB.
- Data Types — all supported column types with storage sizes and usage examples
- DDL — Schema Definition — CREATE TABLE, CREATE INDEX, DROP TABLE, DROP INDEX, constraints
- DML — Queries & Mutations — SELECT, INSERT, UPDATE, DELETE with full clause reference
- Expressions & Operators — operators, functions, NULL semantics, LIKE, IN, BETWEEN
Data Types
AxiomDB implements a rich type system that covers the common SQL standard types as well as several extensions for modern workloads (UUID, JSON, VECTOR for AI embeddings, RANGE types for temporal and numeric overlaps).
Integer Types
| SQL Type | Aliases | Storage | Rust type | Range |
|---|---|---|---|---|
| BOOL | BOOLEAN | 1 byte | bool | TRUE / FALSE |
| TINYINT | INT1 | 1 byte | i8 | -128 to 127 |
| UTINYINT | UINT1 | 1 byte | u8 | 0 to 255 |
| SMALLINT | INT2 | 2 bytes | i16 | -32,768 to 32,767 |
| USMALLINT | UINT2 | 2 bytes | u16 | 0 to 65,535 |
| INT | INTEGER, INT4 | 4 bytes | i32 | -2,147,483,648 to 2,147,483,647 |
| UINT | UINT4 | 4 bytes | u32 | 0 to 4,294,967,295 |
| BIGINT | INT8 | 8 bytes | i64 | -9.2 × 10¹⁸ to 9.2 × 10¹⁸ |
| UBIGINT | UINT8 | 8 bytes | u64 | 0 to 18.4 × 10¹⁸ (used for LSN, page_id) |
| HUGEINT | INT16 | 16 bytes | i128 | ±1.7 × 10³⁸ (cryptography, checksums) |
-- Typical primary key
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
age SMALLINT NOT NULL
);
-- Unsigned counter that never goes negative
CREATE TABLE page_views (
page_id INT NOT NULL,
views UINT NOT NULL DEFAULT 0
);
Floating-Point Types
| SQL Type | Aliases | Storage | Rust type | Notes |
|---|---|---|---|---|
| REAL | FLOAT4, FLOAT | 4 bytes | f32 | Coordinates, ratings, embeddings |
| DOUBLE | FLOAT8, DOUBLE PRECISION | 8 bytes | f64 | Scientific calculations |
NaN is forbidden. The row codec rejects NaN values at encode time. IEEE 754 infinities are also not accepted by default.
-- Geospatial coordinates (4-byte precision is sufficient)
CREATE TABLE locations (
id INT PRIMARY KEY,
lat REAL NOT NULL,
lon REAL NOT NULL
);
-- Scientific measurements requiring high precision
CREATE TABLE experiments (
id INT PRIMARY KEY,
result DOUBLE NOT NULL
);
Exact Numeric — DECIMAL
| SQL Type | Aliases | Storage | Rust type | Notes |
|---|---|---|---|---|
| DECIMAL(p, s) | NUMERIC(p, s) | 17 bytes | i128 + u8 scale | Exact arithmetic, no float error |
Always use DECIMAL for money. Floating-point types cannot represent
0.1 + 0.2 exactly; DECIMAL always can.
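The float-error claim is easy to verify with Python's decimal module, which uses the same exact-arithmetic idea as a DECIMAL column:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
print(0.1 + 0.2)                            # 0.30000000000000004
# Exact decimal arithmetic never drifts.
print(Decimal("0.1") + Decimal("0.2"))      # 0.3
print(Decimal("199.99") * Decimal("0.19"))  # 37.9981, exactly
```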
CREATE TABLE invoices (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
subtotal DECIMAL NOT NULL, -- DECIMAL without precision = DECIMAL(38,0)
tax_rate DECIMAL NOT NULL,
total DECIMAL NOT NULL
);
-- Insert with exact values
INSERT INTO invoices (subtotal, tax_rate, total)
VALUES (199.99, 0.19, 237.99);
-- Arithmetic is always exact
SELECT subtotal * tax_rate AS computed_tax FROM invoices WHERE id = 1;
-- Returns: 37.9981 (never 37.99809999999...)
The internal codec stores DECIMAL as a 16-byte little-endian i128 mantissa followed
by a 1-byte scale (total 17 bytes per non-NULL value).
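The 17-byte layout described above can be sketched as follows — an illustrative encoder based on the description, not the actual codec:

```python
def encode_decimal(mantissa: int, scale: int) -> bytes:
    # 16-byte little-endian signed i128 mantissa followed by a 1-byte scale,
    # as described above: 17 bytes total per non-NULL value.
    return mantissa.to_bytes(16, "little", signed=True) + bytes([scale])

def decode_decimal(buf: bytes):
    return int.from_bytes(buf[:16], "little", signed=True), buf[16]

# 199.99 == mantissa 19999 at scale 2
buf = encode_decimal(19999, 2)
print(len(buf), decode_decimal(buf))  # 17 (19999, 2)
```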
Text Types
| SQL Type | Aliases | Max length | Rust type | Notes |
|---|---|---|---|---|
| CHAR(n) | — | n bytes (fixed) | [u8; n] | Right-padded with spaces |
| VARCHAR(n) | — | n bytes (max) | String | Variable-length, UTF-8 |
| TEXT | — | 16,777,215 bytes | String | Unlimited (TOAST if >16 KB) |
| CITEXT | — | 16,777,215 bytes | String | Case-insensitive comparison |
The codec encodes TEXT and VARCHAR with a 3-byte (u24) length prefix followed by
raw UTF-8 bytes. This limits inline storage to 16,777,215 bytes; values larger than a
page use TOAST (planned Phase 6).
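A sketch of that length-prefixed layout (illustrative only, not the real codec; the prefix byte order is an assumption):

```python
MAX_INLINE = (1 << 24) - 1  # 16,777,215 -- the u24 limit quoted above

def encode_text(s: str) -> bytes:
    raw = s.encode("utf-8")
    if len(raw) > MAX_INLINE:
        raise ValueError("value exceeds u24 inline limit (would go to TOAST)")
    # 3-byte (u24) length prefix, then raw UTF-8 bytes.
    # Byte order of the prefix is an assumption for illustration.
    return len(raw).to_bytes(3, "little") + raw

def decode_text(buf: bytes) -> str:
    n = int.from_bytes(buf[:3], "little")
    return buf[3:3 + n].decode("utf-8")

buf = encode_text("héllo")
print(len(buf), decode_text(buf))  # 9 héllo  ('é' takes 2 UTF-8 bytes)
```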
-- Fixed-length codes (ISO country, state abbreviations)
CREATE TABLE countries (
code CHAR(2) PRIMARY KEY, -- 'US', 'DE', 'JP'
name VARCHAR(128) NOT NULL
);
-- Unlimited text content
CREATE TABLE blog_posts (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
title VARCHAR(512) NOT NULL,
body TEXT NOT NULL
);
-- Case-insensitive email lookup
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email CITEXT NOT NULL UNIQUE
);
-- SELECT * FROM users WHERE email = 'ALICE@EXAMPLE.COM'
-- matches rows where email = 'alice@example.com'
Binary Type
| SQL Type | Aliases | Max length | Rust type | Notes |
|---|---|---|---|---|
| BYTEA | BLOB, BYTES | 16,777,215 bytes | Vec<u8> | Raw bytes, hex display |
CREATE TABLE attachments (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
content BYTEA NOT NULL
);
-- Insert binary with hex literal
INSERT INTO attachments (name, content) VALUES ('icon.png', X'89504e47');
-- Display as hex
SELECT name, encode(content, 'hex') FROM attachments;
Date and Time Types
| SQL Type | Storage | Internal repr | Notes |
|---|---|---|---|
| DATE | 4 bytes | i32 days since 1970-01-01 | No time component |
| TIME | 8 bytes | i64 µs since midnight | No timezone |
| TIMETZ | 12 bytes | i64 µs + i32 offset | Time with timezone offset |
| TIMESTAMP | 8 bytes | i64 µs since UTC epoch | Without timezone (ambiguous) |
| TIMESTAMPTZ | 8 bytes | i64 µs UTC | Recommended. Always UTC internally |
| INTERVAL | 16 bytes | i32 months + i32 days + i64 µs | Correct calendar arithmetic |
Prefer TIMESTAMPTZ over TIMESTAMP. Without a timezone, there is no way to determine the absolute instant when the server and client are in different timezones. TIMESTAMPTZ stores everything as UTC and converts on display.
CREATE TABLE events (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
title TEXT NOT NULL,
starts_at TIMESTAMPTZ NOT NULL,
ends_at TIMESTAMPTZ NOT NULL,
duration INTERVAL
);
INSERT INTO events (title, starts_at, ends_at, duration)
VALUES (
'Team meeting',
'2026-03-21 10:00:00+00',
'2026-03-21 11:00:00+00',
'1 hour'
);
INTERVAL — Calendar-Correct Arithmetic
INTERVAL separates months, days, and microseconds because they are not fixed durations:
- “1 month” added to January 31 gives February 28 (or 29).
- “1 day” during a DST transition can be 23 or 25 hours.
-- Add 1 month to a date (calendar-aware)
SELECT '2026-01-31'::DATE + INTERVAL '1 month'; -- 2026-02-28
-- Add 30 days (fixed)
SELECT '2026-01-31'::DATE + INTERVAL '30 days'; -- 2026-03-02
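The clamping behavior shown in the SQL above can be reproduced with the standard library. A sketch of the month-addition rule, using calendar.monthrange to find the target month's length:

```python
import calendar
from datetime import date, timedelta

def add_months(d: date, months: int) -> date:
    # Calendar-aware: clamp the day to the target month's length,
    # matching the INTERVAL '1 month' behavior shown above.
    m = d.month - 1 + months
    year, month = d.year + m // 12, m % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

print(add_months(date(2026, 1, 31), 1))        # 2026-02-28 (clamped)
print(date(2026, 1, 31) + timedelta(days=30))  # 2026-03-02 (fixed 30 days)
```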
UUID
| SQL Type | Storage | Notes |
|---|---|---|
| UUID | 16 bytes | Stored as raw 16 bytes, displayed as hex |
CREATE TABLE sessions (
id UUID PRIMARY KEY DEFAULT gen_uuid_v7(),
user_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL
);
UUID v7 vs v4 as Primary Key:
| Strategy | Insert rate (1M rows) | Reason |
|---|---|---|
| UUID v4 | ~150k inserts/s | Random → many B+ Tree page splits |
| UUID v7 | ~250k inserts/s | Time-ordered prefix → nearly sequential |
| BIGINT | ~280k inserts/s | Fully sequential |
For new schemas, prefer UUID v7 (time-sortable) or BIGINT AUTO_INCREMENT.
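The v7 advantage comes entirely from its time-ordered prefix. A minimal v7-style generator following the RFC 9562 layout — a sketch for illustration, not AxiomDB's gen_uuid_v7:

```python
import os
import time
import uuid

def uuid_v7() -> uuid.UUID:
    # 48-bit big-endian unix-ms timestamp, then version/variant bits,
    # then randomness -- so later UUIDs compare higher, like the table shows.
    ms = time.time_ns() // 1_000_000
    b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
    b[6] = (b[6] & 0x0F) | 0x70  # version 7
    b[8] = (b[8] & 0x3F) | 0x80  # RFC 4122/9562 variant
    return uuid.UUID(bytes=bytes(b))

a = uuid_v7()
time.sleep(0.002)
b = uuid_v7()
print(a < b)  # True: time-ordered, so B+ Tree inserts stay nearly sequential
```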
Network Types
| SQL Type | Storage | Notes |
|---|---|---|
| INET | 16 bytes | IPv4 or IPv6 address |
| CIDR | 17 bytes | IP network with prefix mask |
| MACADDR | 6 bytes | MAC address |
CREATE TABLE access_log (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
client_ip INET NOT NULL,
network CIDR,
mac MACADDR
);
JSON / JSONB
| SQL Type | Aliases | Notes |
|---|---|---|
| JSON | JSONB | Stored as serialized JSON; TOAST if > 2 KB |
CREATE TABLE api_responses (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
endpoint TEXT NOT NULL,
payload JSON NOT NULL
);
INSERT INTO api_responses (endpoint, payload)
VALUES ('/users', '{"count": 42, "items": []}');
VECTOR — AI Embeddings
| SQL Type | Storage | Notes |
|---|---|---|
| VECTOR(n) | 4n bytes | Array of n 32-bit floats (f32) |
-- Store sentence embeddings from an AI model
CREATE TABLE documents (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
content TEXT NOT NULL,
embedding VECTOR(384) NOT NULL -- e.g. all-MiniLM-L6-v2 output
);
-- Approximate nearest-neighbor search (ANN index required)
SELECT id, content
FROM documents
ORDER BY embedding <-> '[0.12, 0.34, ...]'::vector
LIMIT 10;
RANGE Types
RANGE types represent a continuous span of a base type, with inclusive/exclusive
bounds. They support containment (@>), overlapping (&&), and
exclusion constraints.
| SQL Type | Base type | Example |
|---|---|---|
| INT4RANGE | INT | [1, 100) |
| INT8RANGE | BIGINT | [1000, 9999] |
| DATERANGE | DATE | [2026-01-01, 2026-12-31] |
| TSRANGE | TIMESTAMP | [2026-01-01 09:00, ...) |
| TSTZRANGE | TIMESTAMPTZ | timezone-aware variant |
-- Prevent overlapping reservations using an exclusion constraint
CREATE TABLE room_reservations (
room_id INT NOT NULL,
period TSRANGE NOT NULL,
EXCLUDE USING gist(room_id WITH =, period WITH &&)
);
INSERT INTO room_reservations VALUES (1, '[2026-03-21 09:00, 2026-03-21 11:00)');
-- This next insert fails: the period overlaps with the existing row
INSERT INTO room_reservations VALUES (1, '[2026-03-21 10:00, 2026-03-21 12:00)');
-- ERROR: exclusion constraint violation
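The overlap test behind && can be expressed for half-open [start, end) ranges — a sketch of the semantics, not engine code (ISO-formatted timestamp strings compare correctly as plain strings, which keeps the example self-contained):

```python
def overlaps(a, b) -> bool:
    # Half-open [start, end) ranges overlap iff each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def contains(r, point) -> bool:
    # The @> containment operator, applied to a single point.
    return r[0] <= point < r[1]

existing = ("2026-03-21 09:00", "2026-03-21 11:00")
incoming = ("2026-03-21 10:00", "2026-03-21 12:00")
print(overlaps(existing, incoming))  # True -> the exclusion constraint rejects it
# Touching bounds do NOT overlap with half-open ranges:
print(overlaps(existing, ("2026-03-21 11:00", "2026-03-21 12:00")))  # False
```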
NULL in Every Type
Every column of every type can hold NULL unless declared NOT NULL. The row codec
stores a compact null bitmap at the start of each row (1 bit per column), so NULL
costs only 1 bit of overhead regardless of the underlying type size.
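The bitmap described above can be sketched like this — illustrative only; the LSB-first bit order is an assumption:

```python
def null_bitmap(row) -> bytes:
    # One bit per column (bit set => NULL), packed LSB-first within each
    # byte. Bit order is an assumption for illustration.
    nbytes = (len(row) + 7) // 8
    bitmap = bytearray(nbytes)
    for i, v in enumerate(row):
        if v is None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

row = [1, None, "Alice", None, 3.5]
print(null_bitmap(row).hex())  # 0a -> bits 1 and 3 set, one byte for 5 columns
```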
SELECT NULL + 5; -- NULL (any arithmetic with NULL propagates NULL)
SELECT NULL = NULL; -- NULL (not TRUE — use IS NULL instead)
SELECT NULL IS NULL; -- TRUE
SELECT COALESCE(NULL, 0); -- 0 (return first non-NULL argument)
See Expressions & Operators for the full NULL semantics table.
DDL — Schema Definition Language
DDL statements define and modify the structure of the database: tables, columns, constraints, and indexes. All DDL operations are transactional in AxiomDB — a failed DDL statement is automatically rolled back.
CREATE DATABASE
Creates a new logical database in the persisted catalog.
Syntax
CREATE DATABASE database_name;
Example
CREATE DATABASE analytics;
SHOW DATABASES;
Expected output includes:
| Database |
|---|
| analytics |
| axiomdb |
CREATE DATABASE fails if the name already exists:
CREATE DATABASE analytics;
-- ERROR 1007 (HY000): Can't create database 'analytics'; database exists
DROP DATABASE
Removes a logical database from the catalog.
Syntax
DROP DATABASE database_name;
DROP DATABASE IF EXISTS database_name;
Behavior
- Removing a database also removes the tables it owns from SQL/catalog lookup.
- IF EXISTS suppresses the error for a missing database.
- The current connection cannot drop the database it has selected with USE.
DROP DATABASE analytics;
DROP DATABASE IF EXISTS scratch;
USE analytics;
DROP DATABASE analytics;
-- ERROR 1105 (HY000): Can't drop database 'analytics'; database is currently selected
CREATE DATABASE and DROP DATABASE are catalog-backed today, but
cross-database queries such as other_db.public.users are still deferred to the
next multi-database subphase.
CREATE TABLE
Basic Syntax
CREATE TABLE [IF NOT EXISTS] table_name (
column_name data_type [column_constraints...],
...
[table_constraints...]
);
Column Constraints
NOT NULL
Rejects any attempt to insert or update a row with a NULL value in this column.
CREATE TABLE employees (
id BIGINT NOT NULL,
name TEXT NOT NULL,
dept TEXT -- nullable: dept may be unassigned
);
DEFAULT
Provides a value when the column is omitted from INSERT.
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
status TEXT NOT NULL DEFAULT 'pending',
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
priority INT NOT NULL DEFAULT 0
);
-- Default values are used automatically
INSERT INTO orders (status) VALUES ('shipped');
-- Row: id=<auto>, status='shipped', created_at=<now>, priority=0
PRIMARY KEY
Declares a column (or set of columns) as the primary key. A primary key:
- Implies NOT NULL
- Creates a unique B+ Tree index automatically
- Is the target of REFERENCES in foreign keys
-- Single-column primary key
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL
);
-- Composite primary key (declared as table constraint)
CREATE TABLE order_items (
order_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
quantity INT NOT NULL,
PRIMARY KEY (order_id, product_id)
);
UNIQUE
Guarantees no two rows share the same value in this column (or set of columns). NULL values are excluded from uniqueness checks — multiple NULLs are allowed.
CREATE TABLE accounts (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email TEXT NOT NULL UNIQUE,
username TEXT NOT NULL UNIQUE
);
AUTO_INCREMENT / SERIAL
Automatically generates a monotonically increasing integer for each new row. The counter starts at 1 and increments by 1 for each inserted row. The following spellings are accepted:
-- MySQL-style
id BIGINT PRIMARY KEY AUTO_INCREMENT
-- PostgreSQL-style shorthand (SERIAL = INT AUTO_INCREMENT, BIGSERIAL = BIGINT AUTO_INCREMENT)
id SERIAL PRIMARY KEY
id BIGSERIAL PRIMARY KEY
Behavior:
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL
);
-- Omit the AUTO_INCREMENT column — the engine generates the value
INSERT INTO users (name) VALUES ('Alice'); -- id = 1
INSERT INTO users (name) VALUES ('Bob'); -- id = 2
-- Retrieve the last generated ID (current session only)
SELECT LAST_INSERT_ID(); -- returns 2
SELECT lastval(); -- PostgreSQL alias — same result
-- Multi-row INSERT: LAST_INSERT_ID() returns the ID of the FIRST row in the batch
INSERT INTO users (name) VALUES ('Carol'), ('Dave'); -- ids: 3, 4
SELECT LAST_INSERT_ID(); -- returns 3
-- Explicit non-NULL value bypasses the sequence and does NOT advance it
INSERT INTO users (id, name) VALUES (100, 'Eve');
-- id=100; sequence remains at 4; next auto id will be 5
LAST_INSERT_ID() returns 0 if no auto-increment INSERT has been performed
in the current session. See LAST_INSERT_ID() in expressions
for the full function reference.
TRUNCATE resets the counter:
TRUNCATE TABLE users;
INSERT INTO users (name) VALUES ('Frank'); -- id = 1 (reset by TRUNCATE)
REFERENCES — Foreign Keys
Declares a foreign key relationship to another table’s primary key.
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
product_id BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
placed_at TIMESTAMP NOT NULL
);
ON DELETE actions:
| Action | Behavior when the referenced row is deleted |
|---|---|
| RESTRICT | Reject the DELETE if any referencing row exists (default) |
| CASCADE | Delete all referencing rows automatically |
| SET NULL | Set the foreign key column to NULL |
| SET DEFAULT | Set the foreign key column to its DEFAULT value |
| NO ACTION | Same as RESTRICT but deferred to end of statement |
ON UPDATE actions: Same options as ON DELETE — apply when the referenced primary key is updated.
Current limitation: Only ON UPDATE RESTRICT (the default) is enforced. ON UPDATE CASCADE and ON UPDATE SET NULL return NotImplemented and are planned for Phase 6.10. Write ON UPDATE RESTRICT or omit the clause entirely for correct behaviour today.
CREATE TABLE order_items (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
order_id BIGINT NOT NULL
REFERENCES orders(id)
ON DELETE CASCADE
    ON UPDATE CASCADE,   -- note: currently returns NotImplemented (see limitation above)
product_id BIGINT NOT NULL
REFERENCES products(id)
ON DELETE RESTRICT
ON UPDATE RESTRICT,
quantity INT NOT NULL,
unit_price DECIMAL NOT NULL
);
CHECK
Validates that a condition is TRUE for every row. A row where the CHECK condition evaluates to FALSE or NULL is rejected.
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
price DECIMAL NOT NULL CHECK (price > 0),
stock INT NOT NULL CHECK (stock >= 0),
rating REAL CHECK (rating IS NULL OR (rating >= 1.0 AND rating <= 5.0))
);
Table-Level Constraints
Table constraints apply to multiple columns and are declared after all column definitions.
CREATE TABLE shipments (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
order_id BIGINT NOT NULL,
warehouse_id INT NOT NULL,
shipped_at TIMESTAMP,
delivered_at TIMESTAMP,
-- Named constraints (recommended for meaningful error messages)
CONSTRAINT fk_shipment_order
FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
CONSTRAINT chk_delivery_after_shipment
CHECK (delivered_at IS NULL OR delivered_at >= shipped_at),
CONSTRAINT uq_one_active_shipment
UNIQUE (order_id, warehouse_id)
);
IF NOT EXISTS
Suppresses the error when the table already exists. Useful in migration scripts.
CREATE TABLE IF NOT EXISTS config (
key TEXT NOT NULL UNIQUE,
value TEXT NOT NULL
);
Full Example — E-commerce Schema
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email TEXT NOT NULL UNIQUE,
name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
deleted_at TIMESTAMPTZ
);
CREATE TABLE categories (
id INT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
category_id INT NOT NULL REFERENCES categories(id),
name TEXT NOT NULL,
description TEXT,
price DECIMAL NOT NULL CHECK (price > 0),
stock INT NOT NULL DEFAULT 0 CHECK (stock >= 0),
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id BIGINT NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
total DECIMAL NOT NULL CHECK (total >= 0),
status TEXT NOT NULL DEFAULT 'pending',
placed_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
shipped_at TIMESTAMPTZ,
CONSTRAINT chk_order_status CHECK (
status IN ('pending', 'paid', 'shipped', 'delivered', 'cancelled')
)
);
CREATE TABLE order_items (
order_id BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
product_id BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price DECIMAL NOT NULL CHECK (unit_price > 0),
PRIMARY KEY (order_id, product_id)
);
CREATE INDEX
Indexes accelerate lookups and range scans. AxiomDB automatically creates a unique B+
Tree index for every PRIMARY KEY and UNIQUE constraint. Additional indexes are created
explicitly. CREATE INDEX works on both heap tables and clustered (PRIMARY KEY) tables.
Basic Syntax
CREATE [UNIQUE] INDEX [IF NOT EXISTS] index_name
ON table_name (column [ASC|DESC], ...)
[WITH (fillfactor = N)]
[WHERE condition];
fillfactor controls how full a B-Tree leaf page gets before splitting (10–100,
default 90). Lower values leave room for future inserts without triggering splits.
See Fill Factor for details.
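As arithmetic, a fillfactor of N means a leaf is allowed to fill to N% of the page before it splits. A rough sketch of the threshold for 16 KB pages (illustrative only: per-page header overhead is ignored, and the exact accounting inside AxiomDB may differ):

```python
PAGE_SIZE = 16 * 1024  # AxiomDB uses 16 KB pages

def split_threshold(fillfactor: int) -> int:
    """Used bytes at which a leaf page splits, given the fillfactor setting."""
    assert 10 <= fillfactor <= 100
    return PAGE_SIZE * fillfactor // 100

assert split_threshold(100) == 16384  # pack pages completely
assert split_threshold(90) == 14745   # default: ~1.6 KB of headroom
assert split_threshold(70) == 11468   # append-heavy: 30% headroom for inserts
```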
Examples
-- Standard index
CREATE INDEX idx_users_email ON users (email);
-- Composite index: queries filtering by (user_id, placed_at) benefit
CREATE INDEX idx_orders_user_date ON orders (user_id, placed_at DESC);
-- Unique index (equivalent to UNIQUE column constraint)
CREATE UNIQUE INDEX uq_products_sku ON products (sku);
-- Partial index: index only active products (reduces index size)
CREATE INDEX idx_active_products ON products (category_id)
WHERE deleted_at IS NULL;
-- Fill factor: append-heavy time-series table (leaves 30% free for inserts)
CREATE INDEX idx_ts ON events(created_at) WITH (fillfactor = 70);
-- Fill factor + partial index combined
CREATE UNIQUE INDEX uq_active_email ON users(email)
WITH (fillfactor = 80)
WHERE deleted_at IS NULL;
-- The WITH clause can appear before or after WHERE (both are accepted)
When to Add an Index
- Columns appearing in WHERE, JOIN ON, or ORDER BY clauses on large tables
- Foreign key columns (AxiomDB does not auto-index FK columns — add them explicitly)
- Columns used in range queries (BETWEEN, >, <)
See Indexes for the query planner interaction and composite index column ordering rules.
DROP TABLE
Removes a table and all its data permanently.
DROP TABLE [IF EXISTS] table_name [CASCADE | RESTRICT];
| Option | Behavior |
|---|---|
| RESTRICT | Fail if any other table has a foreign key referencing this table (default) |
| CASCADE | Also drop all foreign key constraints that reference this table |
-- Safe drop: fails if referenced by other tables
DROP TABLE products;
-- Drop without error if already gone
DROP TABLE IF EXISTS temp_import;
-- Drop even if referenced (removes FK constraints first)
DROP TABLE categories CASCADE;
Dropping a table is immediate and permanent. There is no RECYCLE BIN. Make sure you have a backup or are inside a transaction if you need to recover.
DROP INDEX
Removes an index. The table and its data are not affected.
DROP INDEX [IF EXISTS] index_name;
DROP INDEX idx_users_email;
DROP INDEX IF EXISTS idx_old_lookup;
ALTER TABLE
Modifies the structure of an existing table. All ALTER TABLE forms are blocking operations — no concurrent DDL is allowed while an ALTER TABLE is in progress.
Add Column
Adds a new column at the end of the column list. If existing rows are present,
they are rewritten to include the default value for the new column. If no
DEFAULT clause is given, existing rows receive NULL for that column.
ALTER TABLE table_name ADD COLUMN column_name data_type [NOT NULL] [DEFAULT expr];
-- Add a nullable column (existing rows get NULL)
ALTER TABLE users ADD COLUMN phone TEXT;
-- Add a NOT NULL column with a default (existing rows get 0)
ALTER TABLE orders ADD COLUMN priority INT NOT NULL DEFAULT 0;
-- Add a column with a string default
ALTER TABLE products ADD COLUMN status TEXT NOT NULL DEFAULT 'active';
A column with NOT NULL and no DEFAULT cannot be added to a non-empty table — existing rows would have no value to fill in and would violate the constraint. Provide a DEFAULT value, or add the column as nullable first and back-fill the data before adding the constraint.
Drop Column
Removes a column from the table. All existing rows are rewritten without the
dropped column’s value. The column name must exist unless IF EXISTS is used.
ALTER TABLE table_name DROP COLUMN column_name [IF EXISTS];
-- Remove a column (fails if the column does not exist)
ALTER TABLE users DROP COLUMN phone;
-- Remove a column only if it exists (idempotent, safe in migrations)
ALTER TABLE users DROP COLUMN phone IF EXISTS;
Dropping a column is permanent. The data stored in that column is discarded when rows are rewritten and cannot be recovered without a backup.
Dropping a column that is part of a UNIQUE index or a FOREIGN KEY is rejected with an error. Drop the index or constraint first, then drop the column. Dropping a PRIMARY KEY column is not allowed on clustered tables (the PK is the physical storage key).
Modify Column
Changes the data type or nullability of an existing column. All existing rows are rewritten, coercing their stored values to the new type.
ALTER TABLE table_name MODIFY COLUMN column_name new_type [NOT NULL];
-- Widen an integer column to 64 bits (existing values preserved)
ALTER TABLE events MODIFY COLUMN count BIGINT;
-- Convert integers to text (always safe, values become their decimal string)
ALTER TABLE codes MODIFY COLUMN code TEXT;
-- Add a NOT NULL constraint (fails if any row has NULL in that column)
ALTER TABLE orders MODIFY COLUMN status TEXT NOT NULL;
Rules and restrictions:
- Narrowing casts (e.g. BIGINT → INT, TEXT → INT) are applied with strict coercion. If any existing value cannot be represented in the new type, the statement fails and no rows are changed.
- A column that is part of a secondary index (UNIQUE or otherwise) cannot have its type changed. Drop the index first, modify the column, then recreate the index.
- The PRIMARY KEY column’s type cannot be changed on a clustered table.
- Changing nullability from nullable to NOT NULL is allowed only when every existing row has a non-NULL value for that column.
Rename Column
Renames an existing column. This is a catalog-only operation — no rows are rewritten because the positional encoding is not affected by column names.
ALTER TABLE table_name RENAME COLUMN old_name TO new_name;
-- Rename a column
ALTER TABLE users RENAME COLUMN full_name TO display_name;
-- Rename to fix a typo
ALTER TABLE orders RENAME COLUMN shiped_at TO shipped_at;
Rename Table
Renames the table itself. This is a catalog-only operation.
ALTER TABLE old_name RENAME TO new_name;
-- Rename during a refactoring
ALTER TABLE user_profiles RENAME TO profiles;
-- Rename a staging table after a migration
ALTER TABLE orders_import RENAME TO orders;
Rebuild To Clustered
Migrates a legacy heap table that already has PRIMARY KEY metadata into clustered storage.
ALTER TABLE table_name REBUILD;
Example:
-- After opening an older AxiomDB database where `users` is still heap-backed
ALTER TABLE users REBUILD;
Behavior:
- walks the existing PRIMARY KEY index in logical key order
- rebuilds the table into a clustered PRIMARY KEY tree
- rebuilds every non-primary index so it stores clustered PK bookmarks instead
of heap
RecordIds - swaps the catalog metadata atomically at the end of the statement
Common errors:
ALTER TABLE logs REBUILD;
-- ERROR 1105 (HY000): ALTER TABLE REBUILD requires a PRIMARY KEY on 'logs'
ALTER TABLE users REBUILD;
-- ERROR 1105 (HY000): table 'users' is already clustered
The rebuild follows the same pattern as PostgreSQL's CLUSTER and InnoDB's sorted-rebuild ideas: build the new clustered roots first, then swap catalog metadata. AxiomDB adds deferred free of the old heap/index pages so the metadata swap never races with page reclamation.
Not Yet Supported
The following ALTER TABLE forms are planned for Phase 4.22b and later:
- MODIFY COLUMN / ALTER COLUMN — changing a column’s data type
- ADD CONSTRAINT — adding a CHECK, UNIQUE, or FOREIGN KEY after table creation
- DROP CONSTRAINT — removing a named constraint
- Dropping columns that participate in a constraint
TRUNCATE TABLE
Removes all rows from a table without dropping its structure, and resets the
AUTO_INCREMENT counter to 1. The table schema, indexes, and constraints are
preserved.
TRUNCATE TABLE table_name;
-- Wipe a staging table before re-importing
TRUNCATE TABLE import_staging;
-- AUTO_INCREMENT is always reset after TRUNCATE
CREATE TABLE log_events (id INT AUTO_INCREMENT PRIMARY KEY, msg TEXT);
INSERT INTO log_events (msg) VALUES ('start'), ('end'); -- ids: 1, 2
TRUNCATE TABLE log_events;
INSERT INTO log_events (msg) VALUES ('restart'); -- id: 1
TRUNCATE returns Affected { count: 0 } (MySQL convention). See also
TRUNCATE TABLE in the DML reference for a comparison
with DELETE FROM table.
ANALYZE
Refreshes per-column statistics used by the query planner to choose between an index scan and a full table scan.
ANALYZE; -- all tables in the current schema
ANALYZE TABLE table_name; -- specific table, all indexed columns
ANALYZE TABLE table_name (col); -- specific table, one column only
ANALYZE computes exact row_count and NDV (number of distinct non-NULL
values) for each target column by scanning the full table. Results are stored
in the axiom_stats system catalog and are immediately available to the planner.
-- After a bulk import, refresh stats so the planner uses correct selectivity:
INSERT INTO products SELECT * FROM products_staging;
ANALYZE TABLE products;
-- Check a single column after targeted inserts:
ANALYZE TABLE orders (status);
See Index Statistics for how NDV and row_count affect query planning decisions.
DML — Queries and Mutations
DML statements read and modify table data: SELECT, INSERT, UPDATE, and DELETE.
All DML operations participate in the current transaction and are subject to MVCC
isolation.
SELECT
Full Syntax
SELECT [DISTINCT] select_list
FROM table_ref [AS alias]
[JOIN ...]
[WHERE condition]
[GROUP BY column_list]
[HAVING condition]
[ORDER BY column_list [ASC|DESC] [NULLS FIRST|LAST]]
[LIMIT n [OFFSET m]];
Basic Projections
-- All columns
SELECT * FROM users;
-- Specific columns with aliases
SELECT id, email AS user_email, name AS full_name
FROM users;
-- Computed columns
SELECT
name,
price * 1.19 AS price_with_tax,
UPPER(name) AS name_upper
FROM products;
DISTINCT
Removes duplicate rows from the result. Two rows are duplicates if every selected column has the same value (NULL = NULL for this purpose only).
-- All distinct status values in the orders table
SELECT DISTINCT status FROM orders;
-- All distinct (category_id, status) pairs
SELECT DISTINCT category_id, status FROM products ORDER BY category_id;
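This NULL-deduplication rule is standard SQL and can be checked in any conforming engine. A sketch using Python's sqlite3 as a stand-in for AxiomDB, which follows the same rule:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE t (x INT)")
cur.executemany("INSERT INTO t VALUES (?)", [(None,), (None,), (1,)])

# Two NULLs collapse into one row under DISTINCT (NULL = NULL here only)
rows = cur.execute("SELECT DISTINCT x FROM t ORDER BY x").fetchall()
assert rows == [(None,), (1,)]
```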
FROM and JOIN
Simple FROM
SELECT * FROM products;
SELECT p.* FROM products AS p WHERE p.price > 50;
INNER JOIN
Returns only rows where the join condition matches in both tables.
SELECT
u.name,
o.id AS order_id,
o.total,
o.status
FROM users u
INNER JOIN orders o ON o.user_id = u.id
WHERE o.status = 'shipped'
ORDER BY o.placed_at DESC;
LEFT JOIN
Returns all rows from the left table; columns from the right table are NULL when there is no matching row.
-- All users, including those with no orders
SELECT
u.id,
u.name,
COUNT(o.id) AS total_orders
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name
ORDER BY total_orders DESC;
RIGHT JOIN
Returns all rows from the right table; left table columns are NULL on no match. Less common — most RIGHT JOINs can be rewritten as LEFT JOINs by swapping tables.
SELECT p.name, SUM(oi.quantity) AS total_sold
FROM order_items oi
RIGHT JOIN products p ON p.id = oi.product_id
GROUP BY p.id, p.name;
FULL OUTER JOIN
Returns all rows from both tables. Matched rows are joined normally.
Unmatched rows from either side are padded with NULL on the missing side.
FULL OUTER JOIN is an AxiomDB extension over the MySQL wire protocol: MySQL itself does not support
FULL OUTER JOIN. Clients connecting to AxiomDB via the MySQL wire protocol can use it, but standard MySQL clients may not send it.
-- Audit: find users with no orders AND orders with no valid user
SELECT
u.id AS user_id,
u.name AS user_name,
o.id AS order_id,
o.total
FROM users u
FULL OUTER JOIN orders o ON u.id = o.user_id
ORDER BY u.id, o.id;
| user_id | user_name | order_id | total |
|---|---|---|---|
| 1 | Alice | 10 | 100 |
| 1 | Alice | 11 | 200 |
| 2 | Bob | 12 | 50 |
| 3 | Carol | NULL | NULL |
| NULL | NULL | 13 | 300 |
Both FULL JOIN and FULL OUTER JOIN are accepted.
ON vs WHERE semantics:
- ON predicates are evaluated before null-extension. Rows that do not satisfy ON are treated as unmatched and receive NULLs.
- WHERE predicates run after the full join is materialized. Adding WHERE u.id IS NOT NULL removes unmatched right rows from the result.
-- ON vs WHERE: only keep rows where the user side is not NULL
SELECT u.id, o.id
FROM users u
FULL OUTER JOIN orders o ON u.id = o.user_id
WHERE u.id IS NOT NULL; -- removes the (NULL, 13) row
Nullability: In SELECT * over a FULL OUTER JOIN, all columns from both
tables are marked nullable even if the catalog defines them as NOT NULL,
because either side can be null-extended.
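For tooling that must stay within plain MySQL syntax, the classic workaround is to emulate FULL OUTER JOIN with a LEFT JOIN unioned with the right side's unmatched rows. A sketch of the audit query above, using Python's sqlite3 as a stand-in engine:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE users (id INT, name TEXT)")
cur.execute("CREATE TABLE orders (id INT, user_id INT, total INT)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob"), (3, "Carol")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 100), (11, 1, 200), (12, 2, 50), (13, 99, 300)])

rows = cur.execute("""
    SELECT u.id, u.name, o.id, o.total        -- all users, with matches
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
    UNION ALL
    SELECT u.id, u.name, o.id, o.total        -- orders with no valid user
    FROM orders o LEFT JOIN users u ON u.id = o.user_id
    WHERE u.id IS NULL
""").fetchall()

assert (3, "Carol", None, None) in rows   # user with no orders survives
assert (None, None, 13, 300) in rows      # orphaned order survives
assert len(rows) == 5
```

The result matches the FULL OUTER JOIN table above row for row.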
CROSS JOIN
Cartesian product — every row from the left table combined with every row from
the right table. Use with care: m × n rows.
-- Generate all combinations of size and color for a product grid
SELECT sizes.label AS size, colors.label AS color
FROM sizes
CROSS JOIN colors
ORDER BY sizes.sort_order, colors.sort_order;
Multi-Table JOIN
SELECT
u.name AS customer,
p.name AS product,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN users u ON u.id = o.user_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
WHERE o.status = 'delivered'
ORDER BY o.placed_at DESC, p.name;
WHERE
Filters rows before aggregation. Accepts any boolean expression.
-- Equality and comparison
SELECT * FROM products WHERE price > 100 AND stock > 0;
-- NULL check
SELECT * FROM users WHERE deleted_at IS NULL;
SELECT * FROM orders WHERE shipped_at IS NOT NULL;
-- BETWEEN (inclusive on both ends)
SELECT * FROM orders
WHERE placed_at BETWEEN '2026-01-01' AND '2026-03-31';
-- IN list
SELECT * FROM orders WHERE status IN ('pending', 'paid', 'shipped');
-- LIKE pattern matching (% = any sequence, _ = exactly one character)
SELECT * FROM users WHERE email LIKE '%@example.com';
SELECT * FROM products WHERE name LIKE 'USB-_';
-- NOT variants
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
SELECT * FROM products WHERE name NOT LIKE 'Test%';
Subqueries
A subquery is a SELECT statement nested inside another statement. AxiomDB supports
five subquery forms, each with full NULL semantics identical to PostgreSQL and MySQL.
Scalar Subqueries
A scalar subquery appears anywhere an expression is valid (SELECT list, WHERE, HAVING,
ORDER BY). It must return exactly one column. If it returns zero rows, the result is
NULL. If it returns more than one row, AxiomDB raises CardinalityViolation
(SQLSTATE 21000).
-- Compare each product price against the overall average
SELECT
name,
price,
price - (SELECT AVG(price) FROM products) AS diff_from_avg
FROM products
ORDER BY diff_from_avg DESC;
-- Find the most recently placed order date
SELECT * FROM orders
WHERE placed_at = (SELECT MAX(placed_at) FROM orders);
-- Use a scalar subquery in HAVING
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > (SELECT AVG(cnt) FROM (SELECT COUNT(*) AS cnt FROM orders GROUP BY user_id) AS sub);
If the subquery returns more than one row, AxiomDB raises:
ERROR 21000: subquery must return exactly one row, but returned 3 rows
Use LIMIT 1 or a unique WHERE predicate to guarantee a single row.
IN Subquery
expr [NOT] IN (SELECT col FROM ...) tests whether a value appears in the set of
values produced by the subquery.
-- Orders for users who have placed more than 5 orders total
SELECT * FROM orders
WHERE user_id IN (
SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 5
);
-- Products never sold
SELECT * FROM products
WHERE id NOT IN (
SELECT DISTINCT product_id FROM order_items
);
NULL semantics — fully consistent with the SQL standard:
| Value in outer expr | Subquery result | Result |
|---|---|---|
| 'Alice' | contains 'Alice' | TRUE |
| 'Alice' | does not contain 'Alice', no NULLs | FALSE |
| 'Alice' | does not contain 'Alice', contains NULL | NULL |
| NULL | any non-empty set | NULL |
| NULL | empty set | FALSE |
The third row is the subtle case: x NOT IN (subquery with NULLs) returns NULL,
not FALSE. This means NOT IN combined with a subquery that may produce NULLs
can silently exclude rows. A safe alternative is NOT EXISTS.
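Since these semantics follow the SQL standard, the trap can be reproduced in any conforming engine. A sketch using Python's sqlite3:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# 2 NOT IN (1, NULL): (2=1) OR (2=NULL) is NULL, so the negation is NULL too
assert cur.execute("SELECT 2 NOT IN (1, NULL)").fetchone()[0] is None

# A NULL in the set never hides a genuine match
assert cur.execute("SELECT 1 IN (1, NULL)").fetchone()[0] == 1

# A NULL result in WHERE filters the row out — the "silently excluded" case
assert cur.execute("SELECT 1 WHERE 2 NOT IN (1, NULL)").fetchone() is None
```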
EXISTS / NOT EXISTS
[NOT] EXISTS (SELECT ...) tests whether the subquery produces at least one row.
The result is always TRUE or FALSE — never NULL.
-- Users who have at least one paid order
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.user_id = u.id AND o.status = 'paid'
);
-- Products with no associated order items
SELECT * FROM products p
WHERE NOT EXISTS (
SELECT 1 FROM order_items oi WHERE oi.product_id = p.id
);
The select list inside an EXISTS subquery does not matter — SELECT 1, SELECT *,
and SELECT id all behave identically. The engine only checks for row existence.
Correlated Subqueries
A correlated subquery references columns from the outer query. AxiomDB re-executes the subquery for each outer row, substituting the current outer column values.
-- For each order, fetch the user's name (correlated scalar subquery in SELECT list)
SELECT
o.id,
o.total,
(SELECT u.name FROM users u WHERE u.id = o.user_id) AS customer_name
FROM orders o;
-- Orders whose total exceeds the average total for that user (correlated in WHERE)
SELECT * FROM orders o
WHERE o.total > (
SELECT AVG(total) FROM orders WHERE user_id = o.user_id
);
-- Active products with above-average stock in their category
SELECT * FROM products p
WHERE p.stock > (
SELECT AVG(stock) FROM products WHERE category_id = p.category_id
);
Correlated subqueries with large outer result sets can be slow (O(n) re-executions). For performance-critical paths, rewrite them as JOINs with aggregation.
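A quick way to validate such a rewrite is to run both forms against the same data and compare. A sketch using Python's sqlite3 as a stand-in engine, rewriting the per-user-average query above as a JOIN against a derived table:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE orders (id INT, user_id INT, total REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 10), (2, 1, 30), (3, 2, 5), (4, 2, 5)])

# Correlated form: the subquery runs once per outer row
correlated = cur.execute("""
    SELECT id FROM orders o
    WHERE o.total > (SELECT AVG(total) FROM orders WHERE user_id = o.user_id)
""").fetchall()

# Rewrite: compute all per-user averages once, then join
rewritten = cur.execute("""
    SELECT o.id
    FROM orders o
    JOIN (SELECT user_id, AVG(total) AS avg_total
          FROM orders GROUP BY user_id) a ON a.user_id = o.user_id
    WHERE o.total > a.avg_total
""").fetchall()

assert correlated == rewritten == [(2,)]  # only order 2 beats its user's average
```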
Derived Tables (FROM Subquery)
A subquery in the FROM clause is called a derived table. It must have an alias.
AxiomDB materializes the derived table result in memory before executing the outer query.
-- Top spenders, computed as a subquery and then filtered
SELECT customer_name, total_spent
FROM (
SELECT u.name AS customer_name, SUM(o.total) AS total_spent
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.status = 'delivered'
GROUP BY u.id, u.name
) AS spending
WHERE total_spent > 500
ORDER BY total_spent DESC;
-- Percentile bucketing: compute rank in a subquery, filter in outer
SELECT *
FROM (
SELECT
id,
name,
price,
RANK() OVER (ORDER BY price DESC) AS price_rank
FROM products
) AS ranked
WHERE price_rank <= 10;
AxiomDB resolves IN (subquery) exactly as PostgreSQL and MySQL do: a non-matching lookup against a set that contains NULL returns NULL, not FALSE. This matches ISO SQL:2016 and avoids the "missing row" trap that catches developers when NOT IN is used against a nullable foreign key column. Every subquery form (scalar, IN, EXISTS, correlated, derived table) follows the same rules as PostgreSQL 15.
GROUP BY and HAVING
GROUP BY collapses rows with the same values in the specified columns into a single
output row. Aggregate functions operate over each group.
The executor chooses its aggregation strategy automatically, using O(1) memory per group. Unlike PostgreSQL, which requires a separate GroupAggregate plan node, AxiomDB selects the strategy transparently at execution time.
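One plausible strategy is a hash aggregate: a dictionary keyed by the GROUP BY columns, holding one running accumulator per group. A minimal sketch of that idea (illustrative only, not AxiomDB's actual executor code):

```python
from collections import defaultdict

def group_by_sum(rows, key_col, val_col):
    """Hash aggregation: one accumulator per distinct key, single pass."""
    groups = defaultdict(lambda: [0, 0.0])  # key -> [COUNT(*), SUM(val)]
    for row in rows:
        acc = groups[row[key_col]]
        acc[0] += 1
        acc[1] += row[val_col]
    return dict(groups)

orders = [(1, 100.0), (1, 200.0), (2, 50.0)]  # (user_id, total)
assert group_by_sum(orders, 0, 1) == {1: [2, 300.0], 2: [1, 50.0]}
```

Memory grows with the number of distinct groups, not the number of input rows.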
-- Orders per user
SELECT user_id, COUNT(*) AS order_count, SUM(total) AS revenue
FROM orders
GROUP BY user_id
ORDER BY revenue DESC;
-- Monthly revenue
SELECT
DATE_TRUNC('month', placed_at) AS month,
COUNT(*) AS orders,
SUM(total) AS revenue,
AVG(total) AS avg_order_value
FROM orders
WHERE status != 'cancelled'
GROUP BY DATE_TRUNC('month', placed_at)
ORDER BY month;
HAVING filters groups after aggregation (analogous to WHERE for rows).
-- Only users with more than 5 orders
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5
ORDER BY order_count DESC;
-- Only categories with average price above 50
SELECT category_id, AVG(price) AS avg_price
FROM products
WHERE deleted_at IS NULL
GROUP BY category_id
HAVING AVG(price) > 50;
ORDER BY
Sorts the result. Multiple columns are sorted left to right.
-- Descending by total, then ascending by name as tiebreaker
SELECT user_id, SUM(total) AS revenue
FROM orders
GROUP BY user_id
ORDER BY revenue DESC, user_id ASC;
NULLS FIRST / NULLS LAST
Controls where NULL values appear in the sort order.
-- Show NULL shipped_at rows at the bottom (unshipped orders last)
SELECT id, total, shipped_at
FROM orders
ORDER BY shipped_at ASC NULLS LAST;
-- Show most recent shipments first; unshipped at top
SELECT id, total, shipped_at
FROM orders
ORDER BY shipped_at DESC NULLS FIRST;
Default behavior: ASC sorts NULL last; DESC sorts NULL first (same as PostgreSQL).
LIMIT and OFFSET
-- First 10 rows
SELECT * FROM products ORDER BY name LIMIT 10;
-- Rows 11-20 (page 2 with page size 10)
SELECT * FROM products ORDER BY name LIMIT 10 OFFSET 10;
-- Common pagination pattern
SELECT * FROM products
ORDER BY created_at DESC
LIMIT 20 OFFSET 40; -- page 3 (0-indexed) of 20 items per page
For large offsets (> 10,000), consider keyset pagination instead:
WHERE id > :last_seen_id ORDER BY id LIMIT 20
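Both styles return identical pages; keyset simply replaces the skip-and-discard with an index seek on the last key seen. A comparison sketch using Python's sqlite3 as a stand-in engine (page size 2 for brevity):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO products VALUES (?, ?)",
                [(i, f"p{i}") for i in range(1, 8)])

# OFFSET pagination: page 2 with page size 2 (scans and discards 2 rows)
offset_page = cur.execute(
    "SELECT id FROM products ORDER BY id LIMIT 2 OFFSET 2").fetchall()

# Keyset pagination: seek past the last id seen on page 1 (id = 2)
keyset_page = cur.execute(
    "SELECT id FROM products WHERE id > ? ORDER BY id LIMIT 2", (2,)).fetchall()

assert offset_page == keyset_page == [(3,), (4,)]
```

The discard cost of OFFSET grows linearly with the page number; the keyset seek stays constant.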
INSERT
Heap inserts locate the table's tail page through a cached hint (HeapAppendHint). Repeated INSERTs in the same session no longer walk the
full chain from the root page on every row — the tail is resolved in one page read
and self-healed on mismatch. This eliminates the O(N²) degradation seen at 100K+
rows in a single session.
INSERT … VALUES
Tables whose schema has an explicit PRIMARY KEY now use clustered storage for
SQL-visible INSERT, SELECT, UPDATE, and DELETE. The clustered SQL path now supports:
- single-row VALUES
- multi-row VALUES
- INSERT ... SELECT
- AUTO_INCREMENT
- explicit transactions and savepoints
- SELECT full scans over clustered leaves
- SELECT PK point lookups and PK range scans
- SELECT secondary lookups through PK bookmarks stored in the secondary key
- UPDATE in-place rewrite when the row still fits in the owning leaf
- UPDATE relocation fallback when the row grows and must be rewritten structurally
- UPDATE through PK predicates or secondary bookmark probes with transaction rollback/savepoint safety
- DELETE through PK predicates, PK ranges, secondary bookmark probes, or full clustered scans
- DELETE rollback/savepoint restore through exact clustered row images in WAL
Current clustered boundary after 39.18:
- clustered DELETE is still delete-mark first, and clustered VACUUM table performs the later physical purge
- clustered VACUUM table now frees overflow chains and dead secondary bookmark entries
- clustered child-table foreign-key enforcement still remains future work
The clustered layout tracks SQLite's WITHOUT ROWID tables more closely than a heap-first compatibility layer would.
When a table has an AUTO_INCREMENT column, omit it from the column list and
AxiomDB generates the next sequential ID automatically. Use LAST_INSERT_ID()
(or the PostgreSQL alias lastval()) immediately after the INSERT to retrieve
the generated value.
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL
);
-- Single row — id is generated automatically
INSERT INTO users (name) VALUES ('Alice');
-- id=1
SELECT LAST_INSERT_ID(); -- returns 1
For multi-row INSERT, LAST_INSERT_ID() returns the ID generated for the
first row of the batch (MySQL semantics). Subsequent rows receive
consecutive IDs.
For bulk loading with many consecutive INSERT ... VALUES statements,
wrap them in BEGIN ... COMMIT. AxiomDB stages consecutive INSERTs
for the same table inside the transaction and flushes them together on
COMMIT or the next barrier statement.
INSERT INTO users (name) VALUES ('Bob'), ('Carol'), ('Dave');
-- ids: 2, 3, 4
SELECT LAST_INSERT_ID(); -- returns 2 (first of the batch)
Supplying an explicit non-NULL value in the AUTO_INCREMENT column bypasses the sequence and does not advance it.
INSERT INTO users (id, name) VALUES (100, 'Eve');
-- id=100; sequence not advanced; next LAST_INSERT_ID() still returns 2
The same AUTO_INCREMENT contract now applies to clustered explicit-PK tables:
AxiomDB bootstraps the next value by scanning the clustered rows for the
current maximum instead of falling back to heap metadata.
See Expressions — Session Functions for
full LAST_INSERT_ID() / lastval() semantics.
-- Single row
INSERT INTO users (name, email, age)
VALUES ('Alice', 'alice@example.com', 30);
-- Multiple rows in one statement (more efficient than individual INSERTs)
INSERT INTO products (name, price, stock) VALUES
('Keyboard', 49.99, 100),
('Mouse', 29.99, 200),
('Monitor', 299.99, 50);
INSERT … DEFAULT VALUES
Inserts a single row using all column defaults. Useful when every column has a default.
CREATE TABLE audit_events (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
event_type TEXT NOT NULL DEFAULT 'unknown'
);
INSERT INTO audit_events DEFAULT VALUES;
-- Row: id=1, created_at=<now>, event_type='unknown'
INSERT … SELECT
Inserts rows generated by a SELECT statement. Useful for bulk copies and migrations.
-- Copy all active users to an archive table
INSERT INTO users_archive (id, email, name, created_at)
SELECT id, email, name, created_at
FROM users
WHERE deleted_at IS NOT NULL;
-- Compute and store aggregates
INSERT INTO monthly_revenue (month, total)
SELECT
DATE_TRUNC('month', placed_at),
SUM(total)
FROM orders
WHERE status = 'delivered'
GROUP BY 1;
UPDATE
Modifies existing rows. All matching rows are updated in a single statement.
UPDATE table_name
SET column = expression [, column = expression ...]
[WHERE condition];
-- Mark a specific order as shipped
UPDATE orders
SET status = 'shipped', shipped_at = CURRENT_TIMESTAMP
WHERE id = 42;
-- Apply a 10% discount to all products in a category
UPDATE products
SET price = price * 0.90
WHERE category_id = 5 AND deleted_at IS NULL;
-- Reset all pending orders older than 7 days to cancelled
UPDATE orders
SET status = 'cancelled'
WHERE status = 'pending'
AND placed_at < CURRENT_TIMESTAMP - INTERVAL '7 days';
An UPDATE without a WHERE clause updates every row in the table. This is rarely what you want. Always double-check before running unbounded updates in production.
DELETE
Removes rows from a table.
DELETE FROM table_name [WHERE condition];
-- Delete a specific row
DELETE FROM sessions WHERE id = 'abc123';
-- Delete all expired sessions
DELETE FROM sessions WHERE expires_at < CURRENT_TIMESTAMP;
-- Soft delete pattern (prefer UPDATE to mark rows inactive)
UPDATE users SET deleted_at = CURRENT_TIMESTAMP WHERE id = 7;
-- Then filter: SELECT * FROM users WHERE deleted_at IS NULL;
DELETE FROM t without a WHERE clause uses a root-rotation fast path
instead of per-row B-Tree deletes. New empty heap and index roots are allocated,
the catalog is updated atomically inside the transaction, and old pages are freed
only after WAL fsync confirms commit durability. This eliminates the 10,000× slowdown
that previously occurred when a table had any index (PK, UNIQUE, or secondary).
The operation is fully transactional: ROLLBACK restores original roots.
When parent FK references exist, DELETE FROM t keeps the row-by-row path so
RESTRICT/CASCADE/SET NULL FK enforcement still fires correctly.
DELETE ... WHERE col = value or WHERE col > lo uses an available index
to discover candidate rows instead of scanning the full heap. Unlike SELECT,
which may reject an index when selectivity is too low, the planner always
prefers the index for DELETE, since avoiding a heap scan pays off even when many rows match.
The full WHERE predicate is rechecked on fetched rows before deletion.
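The index-then-recheck pattern can be sketched as follows. This is an illustrative model of the behavior described above, not AxiomDB's executor code; the table, index, and function names are invented.

```python
# Sketch of the index-assisted DELETE path: a secondary index narrows the
# candidate set, then the full WHERE predicate is rechecked per row before
# the row is actually deleted.

rows = {  # row_id -> row
    1: {"status": "pending", "total": 10},
    2: {"status": "pending", "total": 500},
    3: {"status": "paid",    "total": 20},
}
# Secondary index on status: value -> set of row ids
status_index = {"pending": {1, 2}, "paid": {3}}

def delete_where(index_value, predicate):
    """DELETE ... WHERE status = index_value AND predicate(row)."""
    candidates = status_index.get(index_value, set())
    deleted = []
    for rid in sorted(candidates):
        if predicate(rows[rid]):      # recheck the full predicate
            deleted.append(rid)
    for rid in deleted:
        del rows[rid]
        status_index[index_value].discard(rid)
    return len(deleted)

# DELETE FROM orders WHERE status = 'pending' AND total > 100;
n = delete_where("pending", lambda r: r["status"] == "pending" and r["total"] > 100)
print(n)   # 1 — only row 2 satisfied the full predicate
```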
TRUNCATE TABLE
Removes all rows from a table and resets its AUTO_INCREMENT counter to 1.
The table structure, indexes, and constraints are preserved.
TRUNCATE TABLE table_name;
-- Empty a staging table before a fresh import
TRUNCATE TABLE import_staging;
-- After truncate, AUTO_INCREMENT restarts from 1
CREATE TABLE counters (id INT AUTO_INCREMENT PRIMARY KEY, label TEXT);
INSERT INTO counters (label) VALUES ('a'), ('b'); -- ids: 1, 2
TRUNCATE TABLE counters;
INSERT INTO counters (label) VALUES ('c'); -- id: 1 (reset)
TRUNCATE TABLE returns Affected { count: 0 }, matching MySQL convention.
TRUNCATE vs DELETE — when to use each:
| | DELETE FROM t | TRUNCATE TABLE t |
|---|---|---|
| Rows removed | All (without WHERE) | All |
| WHERE clause | Supported | Not supported |
| AUTO_INCREMENT | Not reset | Reset to 1 |
| Rows affected | Returns actual count | Returns 0 |
| FK parent table | Row-by-row (enforces FK) | Fails if child FKs exist |
| Typical use | Conditional deletes | Full table wipe |
TRUNCATE TABLE fails with an error if any FK constraint references the table as
the parent. Delete or truncate child tables first, then truncate the parent.
Both DELETE FROM t (no WHERE) and TRUNCATE TABLE t use the same bulk-empty
root-rotation machinery internally and are fully transactional.
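The root-rotation idea described above can be sketched as a pointer swap. This is a hypothetical illustration of the concept, with invented names; AxiomDB's actual storage code additionally journals the change in the WAL and defers freeing old pages until after commit.

```python
# Sketch of bulk-empty "root rotation": instead of deleting rows one by one,
# the table's heap and index roots are swapped to fresh empty roots inside
# the transaction; ROLLBACK simply restores the original roots.

class Table:
    def __init__(self):
        self.heap_root = {"rows": ["r1", "r2", "r3"]}
        self.index_root = {"keys": [1, 2, 3]}

def truncate(table):
    """Swap in empty roots; return an undo closure (COMMIT drops it)."""
    old_heap, old_index = table.heap_root, table.index_root
    table.heap_root = {"rows": []}     # allocate new empty roots
    table.index_root = {"keys": []}
    def rollback():
        table.heap_root, table.index_root = old_heap, old_index
    return rollback

t = Table()
undo = truncate(t)
print(len(t.heap_root["rows"]))   # 0 — table is empty inside the txn
undo()                            # ROLLBACK restores the original roots
print(len(t.heap_root["rows"]))   # 3
```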
Session Variables
Session variables hold connection-scoped state. Read them with SELECT @@name and
change them with SET name = value.
Reading session variables
SELECT @@autocommit; -- 1 (autocommit on) or 0 (autocommit off)
SELECT @@in_transaction; -- 1 inside an active transaction, 0 otherwise
SELECT @@version; -- '8.0.36-AxiomDB-0.1.0'
SELECT @@character_set_client; -- 'utf8mb4'
SELECT @@transaction_isolation; -- 'REPEATABLE-READ'
Supported variables
| Variable | Default | Description |
|---|---|---|
@@autocommit | 1 | 1 = each statement auto-committed; 0 = explicit COMMIT required |
@@axiom_compat | 'standard' | Compatibility mode — controls default session collation (see AXIOM_COMPAT) |
@@collation | 'binary' | Executor-visible text semantics — binary or es (see AXIOM_COMPAT) |
@@in_transaction | 0 | 1 when inside an active transaction, 0 otherwise |
@@on_error | 'rollback_statement' | How statement errors affect the transaction (see ON_ERROR) |
@@version | '8.0.36-AxiomDB-0.1.0' | Server version (MySQL 8 compatible format) |
@@version_comment | 'AxiomDB' | Server variant |
@@character_set_client | 'utf8mb4' | Client character set |
@@character_set_results | 'utf8mb4' | Result character set |
@@collation_connection | 'utf8mb4_general_ci' | Connection collation |
@@max_allowed_packet | 67108864 | Maximum packet size (64 MB) |
@@sql_mode | 'STRICT_TRANS_TABLES' | Active SQL mode (see Strict Mode) |
@@strict_mode | 'ON' | AxiomDB strict coercion flag (alias for STRICT_TRANS_TABLES in sql_mode) |
@@transaction_isolation | 'REPEATABLE-READ' | Isolation level |
Changing session variables
-- Switch to manual transaction mode (used by SQLAlchemy, Django ORM, etc.)
SET autocommit = 0;
SET autocommit = 1; -- restore
-- Character set (accepted for ORM compatibility, utf8mb4 is always used internally)
SET NAMES 'utf8mb4';
SET character_set_client = 'utf8mb4';
-- Control coercion strictness (see Strict Mode below)
SET strict_mode = OFF;
SET sql_mode = '';
@@in_transaction — transaction state check
SELECT @@in_transaction; -- 0 — no transaction active
INSERT INTO t VALUES (1); -- starts implicit txn when autocommit=0
SELECT @@in_transaction; -- 1 — inside transaction
COMMIT;
SELECT @@in_transaction; -- 0 — transaction closed
Use @@in_transaction to verify transaction state before issuing a COMMIT or
ROLLBACK. This avoids the warning generated when COMMIT is called with no
active transaction.
AXIOM_COMPAT and collation
@@axiom_compat controls the high-level compatibility behavior of the session.
@@collation controls how text values are compared, sorted, and grouped.
SET AXIOM_COMPAT = 'mysql'; -- CI+AI text semantics (default collation = 'es')
SET AXIOM_COMPAT = 'postgresql'; -- exact binary text semantics
SET AXIOM_COMPAT = 'standard'; -- default AxiomDB behavior (binary)
SET AXIOM_COMPAT = DEFAULT; -- reset to 'standard'
SET collation = 'es'; -- explicit CI+AI fold for this session
SET collation = 'binary'; -- explicit exact byte order
SET collation = DEFAULT; -- restore compat-derived default
binary collation (default)
Exact byte-order string comparison — current AxiomDB default:
- 'a' != 'A', 'a' != 'á'
- LIKE is case-sensitive and accent-sensitive
- GROUP BY, DISTINCT, ORDER BY, and MIN/MAX(TEXT) all use raw byte order
es collation — CI+AI fold
A lightweight session-level CI+AI fold: NFC normalize → lowercase → strip combining accent marks. No ICU / CLDR dependency.
- 'Jose' = 'JOSE' = 'José' compare equal
- LIKE 'jos%' matches José
- GROUP BY, DISTINCT, and COUNT(DISTINCT ...) collapse accent/case variants into one group
- ORDER BY sorts by folded text first, with raw text as a tie-break for determinism
- MIN/MAX(TEXT) and GROUP_CONCAT(DISTINCT/ORDER BY ...) respect the fold
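The fold pipeline described above (NFC normalize → lowercase → strip combining accent marks) can be reproduced with Python's standard library. This is a sketch of the semantics, not AxiomDB's Rust implementation.

```python
# CI+AI fold as described for the 'es' collation: NFC normalize, lowercase,
# decompose to NFD so accents become separate combining marks, drop the marks.
import unicodedata

def es_fold(s: str) -> str:
    s = unicodedata.normalize("NFC", s).lower()
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(es_fold("José"))                      # 'jose'
print(es_fold("Jose") == es_fold("JOSÉ"))   # True

# GROUP BY under the fold collapses case/accent variants into one group:
names = ["José", "jose", "JOSE"]
groups = {es_fold(n) for n in names}
print(groups)                               # {'jose'}
```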
-- Binary (default): José and jose are different rows
SELECT name FROM users GROUP BY name;
-- → 'José', 'jose', 'JOSE'
-- Es: all three fold to "jose" — one group
SET AXIOM_COMPAT = 'mysql';
SELECT name FROM users GROUP BY name;
-- → 'José' (or whichever variant appears first)
-- Explicit collation independent of compat mode:
SET collation = 'es';
SELECT * FROM products WHERE name = 'widget'; -- matches Widget, WIDGET, wídget
Index safety: When @@collation = 'es', AxiomDB automatically falls back from text
index lookups to full table scans for correctness. Binary-ordered B-Tree keys do not match
es-folded predicates, so using the index would silently miss rows. Non-text indexes
(INT, BIGINT, DATE, etc.) are unaffected.
Note:
@@collation and @@collation_connection are separate variables. @@collation_connection is the transport charset (set during handshake or via SET NAMES). @@collation is the executor-visible text-comparison behavior added by AXIOM_COMPAT.
Full layered collation (per-database, per-column, ICU locale) is planned for Phase 13.13.
ON_ERROR
@@on_error controls what happens to the current transaction when a statement
fails. It applies to all pipeline stages: parse errors, semantic errors, and
executor errors.
SET on_error = 'rollback_statement'; -- default
SET on_error = 'rollback_transaction';
SET on_error = 'savepoint';
SET on_error = 'ignore';
SET on_error = DEFAULT; -- reset to rollback_statement
Both quoted strings and bare identifiers are accepted:
SET on_error = rollback_statement; -- same as 'rollback_statement'
Modes
rollback_statement (default) — When a statement fails inside an active
transaction, only that statement’s writes are rolled back. The transaction stays
open. This matches MySQL’s statement-level rollback behavior.
BEGIN;
INSERT INTO t VALUES (1); -- ok
INSERT INTO t VALUES (1); -- ERROR: duplicate key
-- transaction still active, id=1 is the only write that will commit
INSERT INTO t VALUES (2); -- ok
COMMIT; -- commits id=1 and id=2
rollback_transaction — When any statement fails inside an active transaction,
the entire transaction is rolled back immediately. @@in_transaction becomes 0.
SET on_error = 'rollback_transaction';
BEGIN;
INSERT INTO t VALUES (1); -- ok
INSERT INTO t VALUES (1); -- ERROR: duplicate key → whole txn rolled back
SELECT @@in_transaction; -- 0 — transaction is gone
PostgreSQL, by contrast, keeps the failed transaction open and returns ERROR: current transaction is aborted until the client sends ROLLBACK. AxiomDB's rollback_transaction uses eager rollback instead: the transaction is closed immediately on error, so the client starts fresh without needing an explicit ROLLBACK.
savepoint — Same as rollback_statement when a transaction is already
active. When autocommit = 0, the key difference appears on the first DML
in an implicit transaction: savepoint preserves the implicit transaction after
a failing first DML, while rollback_statement closes it.
SET autocommit = 0;
SET on_error = 'savepoint';
INSERT INTO t VALUES (999); -- fails (dup key)
SELECT @@in_transaction; -- 1 — implicit txn stays open
INSERT INTO t VALUES (1); -- ok, continues in the same txn
COMMIT;
ignore — Ignorable SQL errors (parse errors, semantic errors, constraint
violations, type mismatches) are converted to session warnings and the statement
is reported as success. Non-ignorable errors (I/O failures, WAL errors, storage
corruption) still return ERR; if one happens inside an active transaction,
AxiomDB eagerly rolls that transaction back before returning the error.
SET on_error = 'ignore';
INSERT INTO t VALUES (1); -- ok
INSERT INTO t VALUES (1); -- duplicate key → silently ignored
SHOW WARNINGS; -- shows code 1062 + original message
INSERT INTO t VALUES (2); -- ok, continues
COMMIT; -- commits id=1 and id=2
In a multi-statement COM_QUERY, ignore continues executing later statements
after an ignored error.
-- Single COM_QUERY with three statements:
INSERT INTO t VALUES (1); INSERT INTO t VALUES (1); INSERT INTO t VALUES (2);
-- First succeeds, second is ignored (dup), third succeeds.
-- Only the ignored statement's OK packet carries warning_count > 0.
Inspecting the current mode
SELECT @@on_error; -- 'rollback_statement'
SELECT @@session.on_error; -- same
SHOW VARIABLES LIKE 'on_error'; -- on_error | rollback_statement
COM_RESET_CONNECTION resets @@on_error to rollback_statement.
Strict Mode
AxiomDB operates in strict mode by default. In strict mode, an INSERT or
UPDATE that cannot coerce a value to the column’s declared type returns an error
immediately (SQLSTATE 22018). This prevents silent data corruption.
CREATE TABLE products (name TEXT, stock INT);
-- Strict mode (default): error on bad coercion
INSERT INTO products VALUES ('Widget', 'abc');
-- ERROR 22018: cannot coerce 'abc' (Text) to INT
To enable permissive mode, disable strict mode for the session:
SET strict_mode = OFF;
-- or equivalently:
SET sql_mode = '';
In permissive mode, AxiomDB first tries the strict coercion. If it fails, it
falls back to a best-effort conversion (e.g. '42abc' → 42, 'abc' → 0),
stores the result, and emits warning 1265 instead of returning an error:
SET strict_mode = OFF;
CREATE TABLE products (name TEXT, stock INT);
INSERT INTO products VALUES ('Widget', '99abc');
-- Succeeds — stock stored as 99; warning emitted
SHOW WARNINGS;
-- Level Code Message
-- ─────────────────────────────────────────────────────────────────────
-- Warning 1265 Data truncated for column 'stock' at row 1
For multi-row INSERT, the row number in warning 1265 is 1-based and identifies the specific row that triggered the fallback:
INSERT INTO products VALUES ('A', '10'), ('B', '99x'), ('C', '30');
SHOW WARNINGS;
-- Warning 1265 Data truncated for column 'stock' at row 2
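The strict-then-fallback coercion described above can be sketched in Python. This is an illustrative model only — the regex-based prefix parse and the function name are assumptions, not AxiomDB's actual coercion code.

```python
# Strict mode: a bad coercion is an immediate error (SQLSTATE 22018).
# Permissive mode: fall back to a best-effort numeric-prefix conversion
# ('42abc' -> 42, 'abc' -> 0) and attach a 1265-style warning.
import re

def coerce_int(text, strict=True):
    """Return (value, warning); raise ValueError in strict mode."""
    try:
        return int(text), None                      # strict coercion
    except ValueError:
        if strict:
            raise ValueError(f"22018: cannot coerce {text!r} to INT")
        m = re.match(r"\s*[+-]?\d+", text)          # best-effort prefix
        value = int(m.group()) if m else 0
        return value, f"1265: Data truncated ({text!r} -> {value})"

print(coerce_int("42"))                   # (42, None)
print(coerce_int("99abc", strict=False))  # (99, '1265: ...')
print(coerce_int("abc", strict=False))    # (0, '1265: ...')
```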
Re-enable strict mode at any time:
SET strict_mode = ON;
-- or equivalently:
SET sql_mode = 'STRICT_TRANS_TABLES';
SET strict_mode = DEFAULT also restores the server default (ON).
Some MySQL clients and ORMs set sql_mode = '' at connection time to get
MySQL 5 permissive behavior. AxiomDB supports this pattern:
SET sql_mode = '' disables strict mode for that connection. Use
SHOW WARNINGS after bulk loads to audit truncated values.
SHOW WARNINGS
After any statement that completes with warnings, query the warning list:
-- Warning from no-op COMMIT
COMMIT; -- no active transaction — emits warning 1592
SHOW WARNINGS;
-- Level Code Message
-- ───────────────────────────────────────────────
-- Warning 1592 There is no active transaction
-- Warning from permissive coercion (strict_mode = OFF)
SET strict_mode = OFF;
INSERT INTO products VALUES ('Widget', '99abc');
SHOW WARNINGS;
-- Level Code Message
-- ─────────────────────────────────────────────────────────────────────
-- Warning 1265 Data truncated for column 'stock' at row 1
SHOW WARNINGS returns the warnings from the most recent statement only. The
list is cleared before each new statement executes.
| Warning Code | Condition |
|---|---|
1265 | Permissive coercion fallback: value was truncated/converted to fit the column type |
1592 | COMMIT or ROLLBACK issued with no active transaction |
SHOW TABLES
Lists all tables in the current schema (or a named schema).
SHOW TABLES;
SHOW TABLES FROM schema_name;
The result set has a single column named Tables_in_<schema>:
SHOW TABLES;
-- Tables_in_public
-- ────────────────
-- users
-- orders
-- products
-- order_items
SHOW COLUMNS / DESCRIBE
Returns the column definitions of a table.
SHOW COLUMNS FROM table_name;
DESCRIBE table_name;
DESC table_name; -- shorthand
All three forms are equivalent. The result has six columns:
| Column | Description |
|---|---|
Field | Column name |
Type | Data type as declared in CREATE TABLE |
Null | YES if the column accepts NULL, NO otherwise |
Key | PRI for primary key columns; empty otherwise (stub) |
Default | Default expression, or NULL if none (stub) |
Extra | auto_increment for AUTO_INCREMENT columns; empty otherwise |
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
bio TEXT
);
DESCRIBE users;
-- Field Type Null Key Default Extra
-- ─────────────────────────────────────────────────
-- id BIGINT NO PRI NULL auto_increment
-- name TEXT NO NULL
-- bio TEXT YES NULL
The Key and Default columns are stubs in the current release and do not yet reflect all constraints or computed defaults. Full metadata is tracked internally in the catalog and will be exposed in a future release.
Practical Examples — E-commerce Queries
Checkout: Atomic Order Placement
BEGIN;
-- Verify stock before committing
SELECT stock FROM products WHERE id = 1 AND stock >= 2;
-- If no row returned, rollback
INSERT INTO orders (user_id, total, status)
VALUES (99, 99.98, 'paid');
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (LAST_INSERT_ID(), 1, 2, 49.99);
UPDATE products SET stock = stock - 2 WHERE id = 1;
COMMIT;
Revenue Report — Last 30 Days
SELECT
p.name AS product,
SUM(oi.quantity) AS units_sold,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o ON o.id = oi.order_id
JOIN products p ON p.id = oi.product_id
WHERE o.placed_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
AND o.status IN ('paid', 'shipped', 'delivered')
GROUP BY p.id, p.name
ORDER BY revenue DESC
LIMIT 10;
User Activity Summary
SELECT
u.id,
u.name,
u.email,
COUNT(o.id) AS total_orders,
SUM(o.total) AS lifetime_value,
MAX(o.placed_at) AS last_order
FROM users u
LEFT JOIN orders o ON o.user_id = u.id AND o.status != 'cancelled'
WHERE u.deleted_at IS NULL
GROUP BY u.id, u.name, u.email
ORDER BY lifetime_value DESC NULLS LAST;
Multi-Statement Queries
AxiomDB accepts multiple SQL statements separated by ; in a single COM_QUERY
call. Each statement executes sequentially, and the client receives one result set
per statement.
-- Three statements in one call
CREATE TABLE IF NOT EXISTS sessions (
id UUID NOT NULL,
user_id INT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO sessions (id, user_id) VALUES (gen_random_uuid(), 42);
SELECT COUNT(*) FROM sessions WHERE user_id = 42;
How it works (protocol):
Each intermediate result set is sent with the SERVER_MORE_RESULTS_EXISTS flag
(0x0008) set in the EOF/OK status bytes, telling the client to read the next
result set. The final result set has the flag cleared.
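The flag check above is a simple bitwise test. A minimal sketch of what a client does with the status bytes (illustrative; the variable values are invented example flags):

```python
# A client keeps reading result sets while SERVER_MORE_RESULTS_EXISTS
# (0x0008) is set in the OK/EOF status flags; the final result set clears it.
SERVER_MORE_RESULTS_EXISTS = 0x0008

def more_results(status_flags: int) -> bool:
    return bool(status_flags & SERVER_MORE_RESULTS_EXISTS)

# Three statements in one COM_QUERY: the first two packets carry the flag,
# the last one clears it (0x0002 here is SERVER_STATUS_AUTOCOMMIT).
statuses = [0x000A, 0x0008, 0x0002]
print([more_results(s) for s in statuses])   # [True, True, False]
```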
Behavior on error:
If any statement fails, execution stops at that point and an error packet is sent. Statements after the failing one are not executed.
-- If INSERT fails (e.g. UNIQUE violation), SELECT is not executed
INSERT INTO users (email) VALUES ('duplicate@example.com');
SELECT * FROM users WHERE email = 'duplicate@example.com';
The mysql CLI, pymysql, and most ORMs handle multi-statement results
automatically once the client capability flag CLIENT_MULTI_STATEMENTS is set;
the mysql CLI enables it by default, while some libraries require opting in at
connect time.
ALTER TABLE — Constraints
ADD CONSTRAINT UNIQUE
-- Named unique constraint (recommended for DROP CONSTRAINT later)
ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);
-- Anonymous unique constraint (auto-named)
ALTER TABLE users ADD UNIQUE (username);
ADD CONSTRAINT UNIQUE creates a unique index internally. Fails with
IndexAlreadyExists if a constraint/index with that name already exists on the table,
or UniqueViolation if the column already has duplicate values.
ADD CONSTRAINT CHECK
ALTER TABLE orders ADD CONSTRAINT chk_positive_amount CHECK (amount > 0);
ALTER TABLE products ADD CONSTRAINT chk_stock CHECK (stock >= 0);
The CHECK expression is validated against all existing rows at the time of the
ALTER TABLE. If any row fails the check, the statement returns CheckViolation.
After the constraint is added, every subsequent INSERT and UPDATE on the table
evaluates the expression.
DROP CONSTRAINT
-- Drop by name (works for both UNIQUE and CHECK constraints)
ALTER TABLE users DROP CONSTRAINT uq_users_email;
-- Silent no-op if the constraint does not exist
ALTER TABLE users DROP CONSTRAINT IF EXISTS uq_users_old;
DROP CONSTRAINT searches first in indexes (for UNIQUE constraints), then in the
named constraint catalog (for CHECK constraints).
ADD CONSTRAINT FOREIGN KEY (Phase 6.5)
Adds a foreign key constraint after the table is created. Validates all existing rows before persisting — fails if any existing value violates the new constraint.
ALTER TABLE orders
ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;
Fails if any existing user_id value has no matching row in users.
Limitations
-- Not yet supported:
ALTER TABLE users ADD CONSTRAINT pk_users PRIMARY KEY (id);
-- → NotImplemented: ADD CONSTRAINT PRIMARY KEY — requires full table rewrite
Prepared Statements — Binary Protocol
AxiomDB supports the full MySQL binary prepared statement protocol, including
large parameter transmission via COM_STMT_SEND_LONG_DATA.
Large parameters (BLOB / TEXT)
When a parameter value is too large to send in a single COM_STMT_EXECUTE
packet, client libraries split it into multiple COM_STMT_SEND_LONG_DATA
chunks before execute. AxiomDB buffers all chunks and assembles the final value
at execute time.
Python (PyMySQL):
import pymysql, os
conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", db="test")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS files (id INT, data LONGBLOB)")
# PyMySQL automatically uses COM_STMT_SEND_LONG_DATA for values > 8 KB
large_blob = os.urandom(64 * 1024) # 64 KB binary data
cur.execute("INSERT INTO files VALUES (%s, %s)", (1, large_blob))
conn.commit()
Binary parameters (BLOB, LONGBLOB, MEDIUMBLOB, TINYBLOB) are stored
as raw bytes — 0x00 bytes and non-UTF-8 sequences are preserved exactly.
Text parameters (VARCHAR, TEXT, LONGTEXT) are decoded with the
connection’s character_set_client after all chunks are assembled, so multibyte
characters split across chunk boundaries are reconstructed correctly.
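The buffer-then-decode behavior described above is what makes chunk-boundary splits safe. A small Python demonstration (illustrative; the payload and chunk sizes are invented):

```python
# Long-data chunks are buffered as raw bytes and decoded only once, at
# execute time, so a multibyte character split across chunk boundaries
# is reconstructed correctly.
payload = "héllo wörld".encode("utf-8")
# Split mid-character: byte index 2 falls inside the two-byte 'é' sequence.
chunks = [payload[:2], payload[2:7], payload[7:]]

buffer = bytearray()
for chunk in chunks:              # one COM_STMT_SEND_LONG_DATA per chunk
    buffer.extend(chunk)

# Decoding each chunk separately would fail on the split 'é';
# decoding the assembled buffer succeeds:
print(bytes(buffer).decode("utf-8"))   # 'héllo wörld'
```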
Parameter type mapping
| MySQL type | AxiomDB type | Notes |
|---|---|---|
MYSQL_TYPE_STRING / VAR_STRING / VARCHAR | TEXT | UTF-8 decoded |
MYSQL_TYPE_BLOB / TINY_BLOB / MEDIUM_BLOB / LONG_BLOB | BYTES | Raw bytes, no charset |
MYSQL_TYPE_LONG / LONGLONG | INT / BIGINT | |
MYSQL_TYPE_FLOAT / DOUBLE | REAL | |
MYSQL_TYPE_DATE | DATE | |
MYSQL_TYPE_DATETIME | TIMESTAMP |
COM_STMT_RESET
Calling mysql_stmt_reset() (or the equivalent in any MySQL driver) clears any
pending long-data buffers for that statement without deallocating the prepared
statement itself. The statement can then be re-executed with fresh parameters.
SHOW STATUS counter
SHOW STATUS LIKE 'Com_stmt_send_long_data' reports how many long-data chunks
have been received by the current session (session scope) or by the server since
startup (global scope).
SHOW STATUS LIKE 'Com_stmt_send_long_data';
-- Variable_name | Value
-- Com_stmt_send_long_data | 3
Expressions and Operators
An expression is any construct that evaluates to a value. Expressions appear in SELECT projections, WHERE conditions, ORDER BY clauses, CHECK constraints, and DEFAULT values.
Operator Precedence
From highest to lowest binding (higher = evaluated first):
| Level | Operators | Associativity |
|---|---|---|
| 1 | () parentheses | — |
| 2 | Unary -, NOT | Right |
| 3 | *, /, % | Left |
| 4 | +, - | Left |
| 5 | =, <>, !=, <, <=, >, >= | — |
| 6 | IS NULL, IS NOT NULL, BETWEEN, LIKE, IN | — |
| 7 | AND | Left |
| 8 | OR | Left |
Use parentheses to make complex expressions explicit:
-- Without parens: AND binds tighter than OR
SELECT * FROM orders WHERE status = 'paid' OR status = 'shipped' AND total > 100;
-- Parsed as: status = 'paid' OR (status = 'shipped' AND total > 100)
-- Explicit grouping
SELECT * FROM orders WHERE (status = 'paid' OR status = 'shipped') AND total > 100;
Arithmetic Operators
| Operator | Meaning | Example | Result |
|---|---|---|---|
+ | Addition | price + tax | — |
- | Subtraction | stock - sold | — |
* | Multiplication | quantity * unit_price | — |
/ | Division | total / 1.19 | — |
% | Modulo | id % 10 | 0–9 |
Integer division truncates toward zero: 7 / 2 = 3.
Division by zero raises a runtime error (22012 division_by_zero).
SELECT
price,
price * 0.19 AS tax,
price * 1.19 AS price_with_tax,
ROUND(price, 2) AS rounded
FROM products;
Comparison Operators
| Operator | Meaning | NULL behavior |
|---|---|---|
= | Equal | Returns NULL if either operand is NULL |
<>, != | Not equal | Returns NULL if either operand is NULL |
< | Less than | Returns NULL if either operand is NULL |
<= | Less than or equal | Returns NULL if either operand is NULL |
> | Greater than | Returns NULL if either operand is NULL |
>= | Greater than or equal | Returns NULL if either operand is NULL |
SELECT * FROM products WHERE price = 49.99;
SELECT * FROM products WHERE stock <> 0;
SELECT * FROM orders WHERE total >= 100;
Boolean Operators
| Operator | Meaning |
|---|---|
AND | TRUE only if both operands are TRUE |
OR | TRUE if at least one operand is TRUE |
NOT | Negates a boolean value |
NULL Semantics — Three-Valued Logic
AxiomDB implements SQL three-valued logic: every boolean expression evaluates to TRUE, FALSE, or UNKNOWN (which SQL represents as NULL in boolean context). The rules below are critical for writing correct WHERE clauses.
AND truth table
| AND | TRUE | FALSE | UNKNOWN |
|---|---|---|---|
| TRUE | TRUE | FALSE | UNKNOWN |
| FALSE | FALSE | FALSE | FALSE |
| UNKNOWN | UNKNOWN | FALSE | UNKNOWN |
OR truth table
| OR | TRUE | FALSE | UNKNOWN |
|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE |
| FALSE | TRUE | FALSE | UNKNOWN |
| UNKNOWN | TRUE | UNKNOWN | UNKNOWN |
NOT truth table
| NOT | Result |
|---|---|
| TRUE | FALSE |
| FALSE | TRUE |
| UNKNOWN | UNKNOWN |
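The three truth tables above can be encoded directly, with Python's None standing in for UNKNOWN. This is a sketch of the semantics, not AxiomDB's evaluator:

```python
# SQL three-valued logic: TRUE/FALSE/UNKNOWN, with None as UNKNOWN.
def sql_and(a, b):
    if a is False or b is False:
        return False              # FALSE dominates AND
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    if a is True or b is True:
        return True               # TRUE dominates OR
    if a is None or b is None:
        return None
    return False

def sql_not(a):
    return None if a is None else (not a)

print(sql_and(True, None))    # None  (UNKNOWN)
print(sql_and(False, None))   # False
print(sql_or(True, None))     # True
print(sql_or(False, None))    # None  (UNKNOWN)
print(sql_not(None))          # None  (UNKNOWN)
```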
Key consequences
-- NULL compared to anything is UNKNOWN, not TRUE or FALSE
SELECT NULL = NULL; -- UNKNOWN (NULL, not TRUE)
SELECT NULL <> NULL; -- UNKNOWN
SELECT NULL = 1; -- UNKNOWN
-- WHERE filters only rows where condition is TRUE
-- Rows where the condition is UNKNOWN are excluded
SELECT * FROM users WHERE age = NULL; -- always returns 0 rows!
SELECT * FROM users WHERE age IS NULL; -- correct NULL check
-- UNKNOWN in AND
SELECT * FROM orders WHERE total > 100 AND NULL; -- 0 rows (UNKNOWN is filtered)
-- UNKNOWN in OR
SELECT * FROM orders WHERE total > 100 OR NULL; -- rows where total > 100
IS NULL / IS NOT NULL
These predicates are the correct way to check for NULL. They always return TRUE or FALSE, never UNKNOWN.
-- Find unshipped orders
SELECT * FROM orders WHERE shipped_at IS NULL;
-- Find orders that have been shipped
SELECT * FROM orders WHERE shipped_at IS NOT NULL;
-- Combine with other conditions
SELECT * FROM users WHERE deleted_at IS NULL AND age > 18;
BETWEEN
BETWEEN low AND high is inclusive on both ends. Equivalent to >= low AND <= high.
-- Products priced between $10 and $50 inclusive
SELECT * FROM products WHERE price BETWEEN 10 AND 50;
-- Orders placed in Q1 2026
SELECT * FROM orders
WHERE placed_at BETWEEN '2026-01-01 00:00:00' AND '2026-03-31 23:59:59';
-- NOT BETWEEN
SELECT * FROM products WHERE price NOT BETWEEN 10 AND 50;
LIKE — Pattern Matching
LIKE matches strings against a pattern.
| Wildcard | Meaning |
|---|---|
% | Any sequence of zero or more characters |
_ | Exactly one character |
Pattern matching is case-sensitive by default. Use CITEXT columns or ILIKE
for case-insensitive matching.
-- Emails from example.com
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Names starting with 'Al'
SELECT * FROM users WHERE name LIKE 'Al%';
-- Exactly 5-character codes
SELECT * FROM products WHERE sku LIKE '_____';
-- NOT LIKE
SELECT * FROM users WHERE email NOT LIKE '%@test.%';
-- Escape a literal %
SELECT * FROM products WHERE description LIKE '50\% off' ESCAPE '\';
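The wildcard and ESCAPE rules above map cleanly onto a regular expression. This is an illustrative translation of LIKE semantics into Python — nothing in this document says AxiomDB's matcher is regex-based, and the function name is invented:

```python
# Translate a LIKE pattern into an anchored regex:
#   %  -> .*   (any sequence of zero or more characters)
#   _  -> .    (exactly one character)
#   <escape>x -> literal x (so '50\% off' matches a literal percent sign)
import re

def like_to_regex(pattern: str, escape: str = "\\"):
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))   # escaped literal
            i += 2
            continue
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("^" + "".join(out) + "$", re.DOTALL)

print(bool(like_to_regex("Al%").match("Alice")))         # True
print(bool(like_to_regex("_____").match("AB123")))       # True
print(bool(like_to_regex("50\\% off").match("50% off"))) # True
print(bool(like_to_regex("Al%").match("alice")))         # False — case-sensitive
```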
IN — Membership Test
IN checks whether a value matches any element in a list.
-- Multiple status values
SELECT * FROM orders WHERE status IN ('pending', 'paid', 'shipped');
-- Numeric list
SELECT * FROM products WHERE category_id IN (1, 3, 7);
-- NOT IN
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
NOT IN (list) returns UNKNOWN (and therefore filters the row) whenever the list contains a NULL and the value matches no non-NULL element. Use NOT EXISTS or explicit NULL checks when the list may contain NULLs.
-- Safe: explicit list with no NULLs
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
-- Dangerous if user_id can be NULL:
SELECT * FROM orders WHERE user_id NOT IN (SELECT id FROM banned_users);
-- If banned_users contains even one NULL user, this returns 0 rows!
-- Safe alternative:
SELECT * FROM orders o
WHERE NOT EXISTS (
SELECT 1 FROM banned_users b WHERE b.id = o.user_id AND b.id IS NOT NULL
);
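The NOT IN pitfall follows directly from three-valued logic: x NOT IN (a, b, NULL) expands to x <> a AND x <> b AND x <> NULL, and the last conjunct is always UNKNOWN, so the whole expression can be FALSE or UNKNOWN but never TRUE. A sketch with None as UNKNOWN (illustrative only):

```python
# NOT IN under three-valued logic: a NULL in the list "taints" every
# non-matching row with UNKNOWN, so WHERE filters it out.
def sql_eq(a, b):
    return None if a is None or b is None else a == b

def sql_not_in(x, values):
    result = True
    for v in values:
        eq = sql_eq(x, v)
        if eq is True:
            return False          # a match makes NOT IN definitively FALSE
        if eq is None:
            result = None         # NULL comparison taints the result
    return result

print(sql_not_in(5, [1, 2, 3]))      # True  — no NULLs, no match
print(sql_not_in(5, [1, None, 3]))   # None  — UNKNOWN: row is filtered out
print(sql_not_in(2, [1, 2, None]))   # False — a match wins regardless of NULL
```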
Scalar Functions
Numeric Functions
| Function | Description | Example |
|---|---|---|
ABS(x) | Absolute value | ABS(-5) → 5 |
CEIL(x) | Ceiling (round up) | CEIL(1.2) → 2 |
FLOOR(x) | Floor (round down) | FLOOR(1.9) → 1 |
ROUND(x, d) | Round to d decimal places | ROUND(3.14159, 2) → 3.14 |
MOD(x, y) | Modulo | MOD(10, 3) → 1 |
POWER(x, y) | x raised to the power y | POWER(2, 8) → 256 |
SQRT(x) | Square root | SQRT(16) → 4 |
String Functions
| Function | Description | Example |
|---|---|---|
LENGTH(s) | Number of bytes | LENGTH('hello') → 5 |
CHAR_LENGTH(s) | Number of UTF-8 characters | CHAR_LENGTH('café') → 4 |
UPPER(s) | Convert to uppercase | UPPER('hello') → 'HELLO' |
LOWER(s) | Convert to lowercase | LOWER('HELLO') → 'hello' |
TRIM(s) | Remove leading and trailing spaces | TRIM(' hi ') → 'hi' |
LTRIM(s) | Remove leading spaces | — |
RTRIM(s) | Remove trailing spaces | — |
SUBSTR(s, pos, len) | Substring from position (1-indexed) | SUBSTR('hello', 2, 3) → 'ell' |
CONCAT(a, b, ...) | Concatenate strings | CONCAT('foo', 'bar') → 'foobar' |
REPLACE(s, from, to) | Replace all occurrences | REPLACE('aabbcc', 'bb', 'X') → 'aaXcc' |
LPAD(s, n, pad) | Pad on the left to length n | LPAD('42', 5, '0') → '00042' |
RPAD(s, n, pad) | Pad on the right to length n | — |
String Concatenation — ||
The || operator concatenates two string values. It is the SQL-standard alternative
to CONCAT() and works in any expression context.
-- Build a full name from two columns
SELECT first_name || ' ' || last_name AS full_name FROM users;
-- Append a suffix
SELECT sku || '-v2' AS new_sku FROM products;
-- NULL propagates: if either operand is NULL the result is NULL
SELECT 'hello' || NULL; -- NULL
Use COALESCE to guard against NULL operands:
SELECT COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') AS full_name
FROM users;
CAST — Explicit Type Conversion
CAST(expr AS type) converts a value to the specified type. Use it when an implicit
coercion would be rejected in strict mode (the default).
-- Text-to-number: always works when the text is a valid number
SELECT CAST('42' AS INT); -- 42
SELECT CAST('3.14' AS REAL); -- 3.14
SELECT CAST('100' AS BIGINT); -- 100
-- Use CAST to store a text literal in a numeric column
INSERT INTO users (age) VALUES (CAST('30' AS INT));
CAST(numeric AS TEXT) — converting an integer or real value to text — is not
supported in the current release and raises 22018 invalid_character_value_for_cast.
Use application-side formatting or wait for Phase 5 (full coercion matrix). The supported
direction is text → number, not number → text.
Supported CAST pairs (Phase 4.16):
| From | To | Notes |
|---|---|---|
TEXT | INT, BIGINT | Entire string must be a valid integer |
TEXT | REAL | Entire string must be a valid float |
TEXT | DECIMAL | Entire string must be a valid decimal |
INT | BIGINT, REAL, DECIMAL | Widening — always succeeds |
BIGINT | REAL, DECIMAL | Widening — always succeeds |
NULL | any | Always returns NULL |
Conditional Functions
| Function | Description |
|---|---|
COALESCE(a, b, ...) | Return first non-NULL argument |
NULLIF(a, b) | Return NULL if a = b, otherwise return a |
IIF(cond, then, else) | Inline if-then-else |
CASE WHEN ... THEN ... END | General conditional expression |
-- COALESCE: display a fallback when the column is NULL
SELECT name, COALESCE(phone, 'N/A') AS contact FROM users;
-- NULLIF: convert 'unknown' to NULL (for aggregate functions to ignore)
SELECT AVG(NULLIF(rating, 0)) AS avg_rating FROM products;
-- CASE: categorize order size
SELECT
id,
total,
CASE
WHEN total < 50 THEN 'small'
WHEN total < 200 THEN 'medium'
WHEN total < 1000 THEN 'large'
ELSE 'enterprise'
END AS order_size
FROM orders;
CASE WHEN — Conditional Expressions
CASE WHEN is a general-purpose conditional expression that can appear anywhere an
expression is valid: SELECT projections, WHERE clauses, ORDER BY, GROUP BY, HAVING,
and as arguments to aggregate functions.
AxiomDB supports two forms: searched CASE (any boolean condition per branch) and simple CASE (equality comparison against a single value).
Searched CASE
Evaluates each WHEN condition left to right and returns the THEN value of the
first condition that is TRUE. If no condition matches and an ELSE is present, the
ELSE value is returned. If no condition matches and there is no ELSE, the result
is NULL.
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
[ELSE default_result]
END
-- Categorize orders by total amount
SELECT
id,
total,
CASE
WHEN total < 50 THEN 'small'
WHEN total < 200 THEN 'medium'
WHEN total < 1000 THEN 'large'
ELSE 'enterprise'
END AS order_size
FROM orders;
-- Compute a human-readable status label, including NULL handling
SELECT
id,
CASE
WHEN shipped_at IS NULL AND status = 'paid' THEN 'awaiting shipment'
WHEN shipped_at IS NOT NULL THEN 'shipped'
WHEN status = 'cancelled' THEN 'cancelled'
ELSE 'unknown'
END AS display_status
FROM orders;
Simple CASE
Compares a single expression against a list of values. Equivalent to a searched CASE
using = for each WHEN comparison.
CASE expression
WHEN value1 THEN result1
WHEN value2 THEN result2
...
[ELSE default_result]
END
-- Map status codes to display labels
SELECT
id,
CASE status
WHEN 'pending' THEN 'Pending Payment'
WHEN 'paid' THEN 'Paid'
WHEN 'shipped' THEN 'Shipped'
WHEN 'delivered' THEN 'Delivered'
WHEN 'cancelled' THEN 'Cancelled'
ELSE 'Unknown'
END AS status_label
FROM orders;
NULL Semantics in CASE
In a searched CASE, a WHEN condition that evaluates to UNKNOWN (NULL in boolean
context) is treated the same as FALSE — it does not match, and evaluation continues
to the next branch. This means a NULL condition never triggers a THEN clause.
In a simple CASE, the comparison expression = value uses standard SQL equality,
which returns UNKNOWN when either side is NULL. As a result, WHEN NULL never matches.
Use a searched CASE with IS NULL to handle NULL values explicitly.
-- Simple CASE: WHEN NULL never matches (NULL = NULL yields UNKNOWN, not TRUE)
SELECT CASE NULL WHEN NULL THEN 'matched' ELSE 'no match' END;
-- Result: 'no match'
-- Correct way to handle NULL in a simple CASE: use searched form
SELECT
CASE
WHEN status IS NULL THEN 'no status'
ELSE status
END AS safe_status
FROM orders;
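The UNKNOWN-as-no-match rule can be modeled with a tiny evaluator in which Python's None stands in for SQL NULL/UNKNOWN. An illustrative sketch, not engine code:

```python
def searched_case(branches, else_value=None):
    """Evaluate (condition, result) pairs left to right.
    Only a condition of True fires its THEN; False and None (UNKNOWN)
    both fall through to the next branch."""
    for condition, result in branches:
        if condition is True:       # UNKNOWN never triggers a THEN clause
            return result
    return else_value               # missing ELSE defaults to NULL (None)

status = None
# WHEN status = 'paid' is UNKNOWN when status is NULL, so it falls through;
# WHEN status IS NULL is TRUE, never UNKNOWN, so it matches.
print(searched_case([
    (None if status is None else status == 'paid', 'paid order'),
    (status is None, 'no status'),
], 'unknown'))                      # 'no status'
```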
CASE in ORDER BY — Controlled Sort Order
CASE can produce a sort key that cannot be expressed with a single column reference.
-- Sort orders: unshipped first (status='paid'), then by recency
SELECT id, status, placed_at
FROM orders
ORDER BY
CASE WHEN status = 'paid' AND shipped_at IS NULL THEN 0 ELSE 1 END,
placed_at DESC;
CASE in GROUP BY — Dynamic Grouping
-- Group products by price tier and count items per tier
SELECT
CASE
WHEN price < 25 THEN 'budget'
WHEN price < 100 THEN 'mid-range'
ELSE 'premium'
END AS tier,
COUNT(*) AS product_count,
AVG(price) AS avg_price
FROM products
WHERE deleted_at IS NULL
GROUP BY
CASE
WHEN price < 25 THEN 'budget'
WHEN price < 100 THEN 'mid-range'
ELSE 'premium'
END
ORDER BY avg_price;
Design note: AxiomDB evaluates CASE expressions during row processing in the executor's expression evaluator. Short-circuit evaluation guarantees that branches after the first matching WHEN are never evaluated, which prevents side effects (e.g., division by zero in an unreachable branch).
Date / Time Functions
Current date / time
| Function | Return type | Description |
|---|---|---|
NOW() | TIMESTAMP | Current timestamp (UTC) |
CURRENT_DATE | DATE | Current date (no time) |
CURRENT_TIME | TIMESTAMP | Current time (no date) |
CURRENT_TIMESTAMP | TIMESTAMP | Alias for NOW() |
UNIX_TIMESTAMP() | BIGINT | Current time as Unix seconds |
Date component extractors
| Function | Returns | Description |
|---|---|---|
year(val) | INT | Year (e.g. 2025) |
month(val) | INT | Month 1–12 |
day(val) | INT | Day of month 1–31 |
hour(val) | INT | Hour 0–23 |
minute(val) | INT | Minute 0–59 |
second(val) | INT | Second 0–59 |
DATEDIFF(a, b) | INT | Days between two dates (a - b) |
val accepts DATE, TIMESTAMP, or a text string coercible to a date.
Returns NULL if the input is NULL or not a valid date type.
SELECT year(NOW()), month(NOW()), day(NOW()); -- e.g. 2025, 3, 25
SELECT hour(NOW()), minute(NOW()), second(NOW()); -- e.g. 14, 30, 45
DATE_FORMAT — format a date as text
DATE_FORMAT(ts, format_string) → TEXT
Formats a DATE or TIMESTAMP value using MySQL-compatible format specifiers.
Returns NULL if either argument is NULL or the format string is empty.
| Specifier | Description | Example |
|---|---|---|
%Y | 4-digit year | 2025 |
%y | 2-digit year | 25 |
%m | Month 01–12 | 03 |
%c | Month 1–12 (no pad) | 3 |
%M | Full month name | March |
%b | Abbreviated month name | Mar |
%d | Day 01–31 | 05 |
%e | Day 1–31 (no pad) | 5 |
%H | Hour 00–23 | 14 |
%h | Hour 01–12 (12-hour) | 02 |
%i | Minute 00–59 | 30 |
%s/%S | Second 00–59 | 45 |
%p | AM / PM | PM |
%W | Full weekday name | Tuesday |
%a | Abbreviated weekday | Tue |
%j | Day of year 001–366 | 084 |
%w | Weekday 0=Sun…6=Sat | 2 |
%T | Time HH:MM:SS (24h) | 14:30:45 |
%r | Time HH:MM:SS AM/PM | 02:30:45 PM |
%% | Literal % | % |
Unknown specifiers are passed through literally (%X → %X).
-- Format a stored timestamp as ISO date
SELECT DATE_FORMAT(created_at, '%Y-%m-%d') FROM orders;
-- '2025-03-25'
-- European date format
SELECT DATE_FORMAT(NOW(), '%d/%m/%Y');
-- '25/03/2025'
-- Full datetime
SELECT DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i:%s');
-- '2025-03-25 14:30:45'
-- NULL input → NULL output
SELECT DATE_FORMAT(NULL, '%Y-%m-%d'); -- NULL
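The specifier table above lends itself to a table-driven formatter. This hypothetical Python sketch covers only a handful of specifiers (%Y, %m, %d, %H, %i, %s) to show the shape of the mapping, including NULL propagation and literal pass-through of unknown specifiers; the real implementation is Rust:

```python
from datetime import datetime

# Map MySQL format specifiers to extractor functions (subset for illustration)
SPEC = {
    'Y': lambda t: '%04d' % t.year,
    'm': lambda t: '%02d' % t.month,
    'd': lambda t: '%02d' % t.day,
    'H': lambda t: '%02d' % t.hour,
    'i': lambda t: '%02d' % t.minute,   # MySQL: %i means minutes
    's': lambda t: '%02d' % t.second,
    '%': lambda t: '%',
}

def date_format(ts, fmt):
    if ts is None or not fmt:
        return None                      # NULL input or empty format -> NULL
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == '%' and i + 1 < len(fmt):
            spec = fmt[i + 1]
            # unknown specifiers pass through literally (%X -> %X)
            out.append(SPEC[spec](ts) if spec in SPEC else '%' + spec)
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return ''.join(out)

t = datetime(2025, 3, 25, 14, 30, 45)
print(date_format(t, '%Y-%m-%d %H:%i:%s'))   # 2025-03-25 14:30:45
```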
STR_TO_DATE — parse a date string
STR_TO_DATE(str, format_string) → DATE | TIMESTAMP | NULL
Parses a text string into a date or timestamp using MySQL-compatible format
specifiers (same table as DATE_FORMAT above).
- Returns DATE if the format contains only date components.
- Returns TIMESTAMP if the format contains any time components (%H, %i, %s).
- Returns NULL on any parse failure — never raises an error (MySQL behavior).
- Returns NULL if either argument is NULL.
2-digit year rule (%y): 00–69 → 2000–2069; 70–99 → 1970–1999.
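The pivot rule amounts to a one-line conversion (illustrative sketch):

```python
def expand_two_digit_year(yy):
    """MySQL %y rule: 00-69 -> 2000-2069, 70-99 -> 1970-1999."""
    return 2000 + yy if yy <= 69 else 1900 + yy

print(expand_two_digit_year(25))   # 2025
print(expand_two_digit_year(70))   # 1970
print(expand_two_digit_year(99))   # 1999
```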
-- Parse ISO date → Value::Date
SELECT STR_TO_DATE('2025-03-25', '%Y-%m-%d');
-- Parse European date → Value::Date
SELECT STR_TO_DATE('25/03/2025', '%d/%m/%Y');
-- Parse datetime → Value::Timestamp
SELECT STR_TO_DATE('2025-03-25 14:30:00', '%Y-%m-%d %H:%i:%s');
-- Extract components from a parsed date
SELECT year(STR_TO_DATE('2025-03-25', '%Y-%m-%d')); -- 2025
-- Round-trip: parse then format
SELECT DATE_FORMAT(STR_TO_DATE('2025-03-25', '%Y-%m-%d'), '%d/%m/%Y');
-- '25/03/2025'
-- Invalid date → NULL (Feb 30 does not exist)
SELECT STR_TO_DATE('2025-02-30', '%Y-%m-%d'); -- NULL
-- Bad format → NULL (never an error)
SELECT STR_TO_DATE('not-a-date', '%Y-%m-%d'); -- NULL
FIND_IN_SET — search a comma-separated list
FIND_IN_SET(needle, csv_list) → INT
Returns the 1-indexed position of needle in the comma-separated string
csv_list. Returns 0 if not found. Comparison is case-insensitive.
Returns NULL if either argument is NULL.
SELECT FIND_IN_SET('b', 'a,b,c'); -- 2
SELECT FIND_IN_SET('B', 'a,b,c'); -- 2 (case-insensitive)
SELECT FIND_IN_SET('z', 'a,b,c'); -- 0 (not found)
SELECT FIND_IN_SET('a', ''); -- 0 (empty list)
SELECT FIND_IN_SET(NULL, 'a,b,c'); -- NULL
Useful for querying rows where a column holds a comma-separated tag list:
SELECT * FROM articles WHERE FIND_IN_SET('rust', tags) > 0;
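The documented semantics (1-indexed position, case-insensitive comparison, 0 when absent, NULL propagation) fit in a short reference implementation. This Python sketch is for illustration, not the engine's code:

```python
def find_in_set(needle, csv_list):
    """1-indexed position of needle in a comma-separated list;
    0 if absent; case-insensitive; NULL (None) propagates."""
    if needle is None or csv_list is None:
        return None
    if csv_list == '':
        return 0                          # empty list never matches
    for pos, item in enumerate(csv_list.split(','), start=1):
        if item.lower() == needle.lower():
            return pos
    return 0

print(find_in_set('B', 'a,b,c'))          # 2 (case-insensitive)
print(find_in_set('z', 'a,b,c'))          # 0 (not found)
print(find_in_set('a', ''))               # 0 (empty list)
```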
Design note: DATE_FORMAT and STR_TO_DATE map specifiers manually instead of delegating to chrono's format strings, whose grammar does not match MySQL's one-for-one (for example, MySQL's %m is the zero-padded month, but chrono assigns %m different semantics). Manual mapping guarantees exact MySQL semantics for all 18 specifiers, including %T, %r, and the 2-digit year rules, without risking divergence from the underlying library's format grammar.
-- DATE_TRUNC and DATE_PART (PostgreSQL-compatible aliases)
SELECT DATE_TRUNC('month', placed_at) AS month, COUNT(*) FROM orders GROUP BY 1;
SELECT DATE_PART('year', created_at) AS signup_year FROM users;
Session Functions
Session functions return state that is specific to the current connection and is not visible to other sessions.
| Function | Return type | Description |
|---|---|---|
LAST_INSERT_ID() | BIGINT | ID generated by the most recent AUTO_INCREMENT INSERT in this session |
lastval() | BIGINT | PostgreSQL-compatible alias for LAST_INSERT_ID() |
version() | TEXT | Server version string, e.g. '8.0.36-AxiomDB-0.1.0' |
current_user() | TEXT | Authenticated username of the current connection |
session_user() | TEXT | Alias for current_user() |
current_database() | TEXT | Name of the current database ('axiomdb') |
database() | TEXT | MySQL-compatible alias for current_database() |
-- Commonly called by ORMs on connect to verify server identity
SELECT version(); -- '8.0.36-AxiomDB-0.1.0'
SELECT current_user(); -- 'root'
SELECT current_database(); -- 'axiomdb'
Semantics:
- Returns 0 if no AUTO_INCREMENT INSERT has occurred in the current session.
- For a single-row INSERT, returns the generated ID.
- For a multi-row INSERT (INSERT INTO t VALUES (...), (...), ...), returns the ID generated for the first row of the batch (MySQL semantics). Subsequent rows receive consecutive IDs.
- Inserting an explicit non-NULL value into an AUTO_INCREMENT column does not advance the sequence and does not update LAST_INSERT_ID().
- TRUNCATE TABLE resets the sequence to 1 but does not change the session's LAST_INSERT_ID() value.
CREATE TABLE items (id BIGINT PRIMARY KEY AUTO_INCREMENT, name TEXT);
-- Single-row INSERT
INSERT INTO items (name) VALUES ('Widget');
SELECT LAST_INSERT_ID(); -- 1
SELECT lastval(); -- 1
-- Multi-row INSERT
INSERT INTO items (name) VALUES ('Gadget'), ('Gizmo'), ('Doohickey');
SELECT LAST_INSERT_ID(); -- 2 (first generated ID in the batch)
-- Explicit value — does not change LAST_INSERT_ID()
INSERT INTO items (id, name) VALUES (99, 'Special');
SELECT LAST_INSERT_ID(); -- still 2
-- Use inside the same statement (e.g., insert a child row)
INSERT INTO orders (user_id, item_id) VALUES (42, LAST_INSERT_ID());
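The batch and explicit-value rules above can be modeled per session. This sketch follows the documented semantics; the Session class and its method names are hypothetical, not part of any real API:

```python
class Session:
    def __init__(self):
        self.next_id = 1          # table's AUTO_INCREMENT counter
        self.last_insert_id = 0   # 0 until the first generating INSERT

    def insert_rows(self, rows):
        """Multi-row INSERT: LAST_INSERT_ID() becomes the ID generated
        for the FIRST row of the batch (MySQL semantics); later rows
        receive consecutive IDs."""
        first = self.next_id
        self.next_id += len(rows)
        self.last_insert_id = first
        return first

    def insert_explicit(self, explicit_id):
        """Explicit non-NULL id: per the docs, neither the sequence
        nor LAST_INSERT_ID() is touched."""
        return explicit_id

s = Session()
s.insert_rows(['Widget'])
print(s.last_insert_id)               # 1
s.insert_rows(['Gadget', 'Gizmo', 'Doohickey'])
print(s.last_insert_id)               # 2 (first ID of the batch)
s.insert_explicit(99)
print(s.last_insert_id)               # still 2
```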
Aggregate Functions
| Function | Description | NULL behavior |
|---|---|---|
COUNT(*) | Count all rows in the group | Includes NULL rows |
COUNT(col) | Count non-NULL values in col | Excludes NULL values |
SUM(col) | Sum of non-NULL values | Returns NULL if all NULL |
AVG(col) | Arithmetic mean of non-NULL values | Returns NULL if all NULL |
MIN(col) | Minimum non-NULL value | Returns NULL if all NULL |
MAX(col) | Maximum non-NULL value | Returns NULL if all NULL |
SELECT
COUNT(*) AS total_rows,
COUNT(email) AS rows_with_email, -- excludes NULL
SUM(total) AS gross_revenue,
AVG(total) AS avg_order_value,
MIN(placed_at) AS first_order,
MAX(placed_at) AS last_order
FROM orders
WHERE status != 'cancelled';
GROUP_CONCAT — String Aggregation
GROUP_CONCAT concatenates non-NULL values across the rows of a group into a single
string. It is MySQL’s most widely-used aggregate function for collecting tags, roles,
categories, and comma-separated lists without a client-side join.
string_agg(expr, separator) is the PostgreSQL-compatible alias.
Syntax
GROUP_CONCAT([DISTINCT] expr [ORDER BY col [ASC|DESC], ...] [SEPARATOR 'str'])
string_agg(expr, separator)
| Clause | Default | Description |
|---|---|---|
DISTINCT | off | Deduplicate values before concatenating |
ORDER BY | none | Sort values within the group before joining |
SEPARATOR | ',' | String inserted between values |
Behavior
- NULL values are skipped — they do not appear in the result and do not add a separator.
- An empty group (no rows) or a group where every value is NULL returns NULL.
- A single value returns that value with no separator added.
- The result is truncated to a 1 MB (1,048,576 bytes) maximum.
-- Basic: comma-separated tags per post
SELECT post_id, GROUP_CONCAT(tag ORDER BY tag ASC)
FROM post_tags
GROUP BY post_id;
-- post 1 → 'async,db,rust'
-- post 2 → 'rust,web'
-- post 3 (all NULL tags) → NULL
-- Custom separator
SELECT GROUP_CONCAT(tag ORDER BY tag ASC SEPARATOR ' | ')
FROM post_tags
WHERE post_id = 1;
-- → 'async | db | rust'
-- DISTINCT: deduplicate before joining
SELECT GROUP_CONCAT(DISTINCT tag ORDER BY tag ASC)
FROM tags;
-- Duplicate 'rust' rows → 'async,db,rust' (appears once)
-- string_agg PostgreSQL alias
SELECT string_agg(tag, ', ')
FROM post_tags
WHERE post_id = 2;
-- → 'rust, web' (or 'web, rust' — insertion order)
-- HAVING on a GROUP_CONCAT result
SELECT post_id, GROUP_CONCAT(tag ORDER BY tag ASC) AS tags
FROM post_tags
GROUP BY post_id
HAVING GROUP_CONCAT(tag ORDER BY tag ASC) LIKE '%rust%';
-- Only posts that have the 'rust' tag
-- Collect integers as text
SELECT GROUP_CONCAT(n ORDER BY n ASC) FROM nums;
-- 1, 2, 3 → '1,2,3'
AxiomDB supports the full MySQL GROUP_CONCAT syntax, including DISTINCT, multi-column ORDER BY, and the SEPARATOR keyword. MySQL codebases that use GROUP_CONCAT for tag or role lists migrate without modification.
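The behavior rules condense into a short reference implementation. This Python sketch mirrors the documented defaults and is not the engine's code; the byte-exact 1 MB cap is approximated at character granularity:

```python
def group_concat(values, distinct=False, order=False, separator=','):
    """Concatenate non-NULL values; NULL-only or empty input -> NULL (None).
    Defaults mirror the docs: separator ',', no dedup, no sort."""
    vals = [str(v) for v in values if v is not None]   # NULLs are skipped
    if not vals:
        return None                                     # empty group -> NULL
    if distinct:
        vals = list(dict.fromkeys(vals))                # dedup, keep first seen
    if order:
        vals = sorted(vals)
    out = separator.join(vals)                          # single value: no separator
    return out[:1048576]                                # approximate 1 MB cap

print(group_concat(['db', 'rust', 'async'], order=True))      # async,db,rust
print(group_concat([None, None]))                             # None
print(group_concat(['rust', 'rust', 'web'], distinct=True))   # rust,web
```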
BLOB / Binary Functions
AxiomDB stores binary data as the BLOB / BYTES type and provides functions for
encoding, decoding, and measuring binary values.
| Function | Returns | Description |
|---|---|---|
FROM_BASE64(text) | BLOB | Decode standard base64 → raw bytes. Returns NULL on invalid input. |
TO_BASE64(blob) | TEXT | Encode raw bytes → base64 string. Also accepts TEXT and UUID. |
OCTET_LENGTH(value) | INT | Byte length of a BLOB, TEXT (UTF-8 bytes), or UUID (always 16). |
ENCODE(blob, fmt) | TEXT | Encode bytes as 'base64' or 'hex'. |
DECODE(text, fmt) | BLOB | Decode 'base64' or 'hex' text → raw bytes. |
Usage examples
-- Store binary data encoded as base64
INSERT INTO files (name, data)
VALUES ('logo.png', FROM_BASE64('iVBORw0KGgoAAAANSUhEUgAA...'));
-- Retrieve as base64 for transport
SELECT name, TO_BASE64(data) AS data_b64 FROM files;
-- Check byte size of a blob
SELECT name, OCTET_LENGTH(data) AS size_bytes FROM files;
-- Hex encoding (PostgreSQL / MySQL ENCODE style)
SELECT ENCODE(data, 'hex') FROM files; -- → 'deadbeef...'
SELECT DECODE('deadbeef', 'hex'); -- → binary bytes
-- OCTET_LENGTH vs LENGTH for text
SELECT LENGTH('héllo'); -- 5 (characters)
SELECT OCTET_LENGTH('héllo'); -- 6 (UTF-8 bytes: é = 2 bytes)
A common pattern: the application SELECTs TO_BASE64(data) to get a transport-safe string, and reverses it with FROM_BASE64() on INSERT. This avoids binary-encoding issues in the MySQL wire protocol's text mode.
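ENCODE/DECODE and the OCTET_LENGTH distinction can be reproduced with Python's standard library. A sketch of the documented behavior, not AxiomDB internals:

```python
import base64

def encode(data: bytes, fmt: str) -> str:
    """ENCODE(blob, fmt): bytes -> 'base64' or 'hex' text."""
    if fmt == 'base64':
        return base64.b64encode(data).decode('ascii')
    if fmt == 'hex':
        return data.hex()
    raise ValueError('unknown format')

def decode(text: str, fmt: str) -> bytes:
    """DECODE(text, fmt): 'base64' or 'hex' text -> raw bytes."""
    if fmt == 'base64':
        return base64.b64decode(text)
    if fmt == 'hex':
        return bytes.fromhex(text)
    raise ValueError('unknown format')

print(encode(b'\xde\xad\xbe\xef', 'hex'))           # deadbeef
print(decode('deadbeef', 'hex'))                    # b'\xde\xad\xbe\xef'
# OCTET_LENGTH counts UTF-8 bytes, not characters
print(len('héllo'), len('héllo'.encode('utf-8')))   # 5 6
```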
UUID Functions
AxiomDB generates and validates UUIDs server-side. No application-level library needed — the DB handles UUID primary keys directly.
| Function | Returns | Description |
|---|---|---|
gen_random_uuid() | UUID | UUID v4 — 122 random bits. Aliases: uuid_generate_v4(), random_uuid(), newid() |
uuid_generate_v7() | UUID | UUID v7 — 48-bit unix timestamp + random bits. Alias: uuid7() |
is_valid_uuid(text) | BOOL | TRUE if text is a valid UUID string (hyphenated or compact). Alias: is_uuid(). Returns NULL if arg is NULL. |
Usage
-- Auto-generate a UUID primary key at insert time
CREATE TABLE events (
id UUID NOT NULL,
name TEXT NOT NULL
);
INSERT INTO events (id, name)
VALUES (gen_random_uuid(), 'page_view');
-- Use UUID v7 for tables that benefit from time-ordered inserts
INSERT INTO events (id, name)
VALUES (uuid_generate_v7(), 'checkout');
-- Validate an incoming UUID string before inserting
SELECT is_valid_uuid('550e8400-e29b-41d4-a716-446655440000'); -- TRUE
SELECT is_valid_uuid('not-a-uuid'); -- FALSE
SELECT is_valid_uuid(NULL); -- NULL
UUID v4 vs UUID v7 — which to use?
-- UUID v4: fully random, best for security-sensitive IDs
-- Format: xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx (122 random bits)
SELECT gen_random_uuid();
-- → 'f47ac10b-58cc-4372-a567-0e02b2c3d479'
-- UUID v7: time-ordered prefix, best for primary keys on B+ Tree indexes
-- Format: [48-bit ms timestamp]-[12-bit rand]-[62-bit rand]
SELECT uuid_generate_v7();
-- → '018e2e3a-1234-7abc-8def-0123456789ab'
-- ^^^^^^^^^^^ always increasing
Because v7 values carry a time-ordered prefix, new keys land near the rightmost B+ Tree leaf, much like AUTO_INCREMENT. For tables receiving hundreds of inserts per second, UUID v7 can be 2-5× faster than v4 for write throughput.
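The v7 layout described above (48-bit millisecond timestamp, version nibble, random tail) can be sketched as follows. This is an illustrative generator, not the engine's implementation:

```python
import os, time

def uuid7():
    """Sketch of a UUID v7: 48-bit unix-ms timestamp, version nibble 7,
    variant bits 10, remaining 74 bits random (per the layout above)."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), 'big') & 0x0FFF   # 12 bits
    rand_b = int.from_bytes(os.urandom(8), 'big') & ((1 << 62) - 1)
    value = (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    h = '%032x' % value
    return '%s-%s-%s-%s-%s' % (h[:8], h[8:12], h[12:16], h[16:20], h[20:])

u = uuid7()
print(u)
print(u[14])   # 7 (version nibble); the timestamp prefix sorts by creation time
```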
Features
Advanced AxiomDB capabilities beyond basic SQL.
- Transactions — BEGIN, COMMIT, ROLLBACK, SAVEPOINT, MVCC, isolation levels
- Catalog & Schema — system tables, SHOW TABLES, DESCRIBE, introspection queries
- Indexes — B+ Tree indexes, composite indexes, partial indexes, query planning
Transactions
A transaction is a sequence of SQL operations that execute as a single atomic unit: either all succeed (COMMIT) or none of them take effect (ROLLBACK). AxiomDB implements full ACID transactions backed by a Write-Ahead Log and Multi-Version Concurrency Control.
Basic Transaction Control
BEGIN;
-- ... SQL statements ...
COMMIT; -- make all changes permanent
BEGIN;
-- ... SQL statements ...
ROLLBACK; -- undo all changes since BEGIN
Simple Example — Money Transfer
BEGIN;
-- Debit the sender
UPDATE accounts SET balance = balance - 250.00 WHERE id = 1;
-- Credit the receiver
UPDATE accounts SET balance = balance + 250.00 WHERE id = 2;
-- Both succeed together, or neither succeeds
COMMIT;
If the connection drops after the first UPDATE but before COMMIT, the WAL records both the transaction start and the mutation. During crash recovery, AxiomDB sees no COMMIT record for this transaction and discards the partial change. Account 1 keeps its original balance.
Phases 39.11 and 39.12 extend that internal durability model to the clustered-index storage rewrite: clustered rows now have WAL-backed rollback/savepoint support and crash recovery by primary key plus exact row image. Phase 39.13 makes the first SQL-visible clustered cut: CREATE TABLE with an explicit PRIMARY KEY now creates clustered metadata and a clustered table root. Phase 39.14 extends that cut to clustered INSERT, recording clustered WAL/undo directly against the clustered PK tree. Phase 39.15 opens clustered SELECT over the same storage, Phase 39.16 extends the transaction contract to clustered UPDATE, and Phase 39.17 extends it to clustered DELETE, implemented as a delete-mark plus exact row-image undo. Phase 39.18 adds clustered VACUUM: once a clustered delete-mark is old enough to be physically safe, VACUUM table_name purges the dead row, frees any overflow chain it owned, and cleans dead bookmark entries from clustered secondary indexes. Phase 39.22 adds zero-allocation in-place UPDATE: when all SET columns are fixed-size (INT, BIGINT, REAL, BOOL, DATE, TIMESTAMP), field bytes are patched directly in the page buffer without decoding the row, and ROLLBACK reverses only the changed bytes via UndoClusteredFieldPatch — no full row image is stored in the undo log.
CREATE TABLE users (id INT PRIMARY KEY, email TEXT UNIQUE);
BEGIN;
INSERT INTO users VALUES (1, 'alice@example.com');
ROLLBACK;
That rollback now restores clustered INSERT, clustered UPDATE, and clustered DELETE state: the clustered base row goes back to its exact previous row image, and any bookmark-bearing secondary entries are deleted or reinserted to match when the statement rewrote them.
Autocommit
When no explicit BEGIN is issued, each statement executes in its own implicit
transaction and is committed automatically on success. This is the default mode.
-- Each of these is its own transaction
INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');
INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com');
To group multiple statements atomically, always use explicit BEGIN ... COMMIT.
SAVEPOINT — Partial Rollback
Savepoints mark a point within a transaction to which you can roll back without aborting the entire transaction. ORMs (Django, Rails, Sequelize) use savepoints internally for partial error recovery.
BEGIN;
INSERT INTO orders (user_id, total) VALUES (1, 99.99);
SAVEPOINT after_order;
INSERT INTO order_items (order_id, product_id, quantity) VALUES (1, 42, 1);
-- Suppose this fails a CHECK constraint
ROLLBACK TO SAVEPOINT after_order;
-- The order row still exists; only the order_item is rolled back
-- Try again with corrected data
INSERT INTO order_items (order_id, product_id, quantity) VALUES (1, 42, 0);
-- Still fails — give up entirely
ROLLBACK;
You can have multiple savepoints with different names:
BEGIN;
SAVEPOINT sp1;
-- ... work ...
SAVEPOINT sp2;
-- ... more work ...
ROLLBACK TO SAVEPOINT sp1; -- undo everything since sp1
RELEASE SAVEPOINT sp1; -- destroy the savepoint (optional cleanup)
COMMIT;
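Savepoint semantics reduce to a named stack of undo-log positions. A minimal sketch, assuming an undo log that records one entry per change:

```python
class Txn:
    def __init__(self):
        self.undo = []            # undo log: one entry per change
        self.savepoints = {}      # name -> undo-log length at SAVEPOINT time

    def change(self, desc):
        self.undo.append(desc)

    def savepoint(self, name):
        self.savepoints[name] = len(self.undo)

    def rollback_to(self, name):
        """Undo every change made after the savepoint; the savepoint itself
        survives, while savepoints created after it are destroyed."""
        keep = self.savepoints[name]
        undone = self.undo[keep:]
        del self.undo[keep:]
        self.savepoints = {n: p for n, p in self.savepoints.items() if p <= keep}
        return undone

t = Txn()
t.change('INSERT orders #1')
t.savepoint('after_order')
t.change('INSERT order_items #1')
print(t.rollback_to('after_order'))   # ['INSERT order_items #1']
print(t.undo)                         # ['INSERT orders #1']
```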
MVCC — Multi-Version Concurrency Control
AxiomDB uses MVCC plus a server-side Arc<RwLock<Database>>.
Today that means:
- read-only statements (SELECT, SHOW, metadata queries) run concurrently
- mutating statements (INSERT, UPDATE, DELETE, DDL, BEGIN/COMMIT/ROLLBACK) are serialized at whole-database granularity
- a read that is already running keeps its snapshot while another session commits
- row-level locking, deadlock detection, and SELECT ... FOR UPDATE are planned for Phases 13.7, 13.8, and 13.8b
This is good for read-heavy workloads, but it is still below MySQL/InnoDB and PostgreSQL for write concurrency because they already lock at row granularity.
How It Works
When a transaction starts, it receives a snapshot — a consistent view of the database as it existed at that moment. Other transactions may commit new changes while your transaction runs, but your snapshot does not change.
Time →
Txn A (snapshot at T=100): BEGIN → reads → reads → COMMIT
| | |
Txn B: | INSERT | COMMIT |
| | |
Txn A sees the world as it was at T=100.
Txn B's inserts are not visible to Txn A.
This is implemented via the Copy-on-Write B+ Tree: when Txn B writes a page, it creates a new copy rather than overwriting the original. Txn A holds a pointer to the old root and continues reading the old version. When Txn A commits, the old pages become eligible for reclamation.
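The snapshot mechanics can be sketched with an immutable map standing in for the CoW tree root: writers publish a new root, and readers keep whatever root they captured at BEGIN. A simplified model, not the storage engine's code:

```python
from types import MappingProxyType

# The "root" is an immutable view; commits replace it rather than mutate it.
current_root = MappingProxyType({'accounts:1': 1000})

def begin_snapshot():
    return current_root               # capture the root pointer, O(1)

def commit_write(key, value):
    global current_root
    new_pages = dict(current_root)    # copy-on-write: clone, never overwrite
    new_pages[key] = value
    current_root = MappingProxyType(new_pages)

snap_a = begin_snapshot()             # Txn A: snapshot at T=100
commit_write('accounts:1', 900)       # Txn B commits an UPDATE
print(snap_a['accounts:1'])           # 1000: A still sees its snapshot
print(begin_snapshot()['accounts:1']) # 900: new snapshots see B's commit
```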
No Per-Page Read Latches
Readers access immutable snapshots and owned page copies, so they do not take
per-page latches in the storage layer. The current server runtime still uses a
database-wide RwLock, so the real guarantee today is:
- many reads can run together
- writes do not run in parallel with other writes
Current Write Behavior
Two sessions do not currently mutate different rows in parallel. Instead,
the server queues mutating statements behind the database-wide write guard.
lock_timeout applies to that wait today.
This means you should not yet build on assumptions such as:
- row-level deadlock detection
- 40001 serialization_failure retries for ordinary write-write conflicts
- SELECT ... FOR UPDATE / SKIP LOCKED job-queue patterns
Those behaviors are planned, but not implemented yet.
Isolation Levels
AxiomDB currently accepts three wire-visible isolation names:
- READ COMMITTED
- REPEATABLE READ (session default)
- SERIALIZABLE
READ COMMITTED and REPEATABLE READ have distinct snapshot behavior today.
SERIALIZABLE is accepted and stored, but currently uses the same frozen-snapshot
policy as REPEATABLE READ; true SSI is still planned.
READ COMMITTED
Each statement within the transaction sees data committed before that statement began. A second SELECT within the same transaction may see different data if another transaction committed between the two SELECTs.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- sees T=100: balance = 1000
-- Txn B commits: UPDATE accounts SET balance = 900 WHERE id = 1
SELECT balance FROM accounts WHERE id = 1; -- sees T=110: balance = 900 (changed!)
COMMIT;
Use READ COMMITTED when:
- You need maximum concurrency
- It is acceptable for each statement to see the freshest committed data
- You are running analytics that can tolerate non-repeatable reads
REPEATABLE READ (default)
The entire transaction sees the snapshot from the moment BEGIN was executed.
No matter how many other transactions commit, your reads return the same data.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- snapshot at T=100: balance = 1000
-- Txn B commits: UPDATE accounts SET balance = 900 WHERE id = 1
SELECT balance FROM accounts WHERE id = 1; -- still sees T=100: balance = 1000
COMMIT;
Use REPEATABLE READ when:
- You need consistent data across multiple reads in one transaction
- Running reports or multi-step calculations where consistency matters
- Implementing optimistic locking patterns
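The two policies differ only in when the snapshot is taken: per statement versus per transaction. A minimal sketch in which a snapshot is a copy of the committed state:

```python
committed = {'balance': 1000}            # latest committed state

class ReadCommittedTxn:
    def select(self, key):
        return dict(committed)[key]      # fresh snapshot per STATEMENT

class RepeatableReadTxn:
    def __init__(self):
        self.snapshot = dict(committed)  # one snapshot, taken at BEGIN
    def select(self, key):
        return self.snapshot[key]

rc, rr = ReadCommittedTxn(), RepeatableReadTxn()
print(rc.select('balance'), rr.select('balance'))   # 1000 1000
committed['balance'] = 900               # another session commits an UPDATE
print(rc.select('balance'))              # 900: READ COMMITTED sees it
print(rr.select('balance'))              # 1000: REPEATABLE READ does not
```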
Isolation Level Comparison
| Phenomenon | READ COMMITTED | REPEATABLE READ |
|---|---|---|
| Dirty reads | Never | Never |
| Non-repeatable reads | Possible | Never |
| Phantom reads | Possible | Prevented by current single-writer runtime |
| Concurrent writes | Serialized globally | Serialized globally |
SERIALIZABLE
SERIALIZABLE is accepted for MySQL/PostgreSQL compatibility, but today it
uses the same frozen snapshot as REPEATABLE READ. The engine does not yet
run Serializable Snapshot Isolation conflict tracking.
E-commerce Checkout — Current Safe Pattern
Until row-level locking lands, the supported stock-reservation pattern is a
guarded UPDATE ... WHERE stock >= ? plus affected-row checks.
BEGIN;
-- Reserve stock atomically; application checks that each UPDATE affects 1 row.
UPDATE products SET stock = stock - 2 WHERE id = 1 AND stock >= 2;
UPDATE products SET stock = stock - 1 WHERE id = 3 AND stock >= 1;
-- Create the order header
INSERT INTO orders (user_id, total, status)
VALUES (99, 149.97, 'paid');
-- Create order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES
(LAST_INSERT_ID(), 1, 2, 49.99),
(LAST_INSERT_ID(), 3, 1, 49.99);
COMMIT;
If any step fails (constraint violation, connection drop, server crash), the WAL ensures the entire transaction is rolled back on recovery.
Transaction Performance Tips
- Keep transactions short. Long-running transactions hold MVCC versions in memory longer, increasing memory pressure.
- Avoid user interaction within a transaction. Never open a transaction and wait for a user to click a button.
- For bulk inserts into clustered tables, wrap all rows in a single BEGIN ... COMMIT block. Phase 40.1 introduces ClusteredInsertBatch: rows are staged in memory, sorted by primary key, and flushed at COMMIT using the rightmost-leaf batch append path. This reduces O(N) CoW page-clone operations to O(N / leaf_capacity) page writes — delivering 55.9K rows/s for 50K sequential-PK rows vs MySQL 8.0 InnoDB's ~35K rows/s (+59%).
- For bulk loads, consider committing every 50,000–100,000 rows to limit WAL growth while keeping the batch-insert speedup.
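The page-write arithmetic behind ClusteredInsertBatch can be checked directly: N per-row CoW clones versus ceil(N / leaf_capacity) batched leaf flushes. This is an illustrative model; the leaf capacity used here is an assumed number, not a real config value:

```python
import math

def page_writes_per_row(n_rows):
    return n_rows                      # unbatched: one CoW page clone per row

def page_writes_batched(n_rows, leaf_capacity):
    # rows staged in memory, sorted by PK, flushed leaf-by-leaf at COMMIT
    return math.ceil(n_rows / leaf_capacity)

n, cap = 50_000, 200                   # cap = assumed rows per leaf
print(page_writes_per_row(n))          # 50000
print(page_writes_batched(n, cap))     # 250
```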
WAL Fsync Pipeline — Current Server Commit Path
Every durable DML commit still needs WAL fsync, but AxiomDB no longer relies on the old timer-based group-commit window for batching. The server now uses an always-on leader-based fsync pipeline:
- one connection becomes the fsync leader
- later commits queue behind that leader if their WAL entry is already buffered
- if the leader’s fsync covers a later commit’s LSN, that later commit returns without paying another fsync
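The leader rule reduces to an LSN comparison: one fsync covers every commit whose WAL entry was already buffered at or below the LSN the leader flushed. A single-threaded sketch of that batching rule; the real pipeline is concurrent:

```python
class FsyncPipeline:
    def __init__(self):
        self.buffered_lsn = 0   # highest WAL LSN written to the OS buffer
        self.durable_lsn = 0    # highest LSN known to be on disk
        self.fsync_calls = 0

    def append(self, lsn):
        self.buffered_lsn = max(self.buffered_lsn, lsn)

    def commit(self, lsn):
        """Return True if this commit had to lead an fsync itself."""
        if lsn <= self.durable_lsn:
            return False        # a leader's earlier fsync already covered us
        self.fsync_calls += 1   # become the leader: one fsync for the batch
        self.durable_lsn = self.buffered_lsn
        return True

p = FsyncPipeline()
for lsn in (1, 2, 3):
    p.append(lsn)               # three commits buffer their WAL entries
led = [p.commit(lsn) for lsn in (1, 2, 3)]
print(led)                      # [True, False, False]: one leader, two free riders
print(p.fsync_calls)            # 1
```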
Catalog and Schema Introspection
AxiomDB maintains an internal catalog that records logical databases, tables, columns, and indexes. The catalog is persisted in system heaps rooted from the meta page and is exposed through convenience commands plus catalog-backed SQL resolution.
Databases
Fresh databases always bootstrap a default logical database named axiomdb.
Existing databases created before multi-database support are upgraded lazily on
open and their legacy tables remain owned by axiomdb.
SHOW DATABASES;
Example output:
| Database |
|---|
| axiomdb |
| analytics |
CREATE DATABASE analytics;
USE analytics;
SELECT DATABASE();
Expected result:
| DATABASE() |
|---|
| analytics |
Tables created before CREATE DATABASE existed remain visible under the default database axiomdb. You do not need to rewrite old table names just to adopt SHOW DATABASES and USE.
System Tables
The catalog exposes six system tables in the axiom schema. They are always readable
without any special privileges.
| Table | Purpose |
|---|---|
axiom_tables | One row per user table |
axiom_columns | One row per column |
axiom_indexes | One row per index (logical metadata; clustered PK rows may reuse the table root) |
axiom_constraints | Named CHECK constraints |
axiom_foreign_keys | FK constraint definitions |
axiom_stats | Per-column NDV and row_count for the query planner |
axiom_tables
Contains one row per user-visible table.
Phase 39.13 adds physical-layout metadata to these rows even though the
introspection surface is still being expanded. The important rule today is:
- explicit PRIMARY KEY table → clustered table root
- no explicit PRIMARY KEY table → heap table root
The catalog now keeps that distinction even before clustered DML is exposed.
| Column | Type | Description |
|---|---|---|
id | BIGINT | Internal table identifier (table_id) |
schema_name | TEXT | Schema name (public by default) |
table_name | TEXT | Name of the table |
column_count | INT | Number of columns |
created_at | BIGINT | LSN at which the table was created |
-- List all user tables
SELECT schema_name, table_name, column_count
FROM axiom_tables
ORDER BY schema_name, table_name;
axiom_columns
Contains one row per column, in declaration order.
| Column | Type | Description |
|---|---|---|
table_id | BIGINT | Foreign key → axiom_tables.id |
table_name | TEXT | Denormalized table name for convenience |
col_index | INT | Zero-based position within the table |
col_name | TEXT | Column name |
data_type | TEXT | SQL type name (e.g., TEXT, BIGINT, DECIMAL) |
not_null | BOOL | TRUE if declared NOT NULL |
default_value | TEXT | DEFAULT expression as a string, or NULL if none |
-- All columns of the orders table
SELECT col_index, col_name, data_type, not_null, default_value
FROM axiom_columns
WHERE table_name = 'orders'
ORDER BY col_index;
axiom_indexes
Contains one row per index (including automatically generated PK and UNIQUE indexes).
| Column | Type | Description |
|---|---|---|
id | BIGINT | Internal index identifier |
table_id | BIGINT | Foreign key → axiom_tables.id |
table_name | TEXT | Denormalized table name |
index_name | TEXT | Index name |
is_unique | BOOL | TRUE for UNIQUE and PRIMARY KEY indexes |
is_primary | BOOL | TRUE for the PRIMARY KEY index |
columns | TEXT | Comma-separated list of indexed column names |
root_page_id | BIGINT | Page ID of the index root; clustered PRIMARY KEY metadata reuses the table root |
-- All indexes on the products table
SELECT index_name, is_unique, is_primary, columns
FROM axiom_indexes
WHERE table_name = 'products'
ORDER BY is_primary DESC, index_name;
Convenience Commands
SHOW DATABASES
Lists all logical databases persisted in the catalog.
SHOW DATABASES;
USE
Changes the selected database for the current connection. Unqualified table names are resolved inside that database.
USE analytics;
SHOW TABLES;
If the database does not exist, AxiomDB returns MySQL error 1049:
USE missing_db;
-- ERROR 1049 (42000): Unknown database 'missing_db'
SHOW TABLES
Lists all tables in the current schema.
SHOW TABLES;
Example output:
| Table name |
|---|
| accounts |
| order_items |
| orders |
| products |
| users |
SHOW TABLES LIKE
Filters by a LIKE pattern.
SHOW TABLES LIKE 'order%';
| Table name |
|---|
| order_items |
| orders |
DESCRIBE (or DESC)
Shows the column structure of a table.
DESCRIBE users;
-- or:
DESC products;
Example output:
| Column | Type | Null | Key | Default |
|---|---|---|---|---|
| id | BIGINT | NO | PRI | AUTO_INCREMENT |
| email | TEXT | NO | UNI | |
| name | TEXT | NO | | |
| age | INT | YES | | |
| created_at | TIMESTAMP | NO | CURRENT_TIMESTAMP |
Introspection Queries
Because the catalog is exposed as regular tables, you can write arbitrary SQL against it.
Find all NOT NULL columns across all tables
SELECT table_name, col_name, data_type
FROM axiom_columns
WHERE not_null = TRUE
ORDER BY table_name, col_index;
Find tables with no indexes
SELECT t.table_name
FROM axiom_tables t
LEFT JOIN axiom_indexes i ON i.table_id = t.id
WHERE i.id IS NULL
ORDER BY t.table_name;
Find foreign key columns that lack an index
-- Assumes FK columns follow the naming convention: <table>_id
SELECT c.table_name, c.col_name
FROM axiom_columns c
LEFT JOIN axiom_indexes i
ON i.table_id = c.table_id
AND i.columns LIKE c.col_name || '%'
WHERE c.col_name LIKE '%_id'
AND c.col_name <> 'id'
AND i.id IS NULL
ORDER BY c.table_name, c.col_name;
Column count per table
SELECT table_name, column_count
FROM axiom_tables
ORDER BY column_count DESC;
Catalog Bootstrap
The catalog is bootstrapped on the very first open() call. AxiomDB allocates the
catalog roots, inserts the default database axiomdb, and makes the catalog durable
before the database accepts traffic. Subsequent opens detect the initialized roots
and skip the bootstrap path.
The bootstrap is idempotent: if AxiomDB crashes during bootstrap, the incomplete
transaction has no COMMIT record in the WAL, so crash recovery discards it and
the next open() re-runs the bootstrap from scratch.
Schema Visibility Rules
The default schema is public. All tables created without an explicit schema prefix
belong to public. System tables live in the axiom schema and are always visible.
-- These are equivalent if the default schema is 'public'
CREATE TABLE users (...);
CREATE TABLE public.users (...);
-- System tables are accessible with or without the axiom. prefix
SELECT * FROM axiom_tables; -- works
SELECT * FROM axiom.axiom_tables; -- also works
Indexes
Indexes are B+ Tree data structures that allow AxiomDB to find rows matching a
condition without scanning the entire table. Every index is a Copy-on-Write B+ Tree
stored in the same .db file as the table data.
Current Storage Model
Today AxiomDB exposes two SQL-visible table layouts:
- Tables without an explicit PRIMARY KEY still use the classic heap + index path.
- Tables with an explicit PRIMARY KEY now bootstrap clustered storage at CREATE TABLE time.
Through Phase 39.18, that clustered SQL boundary has widened:
- the table root is clustered from day one
- PRIMARY KEY catalog metadata points at that clustered root
- INSERT on clustered tables works through the clustered PK tree
- SELECT on clustered tables works through the clustered PK tree and clustered secondary bookmarks
- UPDATE on clustered tables rewrites rows directly in the clustered PK tree
- DELETE on clustered tables applies a delete-mark through the clustered PK tree
- VACUUM table_name on clustered tables physically purges safe dead rows, frees overflow chains, and cleans dead secondary bookmarks
- ALTER TABLE legacy_table REBUILD migrates legacy heap + PRIMARY KEY tables into the clustered layout and rebuilds secondary indexes as PK-bookmark indexes
Internally, the storage rewrite already has clustered insert, point lookup,
range scan, same-leaf update, delete-mark, structural rebalance / relocate-update,
secondary PK bookmarks, and overflow-backed clustered rows for large payloads,
and explicit-PK CREATE TABLE now records that layout in SQL metadata.
Phase 39.14 made the first executor-visible clustered write cut, 39.15
opened the read side, 39.16 brought UPDATE onto that same clustered path,
and 39.17 now does the same for logical clustered DELETE: PK lookups/ranges,
clustered secondary bookmark probes, in-place delete-mark, and rollback-safe
WAL all stay on clustered storage. 39.18 closes the first clustered
maintenance slice too: VACUUM now purges physically dead clustered cells and
their overflow/secondary debris instead of leaving clustered cleanup as a
future-only promise.
That internal rewrite is still honest about its current boundary:
- relocate-update rewrites only the current inline version
- clustered delete is still delete-mark first, with physical removal deferred to a later VACUUM
- large clustered rows can already spill to overflow pages internally, but on the SQL side only explicit-PK tables expose the clustered layout at DDL time
- clustered covering reads still degrade to fetching the clustered row body; a true clustered index-only optimization is still future work
- clustered child-table foreign-key enforcement remains future work
SQLite's WITHOUT ROWID tables and InnoDB both treat the clustered key as the row-storage identity. AxiomDB now does the same for SQL-visible clustered INSERT: no heap fallback row is created, and non-primary indexes store PK bookmarks instead of heap-era RecordId payloads.
Index Statistics and Query Planner
AxiomDB maintains per-column statistics to help the query planner choose between an index scan and a full table scan.
How it works
When you create an index, AxiomDB automatically computes:
- row_count — total visible rows in the table
- ndv (number of distinct values) — exact count of distinct non-NULL values
The planner uses selectivity = 1 / ndv for equality predicates. If more than
20% of the table's rows would be returned, a full table scan is cheaper than an
index scan, so the planner chooses the table scan.
ndv = 3, rows = 10,000 → selectivity = 33% > 20% → Scan
ndv = 100, rows = 10,000 → selectivity = 1% < 20% → Index
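The rule above can be sketched as a small pure function. This is a simplified model of the planner's decision, not the engine's actual code; `choose_access_path` and `AccessPath` are illustrative names:

```rust
// Simplified model of the equality-predicate access-path choice.
#[derive(Debug, PartialEq)]
enum AccessPath {
    IndexLookup,
    FullScan,
}

fn choose_access_path(row_count: u64, ndv: u64) -> AccessPath {
    if ndv == 0 {
        return AccessPath::FullScan; // no statistics: be conservative
    }
    // An equality predicate returns roughly row_count / ndv rows.
    let estimated_rows = row_count as f64 / ndv as f64;
    if estimated_rows > 0.20 * row_count as f64 {
        AccessPath::FullScan // more than 20% of the table: scan is cheaper
    } else {
        AccessPath::IndexLookup
    }
}

fn main() {
    // ndv = 3:   selectivity 33% > 20% -> Scan
    assert_eq!(choose_access_path(10_000, 3), AccessPath::FullScan);
    // ndv = 100: selectivity 1% < 20% -> Index
    assert_eq!(choose_access_path(10_000, 100), AccessPath::IndexLookup);
}
```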
ANALYZE command
Run ANALYZE to refresh statistics after bulk inserts or deletes:
-- Analyze a specific table (all indexed columns)
ANALYZE TABLE users;
-- Analyze a specific column only
ANALYZE TABLE orders (status);
Statistics are automatically computed at CREATE INDEX time. Run ANALYZE when:
- Significant data was added after the index was created
- Query plans seem wrong (e.g., full scan when index would be faster)
Automatic staleness detection
After enough row changes (>20% of the analyzed row count), the planner
automatically uses conservative defaults (ndv = 200) until the next ANALYZE.
This prevents stale statistics from causing poor query plans.
Composite Indexes
A composite index covers two or more columns. The query planner uses it when the WHERE clause contains equality conditions on the leading columns.
CREATE INDEX idx_user_status ON orders(user_id, status);
-- Uses composite index: both leading columns matched
SELECT * FROM orders WHERE user_id = 42 AND status = 'active';
-- Also uses index via prefix scan: leading column only
SELECT * FROM orders WHERE user_id = 42;
-- Does NOT use index: leading column absent from WHERE
SELECT * FROM orders WHERE status = 'active';
Fill Factor
Fill factor controls how full a B-Tree leaf page is allowed to get before it splits. A lower fill factor leaves intentional free space on each page, reducing split frequency for workloads that add rows after index creation.
-- Append-heavy time-series table: pages fill to 70% before splitting.
CREATE INDEX idx_ts ON events(created_at) WITH (fillfactor = 70);
-- Compact read-only index: fill pages completely.
CREATE UNIQUE INDEX uq_email ON users(email) WITH (fillfactor = 100);
-- Default (90%) — equivalent to omitting WITH:
CREATE INDEX idx_x ON t(x);
Range and default
Valid range: 10–100. Default: 90 (matches PostgreSQL’s
BTREE_DEFAULT_FILLFACTOR). fillfactor = 100 reproduces the current behavior
exactly — pages fill completely before splitting.
Effect on splits
With fillfactor = F:
- A leaf page splits when it reaches ⌈F × ORDER_LEAF / 100⌉ entries (instead of at full capacity).
- After a split, each new page holds roughly F/2 % of capacity, leaving room for future inserts without triggering another split.
- Internal pages always fill to capacity (not user-configurable).
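The split threshold is a simple ceiling division. A minimal sketch, using the ORDER_LEAF = 217 constant documented later in this page (the helper function is illustrative, not engine code):

```rust
// Split threshold from the formula ⌈F × ORDER_LEAF / 100⌉.
const ORDER_LEAF: u64 = 217; // (key, RecordId) pairs per leaf node

fn leaf_split_threshold(fillfactor: u64) -> u64 {
    assert!((10..=100).contains(&fillfactor), "valid range is 10..=100");
    // Integer ceiling division of fillfactor * ORDER_LEAF by 100.
    (fillfactor * ORDER_LEAF + 99) / 100
}

fn main() {
    assert_eq!(leaf_split_threshold(100), 217); // fill completely before split
    assert_eq!(leaf_split_threshold(90), 196);  // default: ceil(195.3)
    assert_eq!(leaf_split_threshold(70), 152);  // ceil(151.9)
}
```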
Automatic Indexes
AxiomDB automatically creates a unique B+ Tree index for:
- Every PRIMARY KEY declaration
- Every UNIQUE column constraint or UNIQUE table constraint
For clustered tables, the automatically created PRIMARY KEY metadata row reuses
the clustered table root instead of allocating a second heap-era PK tree.
UNIQUE secondary indexes still allocate ordinary B+ Tree roots, but 39.14
now maintains their entries as secondary_key ++ pk_suffix bookmarks during
SQL-visible clustered INSERT.
Multi-row INSERT on Indexed Tables
Multi-row INSERT ... VALUES (...), (...) statements now stay on a grouped
heap/index path even when the target table already has a PRIMARY KEY or
secondary indexes.
INSERT INTO users VALUES
(1, 'a@example.com'),
(2, 'b@example.com'),
(3, 'c@example.com');
This matters because indexed tables used to fall back to per-row maintenance on this workload. The grouped path keeps the same SQL-visible behavior:
- duplicate PRIMARY KEY / UNIQUE values inside the same statement still fail
- a failed multi-row statement does not leak partially committed rows
- partial indexes still include only rows whose predicate matches
Startup Integrity Verification
When a database opens, AxiomDB verifies every catalog-visible index against the heap-visible rows reconstructed after WAL recovery.
- If the tree is readable but its contents diverge from the heap, AxiomDB rebuilds the index automatically from table contents before serving traffic.
- If the tree cannot be traversed safely, open fails with IndexIntegrityFailure instead of guessing.
This check applies to both embedded mode and server mode because both call the same startup verifier.
This combines amcheck's “never trust an unreadable B-Tree” rule
with SQLite's REINDEX-style rebuild-from-table approach. Readable divergence is
healed automatically from heap data; unreadable trees still block open.
Creating Indexes Manually
CREATE [UNIQUE] INDEX index_name ON table_name (col1 [ASC|DESC], col2 ...);
CREATE INDEX idx_users_name ON users (name);
CREATE INDEX idx_orders_user ON orders (user_id, placed_at DESC);
CREATE UNIQUE INDEX uq_sku ON products (sku);
See DDL — CREATE INDEX for the full syntax.
When Indexes Help
The query planner considers an index when:
- The leading column(s) of the index appear in a WHERE equality or range condition.
- The index columns match the ORDER BY direction and order (avoids a sort step).
- The index is selective enough that scanning it is cheaper than a full table scan.
-- Good: leading column (user_id) used in WHERE
CREATE INDEX idx_orders_user ON orders (user_id, placed_at DESC);
SELECT * FROM orders WHERE user_id = 42 ORDER BY placed_at DESC;
-- Bad: leading column not in WHERE — index not used
SELECT * FROM orders WHERE placed_at > '2026-01-01';
-- Solution: create a separate index on placed_at
CREATE INDEX idx_orders_date ON orders (placed_at);
Composite Index Column Order
The order of columns in a composite index determines which query patterns it
accelerates. The B+ Tree is sorted by the concatenated key (col1, col2, ...).
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
This index accelerates:
- WHERE user_id = 42
- WHERE user_id = 42 AND status = 'paid'
This index does NOT accelerate:
- WHERE status = 'paid' (leading column not constrained)
Rule of thumb: put the highest-selectivity, most frequently filtered column first.
Partial Indexes
A partial index covers only the rows matching a WHERE predicate. This reduces index
size and maintenance cost.
-- Index only pending orders (the common access pattern)
CREATE INDEX idx_pending_orders ON orders (user_id)
WHERE status = 'pending';
-- Index only non-deleted users
CREATE INDEX idx_active_users ON users (email)
WHERE deleted_at IS NULL;
The query planner uses a partial index only when the query’s WHERE clause implies the index’s predicate.
Index Key Size Limit
The B+ Tree stores encoded keys up to 768 bytes. For most column types this is never an issue:
- INT, BIGINT, UUID, TIMESTAMP — fixed-size, always well under the limit.
- TEXT, VARCHAR — a 760-character value will just fit. If you index a column with very long strings (> 750 characters), rows exceeding the limit are silently skipped at CREATE INDEX time, and INSERT returns IndexKeyTooLong.
Query Planner — Phase 6.3
The planner rewrites the execution plan before running the scan. Currently recognized patterns:
Equality lookup — exact match on the leading indexed column:
-- Uses B-Tree point lookup (O(log n) instead of O(n))
SELECT * FROM users WHERE email = 'alice@example.com';
SELECT * FROM orders WHERE id = 42;
This includes the PRIMARY KEY. A query like WHERE id = 42 does not need a
redundant secondary index on id.
Range scan — upper and lower bound on the leading indexed column:
-- Uses B-Tree range scan
SELECT * FROM orders WHERE created_at > '2024-01-01' AND created_at < '2025-01-01';
SELECT * FROM products WHERE price >= 10.0 AND price <= 50.0;
Full scan fallback — any pattern not recognized above:
-- Falls back to full table scan (no index for OR, function, or non-leading column)
SELECT * FROM users WHERE email LIKE '%gmail.com';
SELECT * FROM orders WHERE status = 'paid' OR total > 1000;
Partial Indexes
A partial index covers only the rows matching a WHERE predicate. This reduces index size, speeds up maintenance, and — for UNIQUE indexes — restricts uniqueness enforcement to the matching subset.
-- Only active users need unique emails.
CREATE UNIQUE INDEX uq_active_email ON users(email) WHERE deleted_at IS NULL;
-- Index only pending orders for fast user lookups.
CREATE INDEX idx_pending ON orders(user_id) WHERE status = 'pending';
Partial UNIQUE indexes
The uniqueness constraint applies only among rows satisfying the predicate. Rows that do not satisfy the predicate are never inserted into the index.
-- alice deleted, then re-created: no conflict.
INSERT INTO users VALUES (1, 'alice@x.com', '2025-01-01'); -- deleted
INSERT INTO users VALUES (2, 'alice@x.com', NULL); -- active ✅
INSERT INTO users VALUES (3, 'alice@x.com', NULL); -- ❌ UniqueViolation
INSERT INTO users VALUES (4, 'alice@x.com', '2025-06-01'); -- deleted ✅
Planner support
The planner uses a partial index only when the query’s WHERE clause implies the index predicate. If the implication cannot be verified, the planner falls back to a full scan or a full index — always correct.
-- Uses partial index (WHERE contains `deleted_at IS NULL`):
SELECT * FROM users WHERE email = 'alice@x.com' AND deleted_at IS NULL;
-- Falls back to full scan (predicate not in WHERE):
SELECT * FROM users WHERE email = 'alice@x.com';
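A toy version of that implication test: here the planner may use the partial index only when the query's WHERE conjuncts literally contain the index predicate. The real check is richer; the function and its names are illustrative, not the engine's API:

```rust
// Conservative implication check: exact-conjunct match only.
// Anything it cannot prove falls back to a full scan, which is always correct.
fn where_implies_predicate(query_conjuncts: &[&str], index_predicate: &str) -> bool {
    query_conjuncts.contains(&index_predicate)
}

fn main() {
    let predicate = "deleted_at IS NULL";
    // Query repeats the predicate: partial index is usable.
    assert!(where_implies_predicate(
        &["email = 'alice@x.com'", "deleted_at IS NULL"],
        predicate
    ));
    // Predicate absent from WHERE: fall back to a full scan.
    assert!(!where_implies_predicate(&["email = 'alice@x.com'"], predicate));
}
```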
A partial unique index on the active subset (WHERE deleted_at IS NULL) is
typically 10–100× smaller than a full unique index, since most rows in
high-churn tables are in the deleted state. This reduces build time,
per-INSERT maintenance cost, and Bloom filter memory. MySQL InnoDB does not
support partial indexes, so this optimization is not available there.
Foreign Key Constraints
Foreign key constraints ensure referential integrity between tables. Every non-NULL value in the FK column of the child table must reference an existing row in the parent table.
-- Inline REFERENCES syntax
CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT REFERENCES users(id) ON DELETE CASCADE
);
-- Table-level FOREIGN KEY syntax
CREATE TABLE order_items (
id INT PRIMARY KEY,
order_id INT,
product_id INT,
CONSTRAINT fk_order FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
CONSTRAINT fk_product FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE RESTRICT
);
-- Add FK after the fact
ALTER TABLE orders
ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id);
-- Remove a FK constraint
ALTER TABLE orders DROP CONSTRAINT fk_user;
ON DELETE Actions
| Action | Behavior |
|---|---|
| RESTRICT / NO ACTION (default) | Error if child rows reference the deleted parent row |
| CASCADE | Automatically delete all child rows (recursive, max depth 10) |
| SET NULL | Set child FK column to NULL (column must be nullable) |
Enforcement Examples
CREATE TABLE users (id INT PRIMARY KEY, email TEXT);
CREATE TABLE orders (id INT PRIMARY KEY, user_id INT REFERENCES users(id) ON DELETE CASCADE);
INSERT INTO users VALUES (1, 'alice@x.com');
INSERT INTO orders VALUES (10, 1); -- ✅ user 1 exists
-- INSERT with missing parent → error
INSERT INTO orders VALUES (20, 999);
-- ERROR 23503: Foreign key constraint fails: 'orders.user_id' = '999'
-- DELETE parent with CASCADE → child rows automatically deleted
DELETE FROM users WHERE id = 1;
SELECT COUNT(*) FROM orders; -- → 0 (orders were cascaded)
-- DELETE parent with RESTRICT (default) → blocked if children exist
CREATE TABLE invoices (id INT PRIMARY KEY, order_id INT REFERENCES orders(id));
INSERT INTO users VALUES (2, 'bob@x.com');
INSERT INTO orders VALUES (30, 2);
INSERT INTO invoices VALUES (1, 30);
DELETE FROM orders WHERE id = 30;
-- ERROR 23503: foreign key constraint "fk_invoices_order_id": invoices.order_id references this row
NULL FK Values
A NULL value in a FK column is always allowed — it does not reference any parent row. This follows SQL standard MATCH SIMPLE semantics.
INSERT INTO orders VALUES (99, NULL); -- ✅ NULL user_id is always allowed
ON UPDATE
Only ON UPDATE RESTRICT (the default) is enforced. Updating a parent key while
child rows reference it is rejected. ON UPDATE CASCADE and ON UPDATE SET NULL
are planned for Phase 6.10.
Current Limitations
- Only single-column FKs are supported. Composite FKs — FOREIGN KEY (a, b) REFERENCES t(x, y) — are planned for Phase 6.10.
- ON UPDATE CASCADE / ON UPDATE SET NULL are planned for Phase 6.10.
- FK validation uses B-Tree range scans via the FK auto-index (Phase 6.9); pre-6.9 FKs fall back to a full table scan.
Bloom Filter Optimization
AxiomDB maintains an in-memory Bloom filter for each secondary index. The filter allows the query executor to skip B-Tree page reads entirely when a lookup key is definitively absent from the index.
How It Works
When the planner chooses an index lookup for a WHERE col = value condition,
the executor checks the Bloom filter before touching the B-Tree:
- Filter says no → key is 100% absent. Zero B-Tree pages read. Empty result returned immediately.
- Filter says maybe → normal B-Tree lookup proceeds.
The filter is a probabilistic data structure: it never produces false negatives (a key that exists will always get a “maybe”), but can produce false positives (a key that does not exist may occasionally get a “maybe” instead of “no”). The false positive rate is tuned to 1% — at most 1 in 100 absent-key lookups will still read the B-Tree.
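The no-false-negatives guarantee can be sketched with a minimal filter. The bit count and two-probe scheme below are illustrative choices, not AxiomDB's actual parameters:

```rust
// Minimal Bloom filter: a key that was inserted always answers "maybe".
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
}

impl BloomFilter {
    fn new(nbits: usize) -> Self {
        BloomFilter { bits: vec![false; nbits] }
    }

    // Derive two probe positions from one 64-bit hash.
    fn probes(&self, key: &str) -> [usize; 2] {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let v = h.finish();
        [(v as usize) % self.bits.len(), ((v >> 32) as usize) % self.bits.len()]
    }

    fn insert(&mut self, key: &str) {
        for p in self.probes(key) {
            self.bits[p] = true;
        }
    }

    /// false => definitely absent; true => maybe present.
    fn might_exist(&self, key: &str) -> bool {
        self.probes(key).iter().all(|&p| self.bits[p])
    }
}

fn main() {
    let mut f = BloomFilter::new(1024);
    f.insert("alice@example.com");
    // Never a false negative: an inserted key always gets "maybe".
    assert!(f.might_exist("alice@example.com"));
}
```

A "no" answer skips the B-Tree entirely; a "maybe" proceeds to the normal lookup, which is why occasional false positives cost only a wasted read, never a wrong result.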
Lifecycle
| Event | Effect on Bloom filter |
|---|---|
| CREATE INDEX | Filter created and populated with all existing keys |
| INSERT | New key added to filter |
| UPDATE | Old key marks filter dirty; new key added |
| DELETE | Filter marked dirty (deleted keys cannot be removed from a standard Bloom filter) |
| DROP INDEX | Filter removed from memory |
| Server restart | Filters start empty; might_exist returns true (conservative) until CREATE INDEX is run again |
Dirty Filters
After a DELETE or UPDATE, the filter is marked dirty: it may still
return “maybe” for keys that were deleted. This does not affect correctness —
the B-Tree lookup simply finds no matching row. It only means that some absent
keys may not benefit from the zero-I/O shortcut until the filter is rebuilt via
ANALYZE TABLE (available since Phase 6.12).
Run ANALYZE TABLE t periodically to rebuild the filter and restore
optimal miss performance.
Dropping an Index
-- MySQL syntax (required when the server is in MySQL wire protocol mode)
DROP INDEX index_name ON table_name;
DROP INDEX IF EXISTS idx_old ON table_name;
Dropping an index frees all B-Tree pages, reclaiming disk space immediately.
Dropping an index that backs a PRIMARY KEY or UNIQUE constraint requires dropping the
constraint first (via ALTER TABLE DROP CONSTRAINT).
Index Introspection
-- All indexes on a table
SELECT index_name, is_unique, is_primary, columns
FROM axiom_indexes
WHERE table_name = 'orders'
ORDER BY is_primary DESC, index_name;
-- Root page of each index (useful for storage analysis)
SELECT index_name, root_page_id
FROM axiom_indexes;
Index-Only Scans (Covering Indexes)
When every column referenced by a SELECT is already stored as a key column of
the chosen index, AxiomDB can satisfy the query entirely from the B-Tree — no
heap page read is needed. This is called an index-only scan.
Example
CREATE INDEX idx_age ON users (age);
-- Index-only scan: only column needed (age) is the index key.
SELECT age FROM users WHERE age = 25;
The executor reads the matching B-Tree leaf entries, extracts the age value
from the encoded key bytes, and returns the rows without ever touching the heap.
INCLUDE syntax — declaring covering intent
You can declare additional columns as part of a covering index using the
INCLUDE clause:
CREATE INDEX idx_name_dept ON employees (name) INCLUDE (department, salary);
INCLUDE columns are recorded in the catalog metadata so the planner knows
the index covers those columns. Note: physical storage of INCLUDE column
values in B-Tree leaf nodes is deferred to a future covering-index phase. Until then, the planner
uses INCLUDE to correctly identify IndexOnlyScan opportunities, but the
values are read from the key portion of the B-Tree entry.
MVCC and the 24-byte header read
Index-only scans still perform a lightweight visibility check per row. For each
B-Tree entry, the executor reads only the 24-byte RowHeader (the slot header
containing txn_id_created, txn_id_deleted, and sequence number) to determine
whether the row is visible to the current transaction snapshot. The full row
payload is never decoded.
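An illustrative layout for that 24-byte header, using the field names from the text (the exact on-disk layout and the visibility rule shown are assumptions for illustration):

```rust
// Three u64 fields = 24 bytes, matching the lightweight per-row read.
#[repr(C)]
struct RowHeader {
    txn_id_created: u64, // transaction that created this row version
    txn_id_deleted: u64, // 0 here stands for "not deleted"
    sequence: u64,       // row sequence number
}

/// Sketch of a snapshot check: visible if created at or before the
/// snapshot and not deleted before it.
fn is_visible(h: &RowHeader, snapshot_txn: u64) -> bool {
    h.txn_id_created <= snapshot_txn
        && (h.txn_id_deleted == 0 || h.txn_id_deleted > snapshot_txn)
}

fn main() {
    assert_eq!(std::mem::size_of::<RowHeader>(), 24); // the 24-byte read
    let live = RowHeader { txn_id_created: 5, txn_id_deleted: 0, sequence: 1 };
    assert!(is_visible(&live, 10));
    let gone = RowHeader { txn_id_created: 5, txn_id_deleted: 8, sequence: 2 };
    assert!(!is_visible(&gone, 10)); // deleted before the snapshot
}
```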
Non-Unique Secondary Index Key Format
Non-unique secondary indexes store the indexed column values together with the
row’s RecordId as the B-Tree key:
key = encode_index_key(col_vals) || encode_rid(rid) // 10-byte RecordId suffix
This ensures every B-Tree entry is globally unique even when multiple rows share
the same indexed value — making INSERT safe without a DuplicateKey error.
When looking up all rows with a given indexed value, the executor performs a range scan with synthetic bounds:
lo = encode_index_key(val) || [0x00; 10] // smallest possible RecordId
hi = encode_index_key(val) || [0xFF; 10] // largest possible RecordId
The suffix is the physical RecordId (page_id + slot_id + sequence number)
rather than a separate primary key column, keeping it at a fixed 10 bytes
regardless of the table's key type.
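The key format and the synthetic range bounds above can be sketched with byte vectors. The encodings are illustrative stand-ins for encode_index_key / encode_rid, not the real wire format:

```rust
// Secondary entry = indexed value bytes || fixed 10-byte RecordId suffix.
fn encode_entry(index_key: &[u8], rid: &[u8; 10]) -> Vec<u8> {
    let mut k = index_key.to_vec();
    k.extend_from_slice(rid); // suffix keeps every entry globally unique
    k
}

fn main() {
    let val = b"active"; // stand-in for encode_index_key(col_vals)

    // Synthetic bounds for "all rows with this value":
    let lo = encode_entry(val, &[0x00; 10]); // smallest possible RecordId
    let hi = encode_entry(val, &[0xFF; 10]); // largest possible RecordId

    // Any real entry for this value sorts inside [lo, hi].
    let entry = encode_entry(val, &[0, 0, 0, 42, 0, 1, 0, 0, 0, 7]);
    assert!(lo <= entry && entry <= hi);
}
```

Because `Vec<u8>` compares lexicographically, a B-Tree range scan from `lo` to `hi` visits exactly the entries sharing the indexed value.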
Phase 39.9 adds a second, internal-only secondary-key path for the clustered
rewrite: there the physical entry is secondary_key ++ missing_primary_key_columns
so a future clustered executor can jump from a secondary entry to the owning
PRIMARY KEY row without depending on a heap slot. Phases 39.11 and 39.12
already add internal WAL/rollback and crash recovery for clustered rows by
primary key and exact row image, but that path is still not SQL-visible yet.
B+ Tree Implementation Details
AxiomDB’s B+ Tree is a Copy-on-Write structure backed by the StorageEngine trait.
Key properties:
- ORDER_INTERNAL = 223: up to 223 separator keys and 224 child pointers per internal node
- ORDER_LEAF = 217: up to 217 (key, RecordId) pairs per leaf node
- 16 KB pages: both internal and leaf nodes fit exactly in one page
- AtomicU64 root: root page swapped atomically — readers are lock-free
- CoW semantics: writes copy the path from root to the modified leaf; old versions are visible to concurrent readers until they finish
See B+ Tree Internals for the on-disk format and the derivation of the ORDER constants.
Embedded Mode
AxiomDB can run in-process — inside your application, with no TCP server, no daemon, no network round-trips. This is the SQLite model: the database is a library you link against, not a process you connect to.
The embedded crate ships two APIs:
| API | Language | Use case |
|---|---|---|
| Db | Rust | Native Rust apps, desktop, CLI tools |
| axiomdb_open / axiomdb_query / … | C | C, C++, Python (ctypes), Swift, Kotlin JNI, Unity |
| AsyncDb | Rust + Tokio | Async Rust services |
Since Phase 5.15, Db::open_dsn and axiomdb_open_dsn accept filesystem DSNs and reject remote wire endpoints explicitly.
Build profiles
# Cargo.toml
[dependencies]
axiomdb-embedded = { path = "...", features = ["desktop"] } # default
# axiomdb-embedded = { path = "...", features = ["async-api"] } # + tokio
| Feature | Includes | Binary output |
|---|---|---|
| desktop (default) | Rust sync API + C FFI | .dylib / .so / .dll + .a |
| async-api | + tokio async wrapper | same + async |
| wasm | sync, in-memory (future) | .wasm |
The desktop build produces a ~1.1 MB dynamic library. The server binary (with full wire protocol) is ~2.1 MB. You get a leaner binary by only linking what you need.
Rust API
Opening a database
use axiomdb_embedded::Db;

// Creates ./myapp.db and ./myapp.wal if they don't exist.
// Runs crash recovery automatically if the WAL has uncommitted entries.
// Also verifies every catalog-visible index before returning the handle.
let mut db = Db::open("./myapp.db")?;
let mut db2 = Db::open_dsn("file:/tmp/myapp.db")?;
let mut db3 = Db::open_dsn("axiomdb:///tmp/myapp")?;
Remote DSNs such as postgres://user@127.0.0.1:5432/app are not supported by
embedded mode in Phase 5.15 and return DbError::InvalidDsn.
If startup index verification fails, open returns DbError::IndexIntegrityFailure and the handle is never created.
DDL and DML
db.execute("CREATE TABLE users (id INT NOT NULL, name TEXT, score REAL)")?;

let affected = db.execute("INSERT INTO users VALUES (1, 'Alice', 9.5)")?;
assert_eq!(affected, 1);

let affected = db.execute("UPDATE users SET score = 10.0 WHERE id = 1")?;
assert_eq!(affected, 1);

let affected = db.execute("DELETE FROM users WHERE score < 5.0")?;
SELECT — rows only
let rows = db.query("SELECT * FROM users WHERE score > 8.0")?;
for row in &rows {
    // row is Vec<Value> — one Value per column
    println!("{:?}", row);
}
SELECT — rows + column names
Use query_with_columns when you need the column names at runtime (building a
table display, serializing to JSON, passing headers to a UI component, etc.).
let (columns, rows) = db.query_with_columns("SELECT id, name FROM users")?;
println!("columns: {:?}", columns); // ["id", "name"]
for row in &rows {
    for (col, val) in columns.iter().zip(row.iter()) {
        println!("{col} = {val}");
    }
}
Full QueryResult (metadata + last_insert_id)
use axiomdb_sql::result::QueryResult;

match db.run("INSERT INTO users VALUES (2, 'Bob', 7.2)")? {
    QueryResult::Affected { count, last_insert_id } => {
        println!("inserted {count} row, id = {:?}", last_insert_id);
    }
    QueryResult::Rows { columns, rows } => { /* SELECT */ }
    QueryResult::Empty => { /* DDL */ }
}
Explicit transactions
db.begin()?;
db.execute("INSERT INTO orders VALUES (1, 100.0)")?;
db.execute("UPDATE inventory SET qty = qty - 1 WHERE id = 42")?;
db.commit()?;

// Or:
db.begin()?;
// ... something goes wrong ...
db.rollback()?;
Error handling
match db.query("SELECT * FROM nonexistent") {
    Ok(rows) => { /* ... */ }
    Err(e) => {
        eprintln!("query failed: {e}");
        // Also accessible as a string for display/logging:
        if let Some(msg) = db.last_error() {
            eprintln!("last error: {msg}");
        }
    }
}
Async (Tokio)
use axiomdb_embedded::async_db::AsyncDb;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = AsyncDb::open("./myapp.db").await?;
    let db2 = AsyncDb::open_dsn("file:/tmp/myapp.db").await?;
    db.execute("CREATE TABLE t (id INT)").await?;
    let (columns, rows) = db.query_with_columns("SELECT * FROM t").await?;
    Ok(())
}
AsyncDb wraps Db in an Arc<Mutex<Db>> and runs each operation in
tokio::task::spawn_blocking, keeping the async executor unblocked.
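That hand-off pattern can be sketched with plain threads; tokio's spawn_blocking does essentially the same thing against a managed worker pool. BlockingDb below is a stand-in type for illustration, not the real Db:

```rust
// Share a blocking handle via Arc<Mutex<_>> and run the blocking call
// off the calling thread, returning the result through join().
use std::sync::{Arc, Mutex};
use std::thread;

struct BlockingDb; // stand-in for the synchronous Db handle

impl BlockingDb {
    fn execute(&mut self, _sql: &str) -> usize {
        1 // pretend one row was affected
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(BlockingDb));

    // Clone the Arc into the worker and lock inside the closure,
    // so the spawning thread stays free while the call runs.
    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || shared.lock().unwrap().execute("INSERT ..."))
    };

    assert_eq!(worker.join().unwrap(), 1);
}
```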
Persist and reopen
The database persists on disk. Close it (drop the Db) and reopen it from
another process or session:
{
    let mut db = Db::open("./data.db")?;
    db.execute("CREATE TABLE log (ts BIGINT, msg TEXT)")?;
    db.execute("INSERT INTO log VALUES (1700000000, 'started')")?;
} // db is dropped here — WAL is flushed, file lock released

// Later — in the same process or a different one:
let mut db = Db::open("./data.db")?;
let rows = db.query("SELECT * FROM log")?;
assert_eq!(rows.len(), 1);
C API
Link against libaxiomdb_embedded.{so,dylib,dll} or the static libaxiomdb_embedded.a.
Header
#include "axiomdb.h"
A minimal axiomdb.h to copy into your project:
#pragma once
#include <stdint.h>
#include <stddef.h>
typedef struct AxiomDb AxiomDb;
typedef struct AxiomRows AxiomRows;
/* Type codes — same as SQLite for easy porting */
#define AXIOMDB_NULL 0
#define AXIOMDB_INTEGER 1 /* Bool, Int, BigInt, Date (days), Timestamp (µs) */
#define AXIOMDB_REAL 2 /* Real, Decimal */
#define AXIOMDB_TEXT 3 /* Text, UUID */
#define AXIOMDB_BLOB 4 /* Bytes */
/* Open / close */
AxiomDb* axiomdb_open (const char* path);
AxiomDb* axiomdb_open_dsn (const char* dsn);
void axiomdb_close (AxiomDb* db);
/* Execute DML/DDL — returns rows affected, or -1 on error */
int64_t axiomdb_execute (AxiomDb* db, const char* sql);
/* Query — returns result set, or NULL on error */
AxiomRows* axiomdb_query (AxiomDb* db, const char* sql);
/* Result set accessors */
int64_t axiomdb_rows_count (const AxiomRows* rows);
int32_t axiomdb_rows_columns (const AxiomRows* rows);
const char* axiomdb_rows_column_name (const AxiomRows* rows, int32_t col);
int32_t axiomdb_rows_type (const AxiomRows* rows, int64_t row, int32_t col);
int64_t axiomdb_rows_get_int (const AxiomRows* rows, int64_t row, int32_t col);
double axiomdb_rows_get_double (const AxiomRows* rows, int64_t row, int32_t col);
const char* axiomdb_rows_get_text (const AxiomRows* rows, int64_t row, int32_t col);
const uint8_t* axiomdb_rows_get_blob (const AxiomRows* rows, int64_t row, int32_t col, size_t* len);
void axiomdb_rows_free (AxiomRows* rows);
/* Last error message for this db handle — NULL if last operation succeeded */
const char* axiomdb_last_error (const AxiomDb* db);
Complete example
#include <stdio.h>
#include <stdint.h>
#include "axiomdb.h"
int main(void) {
AxiomDb* db = axiomdb_open("./app.db");
AxiomDb* db2 = axiomdb_open_dsn("file:/tmp/app.db");
if (!db) { fprintf(stderr, "failed to open db\n"); return 1; }
axiomdb_execute(db,
"CREATE TABLE IF NOT EXISTS products ("
" id INT NOT NULL, name TEXT, price REAL, active INTEGER"
")");
axiomdb_execute(db, "INSERT INTO products VALUES (1, 'Widget', 9.99, 1)");
axiomdb_execute(db, "INSERT INTO products VALUES (2, 'Gadget', 24.50, 1)");
axiomdb_execute(db, "INSERT INTO products VALUES (3, 'Donut', 1.25, 0)");
AxiomRows* rows = axiomdb_query(db,
"SELECT id, name, price FROM products WHERE active = 1");
if (!rows) {
fprintf(stderr, "query error: %s\n", axiomdb_last_error(db));
axiomdb_close(db);
return 1;
}
/* Print header */
int32_t ncols = axiomdb_rows_columns(rows);
for (int32_t c = 0; c < ncols; c++) {
printf("%-12s", axiomdb_rows_column_name(rows, c));
}
printf("\n");
/* Print rows */
int64_t nrows = axiomdb_rows_count(rows);
for (int64_t r = 0; r < nrows; r++) {
for (int32_t c = 0; c < ncols; c++) {
switch (axiomdb_rows_type(rows, r, c)) {
case AXIOMDB_INTEGER:
printf("%-12lld", (long long)axiomdb_rows_get_int(rows, r, c));
break;
case AXIOMDB_REAL:
printf("%-12.2f", axiomdb_rows_get_double(rows, r, c));
break;
case AXIOMDB_TEXT:
printf("%-12s", axiomdb_rows_get_text(rows, r, c));
break;
case AXIOMDB_NULL:
printf("%-12s", "NULL");
break;
default:
printf("%-12s", "?");
}
}
printf("\n");
}
axiomdb_rows_free(rows);
axiomdb_close(db);
axiomdb_close(db2);
return 0;
}
Output:
id name price
1 Widget 9.99
2 Gadget 24.50
Type mapping
| SQL type | C accessor | C type |
|---|---|---|
| BOOL | axiomdb_rows_get_int | 0 or 1 |
| INT | axiomdb_rows_get_int | int64_t |
| BIGINT | axiomdb_rows_get_int | int64_t |
| REAL / DOUBLE | axiomdb_rows_get_double | double |
| DECIMAL | axiomdb_rows_get_double | double (may lose precision for >15 digits) |
| TEXT / VARCHAR | axiomdb_rows_get_text | const char* (UTF-8) |
| UUID | axiomdb_rows_get_text | const char* (xxxxxxxx-xxxx-…) |
| DATE | axiomdb_rows_get_int | days since 1970-01-01 |
| TIMESTAMP | axiomdb_rows_get_int | microseconds since 1970-01-01 UTC |
| BLOB / BYTEA | axiomdb_rows_get_blob | const uint8_t* + size_t len |
| NULL | type code = AXIOMDB_NULL | — |
Pointers returned by axiomdb_rows_get_text, axiomdb_rows_get_blob, and axiomdb_rows_column_name are valid until axiomdb_rows_free is called. Copy the data if you need it to outlive the result set.
Python (ctypes)
import ctypes, os
lib = ctypes.CDLL("./libaxiomdb_embedded.dylib") # or .so on Linux
lib.axiomdb_open.restype = ctypes.c_void_p
lib.axiomdb_open.argtypes = [ctypes.c_char_p]
lib.axiomdb_execute.restype = ctypes.c_int64
lib.axiomdb_execute.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
lib.axiomdb_query.restype = ctypes.c_void_p
lib.axiomdb_query.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
lib.axiomdb_rows_count.restype = ctypes.c_int64
lib.axiomdb_rows_count.argtypes = [ctypes.c_void_p]
lib.axiomdb_rows_get_int.restype = ctypes.c_int64
lib.axiomdb_rows_get_int.argtypes = [ctypes.c_void_p, ctypes.c_int64, ctypes.c_int32]
lib.axiomdb_rows_get_text.restype = ctypes.c_char_p
lib.axiomdb_rows_get_text.argtypes = [ctypes.c_void_p, ctypes.c_int64, ctypes.c_int32]
lib.axiomdb_rows_free.argtypes = [ctypes.c_void_p]
lib.axiomdb_close.argtypes = [ctypes.c_void_p]

db = lib.axiomdb_open(b"./app.db")
lib.axiomdb_execute(db, b"CREATE TABLE IF NOT EXISTS t (id INT, name TEXT)")
lib.axiomdb_execute(db, b"INSERT INTO t VALUES (1, 'hello')")

rows = lib.axiomdb_query(db, b"SELECT id, name FROM t")
for r in range(lib.axiomdb_rows_count(rows)):
    id_ = lib.axiomdb_rows_get_int(rows, r, 0)    # INT column -> get_int
    name = lib.axiomdb_rows_get_text(rows, r, 1)  # TEXT column -> get_text
    print(f"id={id_}, name={name.decode()}")
lib.axiomdb_rows_free(rows)
lib.axiomdb_close(db)
Build the shared library
# Dynamic library (.dylib / .so / .dll)
cargo build --release -p axiomdb-embedded
# Static library (.a) — for iOS, embedded systems, Unity AOT
cargo build --release -p axiomdb-embedded
# → target/release/libaxiomdb_embedded.a
# With async support
cargo build --release -p axiomdb-embedded --features async-api
Output files are in target/release/:
- macOS: libaxiomdb_embedded.dylib
- Linux: libaxiomdb_embedded.so
- Windows: axiomdb_embedded.dll
- All platforms: libaxiomdb_embedded.a (static)
Error Reference
AxiomDB returns structured errors with a SQLSTATE code, a human-readable message, and optional detail fields. Understanding these codes allows applications to handle specific failure scenarios correctly (for example: catching a uniqueness violation to show an “email already taken” message rather than a generic error page).
Error Format
Every error from AxiomDB is represented as an ErrorResponse struct with these fields:
| Field | Type | Always present? | Description |
|---|---|---|---|
| sqlstate | string (5 chars) | Yes | SQLSTATE code for programmatic handling (e.g. "23505") |
| severity | string | Yes | "ERROR", "WARNING", or "NOTICE" |
| message | string | Yes | Short human-readable description. Do not parse this — use sqlstate |
| detail | string | Sometimes | Extended context about the failure (offending value, referenced row) |
| hint | string | Sometimes | Actionable suggestion for how to fix the error |
| position | integer | Sometimes | 0-based byte offset of the unexpected token in the SQL string (parse errors only) |
{
"sqlstate": "23505",
"severity": "ERROR",
"message": "unique key violation on index 'users_email_idx'",
"detail": "Key (value)=(alice@example.com) is already present in index users_email_idx.",
"hint": "A row with the same value already exists in index users_email_idx. Use INSERT ... ON CONFLICT to handle duplicates."
}
{
"sqlstate": "42601",
"severity": "ERROR",
"message": "SQL syntax error: unexpected token 'FORM'",
"position": 9
}
Always use sqlstate for programmatic handling. The message text may change between versions; SQLSTATE codes are stable.
When using the MySQL wire protocol, the error is delivered as a MySQL error packet
with the SQLSTATE code in the sql_state field (5 bytes following the # marker).
JSON Error Format
For clients that need structured errors without screen-scraping message strings, AxiomDB
supports a JSON error format that carries all ErrorResponse fields in the MySQL ERR packet:
SET error_format = 'json';
After this, every ERR packet carries a JSON string instead of plain text:
{"code":1064,"sqlstate":"42601","severity":"ERROR","message":"SQL syntax error: unexpected token 'FORM'","position":9}
{"code":1062,"sqlstate":"23505","severity":"ERROR","message":"unique key violation on index 'users_email_idx'","detail":"Key (value)=(alice@example.com) is already present in index users_email_idx."}
Reset to plain text with SET error_format = 'text'. This setting is per-connection and
does not persist after disconnect.
When error_format = 'json' is set, the server serializes the full
ErrorResponse as a JSON string in the ERR packet's message field. This mirrors
how PostgreSQL's ErrorResponse packet carries detail, hint, and position in
separate fields — achieving the same semantics over MySQL's more limited protocol.
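A client can consume both formats by trying JSON first and falling back to plain text. A sketch (the AxiomError dataclass and parse_err_payload are illustrative names, not a shipped client API):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class AxiomError:
    sqlstate: str
    severity: str
    message: str
    detail: Optional[str] = None
    hint: Optional[str] = None
    position: Optional[int] = None

def parse_err_payload(text: str) -> AxiomError:
    """Handle both error_format settings: a JSON object or plain message text."""
    try:
        obj = json.loads(text)
    except ValueError:
        obj = None
    if not isinstance(obj, dict):
        # plain-text ERR packet: no structured fields available
        return AxiomError(sqlstate="HY000", severity="ERROR", message=text)
    return AxiomError(
        sqlstate=obj.get("sqlstate", "HY000"),
        severity=obj.get("severity", "ERROR"),
        message=obj.get("message", ""),
        detail=obj.get("detail"),
        hint=obj.get("hint"),
        position=obj.get("position"),
    )

err = parse_err_payload(
    '{"code":1064,"sqlstate":"42601","severity":"ERROR",'
    '"message":"SQL syntax error: unexpected token \'FORM\'","position":9}'
)
print(err.sqlstate, err.position)  # 42601 9
```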
Integrity Constraint Violations (Class 23)
These errors indicate that an INSERT, UPDATE, or DELETE violated a declared constraint. The application should handle them and return a user-facing message.
Note: the following column-level constraint declarations are not yet enforced at write time:
- NOT NULL — declared columns accept NULL without error
- UNIQUE — duplicate values are allowed
- CHECK — expressions are not evaluated at write time
Accordingly, 23502, 23505, and 23514 are not raised by DML in the current release; enforcement will be added in a future phase. The exception is PRIMARY KEY uniqueness, which is enforced via the B+ tree index.
23505 — unique_violation
A row with the same value already exists in a column or set of columns declared UNIQUE or PRIMARY KEY.
CREATE TABLE users (email TEXT NOT NULL UNIQUE);
INSERT INTO users VALUES ('alice@example.com');
INSERT INTO users VALUES ('alice@example.com'); -- ERROR 23505
The error message identifies both the index and the offending value:
Duplicate entry 'alice@example.com' for key 'users_email_uq'
The detail field (available in JSON format) provides a PostgreSQL-style message:
Key (value)=(alice@example.com) is already present in index users_email_uq.
Typical application response: Show “An account with this email already exists.”
try:
db.execute("INSERT INTO users (email) VALUES (?)", [email])
except AxiomDbError as e:
if e.sqlstate == '23505':
return {"error": "Email already taken"}
raise
23503 — foreign_key_violation
Child insert / update — parent key does not exist
An INSERT or UPDATE references a value in the FK column that has no matching row in the parent table.
INSERT INTO orders (user_id, total) VALUES (99999, 100);
-- ERROR 23503: Foreign key constraint fails: 'orders.user_id' = '99999'
Typical response: Validate that the referenced entity exists before inserting, or surface “Referenced record not found.”
Parent delete — children still reference it (RESTRICT / NO ACTION)
A DELETE on the parent table was blocked because child rows reference the row
being deleted and the FK action is RESTRICT or NO ACTION (the default).
-- orders.user_id REFERENCES users(id) ON DELETE RESTRICT
DELETE FROM users WHERE id = 1;
-- ERROR 23503: foreign key constraint "fk_orders_user": orders.user_id references this row
Typical response: Either delete child rows first, use ON DELETE CASCADE, or
prevent parent deletion in the application layer.
Cascade depth exceeded
A chain of ON DELETE CASCADE constraints exceeded the maximum depth of 10 levels.
-- If table chain A→B→C→...→K (11 levels all with CASCADE) and you delete from A:
DELETE FROM a WHERE id = 1;
-- ERROR 23503: foreign key cascade depth exceeded limit of 10
Typical response: Restructure the schema to reduce cascade depth, or perform the deletes manually level-by-level.
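The depth check amounts to a bounded walk over the cascade graph. A sketch (the schema shape, names, and exact counting of "levels" are my assumptions, not the engine's internal representation):

```python
MAX_CASCADE_DEPTH = 10  # documented limit

def cascade_delete(cascade_children: dict, table: str, level: int = 1) -> list:
    """Collect the tables a delete would touch via ON DELETE CASCADE edges,
    raising once the chain grows past MAX_CASCADE_DEPTH levels."""
    if level > MAX_CASCADE_DEPTH:
        raise RuntimeError("23503: foreign key cascade depth exceeded limit of 10")
    touched = [table]
    for child in cascade_children.get(table, []):
        touched += cascade_delete(cascade_children, child, level + 1)
    return touched

# a -> b -> c: three levels, well within the limit
chain = {"a": ["b"], "b": ["c"]}
print(cascade_delete(chain, "a"))  # ['a', 'b', 'c']
```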
SET NULL on a NOT NULL column
ON DELETE SET NULL is defined on a foreign key column that was declared NOT NULL.
-- orders.user_id is NOT NULL, but ON DELETE SET NULL is declared
DELETE FROM users WHERE id = 1;
-- ERROR 23503: cannot set FK column orders.user_id to NULL: column is NOT NULL
Typical response: Either remove the NOT NULL constraint from the FK column,
or change the action to ON DELETE RESTRICT or ON DELETE CASCADE.
23502 — not_null_violation
An INSERT or UPDATE attempted to store NULL in a NOT NULL column.
INSERT INTO users (name, email) VALUES (NULL, 'bob@example.com');
-- ERROR 23502: null value in column "name" violates not-null constraint
Typical application response: Validate required fields on the client before submitting.
23514 — check_violation
A row failed a CHECK constraint.
INSERT INTO products (name, price) VALUES ('Widget', -5.00);
-- ERROR 23514: new row for relation "products" violates check constraint "chk_price_positive"
Startup / Open Errors
These errors happen before a SQL statement runs. They are returned by
Db::open(...), Db::open_dsn(...), AsyncDb::open(...), or server startup,
so there is no SQLSTATE-bearing result set yet.
IndexIntegrityFailure — open refused because an index is not trustworthy
On every open, AxiomDB now verifies each catalog-visible index against the heap-visible rows reconstructed after WAL recovery.
- If an index is readable but missing entries or contains extra entries, AxiomDB rebuilds it automatically before accepting traffic.
- If the index tree cannot be traversed safely, open fails with
DbError::IndexIntegrityFailure.
Example Rust handling:
#![allow(unused)]
fn main() -> Result<(), axiomdb_core::DbError> {
    match axiomdb_embedded::Db::open("./data.db") {
        Ok(db) => { /* ready */ }
        Err(axiomdb_core::DbError::IndexIntegrityFailure { table, index, reason }) => {
            eprintln!("database refused to open: {table}.{index}: {reason}");
        }
        Err(other) => return Err(other),
    }
    Ok(())
}
This policy follows PostgreSQL amcheck's “fail if the tree is not safely readable”
discipline, but borrows SQLite's “rebuild from table contents” recovery idea for readable
divergence. A readable-but-wrong index is rebuilt automatically; an unreadable tree blocks open.
Cardinality Errors (Class 21)
21000 — cardinality_violation
A scalar subquery returned more than one row. Scalar subqueries (a SELECT used
where a single value is expected) must return exactly one row. Zero rows yield
NULL; more than one row is an error.
-- Suppose users contains Alice and Bob
SELECT (SELECT name FROM users) AS single_name FROM orders;
-- ERROR 21000: subquery must return exactly one row, but returned 2 rows
Fix: add a WHERE condition that makes the result unique, or use LIMIT 1
if you intentionally want only the first row:
-- Safe: guaranteed single row via primary key
SELECT (SELECT name FROM users WHERE id = o.user_id) AS customer_name
FROM orders o;
-- Safe: explicit LIMIT 1 when you want "any one" result
SELECT (SELECT name FROM users ORDER BY created_at LIMIT 1) AS oldest_user
FROM config;
try:
db.execute("SELECT (SELECT name FROM users) FROM orders")
except AxiomDbError as e:
if e.sqlstate == '21000':
# The subquery returned multiple rows — add a WHERE clause
...
Undefined Object Errors (Class 42)
These errors indicate a reference to an object (table, column, index) that does not exist in the catalog. They are typically programming errors caught in development.
42P01 — undefined_table
A statement referenced a table or view that does not exist.
SELECT * FROM nonexistent_table;
-- ERROR 42P01: relation "nonexistent_table" does not exist
42703 — undefined_column
A statement referenced a column that does not exist in the specified table.
SELECT typo_column FROM users;
-- ERROR 42703: column "typo_column" does not exist in table "users"
42P07 — duplicate_table
CREATE TABLE was called for a table that already exists (without IF NOT EXISTS).
CREATE TABLE users (...);
CREATE TABLE users (...);
-- ERROR 42P07: relation "users" already exists
42701 — duplicate_column
ALTER TABLE ... ADD COLUMN was called for a column that already exists in
the table.
CREATE TABLE users (id BIGINT PRIMARY KEY, email TEXT NOT NULL);
ALTER TABLE users ADD COLUMN email TEXT;
-- ERROR 42701: column "email" already exists in table "users"
Fix: Use a different column name, or check the current schema with
DESCRIBE users before adding the column.
42702 — ambiguous_column
An unqualified column name appears in multiple tables in the FROM clause.
-- Both users and orders have a column named "id"
SELECT id FROM users JOIN orders ON orders.user_id = users.id;
-- ERROR 42702: column reference "id" is ambiguous
-- Fix: qualify the column
SELECT users.id FROM users JOIN orders ON orders.user_id = users.id;
Database Catalog Errors
These errors are surfaced primarily through the MySQL wire protocol when a
client uses CREATE DATABASE, DROP DATABASE, USE, the handshake database,
or COM_INIT_DB.
1049 (42000) — Unknown database
The requested database does not exist in the persisted catalog.
USE missing_db;
-- ERROR 1049 (42000): Unknown database 'missing_db'
This same error is returned if a client connects with database=missing_db in
the initial MySQL handshake.
Fix: create the database first with CREATE DATABASE missing_db, or switch
to an existing one from SHOW DATABASES.
1007 (HY000) — Database already exists
CREATE DATABASE was called for a name already present in the catalog.
CREATE DATABASE analytics;
CREATE DATABASE analytics;
-- ERROR 1007 (HY000): Can't create database 'analytics'; database exists
Fix: choose a different name, or treat the existing database as reusable.
1105 (HY000) — Active database cannot be dropped
The current connection attempted to drop the database it has selected.
USE analytics;
DROP DATABASE analytics;
-- ERROR 1105 (HY000): Can't drop database 'analytics'; database is currently selected
Fix: switch to another database such as axiomdb, then run DROP DATABASE.
Transaction Errors (Class 40)
40001 — serialization_failure
A concurrent write conflict was detected. The transaction must be retried.
-- Two transactions try to update the same row simultaneously.
-- The second one receives:
-- ERROR 40001: could not serialize access due to concurrent update
The application must catch this and retry the transaction. This is normal and expected behavior under high concurrency, not a bug.
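A retry wrapper for 40001 might look like the following sketch (the exception class and driver interface are assumptions; adapt them to your client library):

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for a driver exception carrying sqlstate == '40001'."""
    sqlstate = "40001"

def run_with_retry(txn_fn, max_attempts=5, base_delay=0.01):
    """Run txn_fn, retrying on serialization failure with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return txn_fn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise  # give up after max_attempts
            time.sleep(base_delay * (2 ** attempt) * random.random())

calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SerializationFailure()  # simulate two concurrent-update conflicts
    return "committed"

print(run_with_retry(flaky_txn))  # committed
```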
40P01 — deadlock_detected
Two transactions are each waiting for a lock held by the other.
-- Txn A holds lock on row 1, waiting for row 2
-- Txn B holds lock on row 2, waiting for row 1
-- → AxiomDB detects the cycle and aborts one transaction with 40P01
-- ERROR 40P01: deadlock detected
Prevention: Access rows in a consistent order across all transactions. If you always acquire locks on (accounts with lower id) before (accounts with higher id), deadlocks cannot form between two such transactions.
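The ordering rule is mechanical. A sketch, with account ids standing in for whatever rows a transaction locks:

```python
def lock_order(*row_ids):
    """Return ids in the canonical (ascending) order in which locks must be
    acquired, so that any two transactions agree on the acquisition order."""
    return sorted(row_ids)

# Both transfers acquire locks in the same order, so neither can end up
# holding one lock while waiting on the other:
print(lock_order(42, 7))  # [7, 42]  (transfer 42 -> 7)
print(lock_order(7, 42))  # [7, 42]  (transfer 7 -> 42)
```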
I/O and System Errors (Class 58)
58030 — io_error
The storage engine encountered an operating system I/O error.
ERROR 58030: could not write to file "axiomdb.db": No space left on device
Possible causes:
- Disk full — free space or expand the volume
- File permissions — ensure the AxiomDB process can write to the data directory
- Hardware error — check dmesg / system logs for disk errors
Syntax and Parse Errors (Class 42)
42601 — syntax_error
The SQL statement is not syntactically valid.
SELECT FORM users; -- 'FORM' is not a keyword
-- ERROR 42601: syntax error at or near "FORM"
-- Position: 8
42883 — undefined_function
A function name was called that does not exist.
SELECT unknown_function(1);
-- ERROR 42883: function "unknown_function" does not exist
Data Errors (Class 22)
22001 — string_data_right_truncation
A TEXT or VARCHAR value exceeds the column’s declared length.
CREATE TABLE codes (code CHAR(3));
INSERT INTO codes VALUES ('TOOLONG');
-- ERROR 22001: value too long for type CHAR(3)
22003 — numeric_value_out_of_range
A numeric value exceeds the range of its declared type.
INSERT INTO users (age) VALUES (99999); -- age is SMALLINT
-- ERROR 22003: integer out of range for type SMALLINT
22012 — division_by_zero
Division by zero in an arithmetic expression.
SELECT 10 / 0;
-- ERROR 22012: division by zero
22018 — invalid_character_value_for_cast
A value cannot be implicitly coerced to the target type. This error is raised when AxiomDB is in strict mode (the default) and a conversion is attempted that would discard data or is not defined.
-- Text with non-numeric characters inserted into an INT column (strict mode):
INSERT INTO users (age) VALUES ('42abc');
-- ERROR 22018: cannot coerce '42abc' (Text) to INT: '42abc' is not a valid integer
-- A type pair with no implicit conversion:
SELECT 3.14 + DATE '2026-01-01';
-- ERROR 22018: cannot coerce 3.14 (Real) to Date: no implicit numeric promotion between these types
Hint: Use explicit CAST for conversions that AxiomDB does not apply
automatically:
INSERT INTO users (age) VALUES (CAST('42' AS INT)); -- explicit — always works
SELECT CAST(3 AS REAL) + 1.5; -- explicit widening
Permissive mode: if your application requires MySQL-style lenient coercion
('42abc' silently converted to 42), disable strict mode for the session:
SET strict_mode = OFF; -- or: SET sql_mode = ''
In permissive mode, failed coercions fall back to a best-effort conversion and
emit warning 1265 instead of returning 22018. Use SHOW WARNINGS after
bulk loads to audit any truncated values. See
Strict Mode for full details.
Implicit coercions that always succeed (no error)
The following conversions happen automatically without raising 22018:
| From | To | Example |
|---|---|---|
| INT | BIGINT | 1 + 9999999999 → BIGINT |
| INT | REAL | 5 + 1.5 → Real(6.5) |
| INT | DECIMAL | 2 + 3.14 → Decimal(5.14) |
| BIGINT | REAL | 100 + 1.5 → Real(101.5) |
| BIGINT | DECIMAL | 100 + 3.14 → Decimal(103.14) |
| BIGINT | INT | only if value fits in INT range |
| TEXT | INT / BIGINT | '42' → 42 (strict: entire string must be a number) |
| TEXT | REAL | '3.14' → 3.14 |
| TEXT | DECIMAL | '3.14' → Decimal(314, 2) |
| DATE | TIMESTAMP | midnight UTC of the given date |
| NULL | any | always passes through as NULL |
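The strict-vs-permissive difference for TEXT → INT can be sketched in a few lines (function names are illustrative, not engine internals):

```python
def text_to_int_strict(s: str) -> int:
    """Strict mode: the entire string must parse as an integer, else 22018."""
    try:
        return int(s.strip())
    except ValueError:
        raise ValueError(f"22018: cannot coerce '{s}' (Text) to INT")

def text_to_int_permissive(s: str) -> int:
    """Permissive mode: best-effort prefix conversion (MySQL-style), never errors."""
    s = s.strip()
    i = 1 if s[:1] in ("+", "-") else 0
    while i < len(s) and s[i].isdigit():
        i += 1
    prefix = s[:i]
    return int(prefix) if prefix.lstrip("+-") else 0

print(text_to_int_strict("42"))         # 42
print(text_to_int_permissive("42abc"))  # 42
print(text_to_int_permissive("abc"))    # 0
```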
Connection Protocol Errors (Class 08)
MySQL 1153 / 08S01 — ER_NET_PACKET_TOO_LARGE
Returned when an incoming MySQL logical command payload exceeds the connection’s
current max_allowed_packet limit.
ERROR 1153 (08S01): Got a packet bigger than 'max_allowed_packet' bytes
What triggers it:
- A COM_QUERY whose SQL text exceeds @@max_allowed_packet bytes.
- A COM_STMT_PREPARE or COM_STMT_EXECUTE packet above the limit.
- A HandshakeResponse41 above the default 64 MiB limit (rare in practice).
- A multi-packet logical command whose total reassembled payload exceeds the limit, even if each individual physical fragment is below the limit.
What happens after the error: The server closes the connection immediately. The stream cannot be safely reused because the framing layer cannot determine where the next command begins.
Fix: Raise max_allowed_packet before sending the large command:
SET max_allowed_packet = 134217728; -- 128 MiB
Or reconnect after the error — the new connection starts with the server default.
SET max_allowed_packet affects only the current connection. Use it before
any statement whose payload may be large (e.g., bulk INSERT with many values, or
a BLOB upload via COM_STMT_EXECUTE).
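Client-side, you can pre-check the payload size before sending a bulk statement. A sketch (the 64 MiB default here is an assumption; query the server's actual @@max_allowed_packet in real code):

```python
DEFAULT_MAX_ALLOWED_PACKET = 64 * 1024 * 1024  # assumed default; verify via @@max_allowed_packet

def com_query_payload_len(sql: str) -> int:
    """A COM_QUERY logical payload is 1 command byte plus the UTF-8 SQL text."""
    return 1 + len(sql.encode("utf-8"))

def fits(sql: str, limit: int = DEFAULT_MAX_ALLOWED_PACKET) -> bool:
    return com_query_payload_len(sql) <= limit

print(fits("SELECT 1"))           # True
print(fits("x" * 100, limit=50))  # False
```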
Disk-Full Errors (Class 53)
53100 — disk_full
Returned when the OS reports that the volume is full (ENOSPC) or over quota
(EDQUOT) during a durable write — a WAL append, WAL fsync, storage grow, or
mmap flush.
ERROR 53100: disk full during 'wal commit fsync': no space left on device
HINT: The database volume is full or over quota. Free disk space and restart
the server to restore write access. The database is now in read-only
degraded mode.
What happens after the error:
AxiomDB enters read-only degraded mode immediately. In this mode:
| Statement type | Allowed? |
|---|---|
| SELECT, SHOW, EXPLAIN | ✅ Yes |
| SET (session variables) | ✅ Yes |
| INSERT, UPDATE, DELETE, TRUNCATE | ❌ No — returns 53100 |
| CREATE TABLE, DROP TABLE, DDL | ❌ No — returns 53100 |
| BEGIN, COMMIT, ROLLBACK | ❌ No — returns 53100 |
The mode persists until the server process is restarted. There is no way to return to read-write mode without restarting.
Fix:
- Free disk space or remove the quota restriction.
- Restart the server — AxiomDB will reopen in read-write mode if space is available.
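An application can mirror the allowed/blocked split to fail fast once it sees 53100. A sketch (classification by first keyword, deliberately simplistic):

```python
DEGRADED_MODE_ALLOWED = {"SELECT", "SHOW", "EXPLAIN", "SET"}

def allowed_in_degraded_mode(sql: str) -> bool:
    """True if the statement can still run once the server is read-only (53100)."""
    first = sql.lstrip().split(None, 1)[0].upper() if sql.strip() else ""
    return first in DEGRADED_MODE_ALLOWED

print(allowed_in_degraded_mode("SELECT * FROM t"))          # True
print(allowed_in_degraded_mode("INSERT INTO t VALUES (1)")) # False
```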
Complete SQLSTATE Reference
| SQLSTATE | Name | Common Cause |
|---|---|---|
| 21000 | cardinality_violation | Scalar subquery returned more than 1 row |
| 23505 | unique_violation | Duplicate value in UNIQUE / PK column |
| 23503 | foreign_key_violation | Referencing non-existent FK target |
| 23502 | not_null_violation | NULL inserted into NOT NULL column |
| 23514 | check_violation | Row failed a CHECK constraint |
| 40001 | serialization_failure | Write-write conflict; retry the txn |
| 40P01 | deadlock_detected | Circular lock dependency |
| 42P01 | undefined_table | Table does not exist |
| 42703 | undefined_column | Column does not exist |
| 42702 | ambiguous_column | Unqualified column name matches in 2+ tables |
| 42P07 | duplicate_table | Table already exists |
| 42701 | duplicate_column | Column already exists in table |
| 42601 | syntax_error | Malformed SQL |
| 42883 | undefined_function | Unknown function name |
| 22001 | string_data_right_truncation | Value too long for column type |
| 22003 | numeric_value_out_of_range | Number exceeds type bounds |
| 22012 | division_by_zero | Division by zero in expression |
| 22018 | invalid_character_value_for_cast | Implicit type coercion failed |
| 22P02 | invalid_text_representation | Invalid literal value |
| 42501 | insufficient_privilege | Permission denied on object |
| 42804 | datatype_mismatch | Type mismatch in expression |
| 25001 | active_sql_transaction | BEGIN inside an active transaction |
| 25P01 | no_active_sql_transaction | COMMIT/ROLLBACK with no active transaction |
| 25006 | read_only_sql_transaction | Transaction expired |
| 0A000 | feature_not_supported | SQL feature not yet implemented |
| 08S01 | connection_failure (MySQL ext) | Incoming packet exceeds max_allowed_packet |
| 53100 | disk_full | Storage volume is full |
| 58030 | io_error | OS-level I/O failure (disk, permissions) |
Performance
AxiomDB is designed to outperform MySQL on specific workloads by eliminating several layers of redundant work: double-buffering, the double-write buffer, row-by-row query evaluation, and thread-per-connection overhead. This page presents current benchmark numbers and guidance on how to write queries and schemas that stay fast.
Benchmark Results
All benchmarks run on Apple M2 Pro (12 cores), 32 GB RAM, NVMe SSD, single-threaded, warm data (all pages in OS page cache unless noted).
SQL Parser Throughput
| Query type | AxiomDB (logos lexer) | MySQL (approx.) | PostgreSQL (approx.) | Ratio vs MySQL |
|---|---|---|---|---|
| Simple SELECT (1 tbl) | 492 ns | ~500 ns | ~450 ns | 1.0× (parity) |
| Complex SELECT (JOINs) | 2.7 µs | ~4.0 µs | ~3.5 µs | 1.5× faster |
| DDL (CREATE TABLE) | 1.1 µs | ~2.5 µs | ~2.0 µs | 2.3× faster |
| Batch (100 stmts) | 47 µs | ~90 µs | ~75 µs | 1.9× faster |
Compared to sqlparser-rs (the common Rust SQL parser library):
| Query type | AxiomDB | sqlparser-rs | Ratio |
|---|---|---|---|
| Simple SELECT | 492 ns | 4.8 µs | 9.8× faster |
| Complex SELECT | 2.7 µs | 46 µs | 17× faster |
The speed advantage comes from two decisions:
- logos DFA lexer — compiles the token patterns to a Deterministic Finite Automaton at compile time. Token scanning is O(n) with a very small constant.
- Zero-copy tokens — Ident and QuotedIdent tokens are &'src str slices into the original input. No heap allocation occurs during lexing.
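The effect of a single-pass, non-allocating scan can be shown in miniature. This toy uses Python's regex engine rather than a compile-time DFA, and records (kind, start, end) spans into the source instead of copying token text, analogous to the &'src str slices:

```python
import re

TOKEN = re.compile(r"\s*(?:(?P<num>\d+)|(?P<ident>[A-Za-z_][A-Za-z_0-9]*)|(?P<op>[^\s\w]))")

def lex(sql: str):
    """One left-to-right pass; each token is a (kind, start, end) span into sql."""
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN.match(sql, pos)
        if m is None:
            break  # unrecognized input or trailing whitespace
        kind = m.lastgroup
        tokens.append((kind, m.start(kind), m.end(kind)))
        pos = m.end()
    return tokens

src = "SELECT id FROM t"
spans = lex(src)
print([src[s:e] for _, s, e in spans])  # ['SELECT', 'id', 'FROM', 't']
```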
Storage Engine Throughput
| Operation | AxiomDB | Target | Max acceptable | Status |
|---|---|---|---|---|
| B+ Tree point lookup (1M) | 1.2M ops/s | 800K ops/s | 600K ops/s | ✅ |
| Range scan 10K rows | 0.61 ms | 45 ms | 60 ms | ✅ |
| B+ Tree INSERT (storage only) | 195K ops/s | 180K ops/s | 150K ops/s | ✅ |
| Sequential scan 1M rows | 0.72 s | 0.8 s | 1.2 s | ✅ |
| Concurrent reads ×16 | linear | linear | <2× degradation | ✅ |
Wire Protocol Throughput (Phase 5.14)
End-to-end throughput measured via the MySQL wire protocol (pymysql client, autocommit mode, 1 connection, localhost). Includes: network round-trip, protocol encode/decode, parse, analyze, execute, WAL, MmapStorage.
| Operation | Throughput | Notes |
|---|---|---|
| COM_PING | 24,865 pings/s | Pure protocol overhead baseline |
| SET NAMES (intercepted) | 46,672 q/s | Handled in protocol layer, no SQL engine |
| SELECT 1 (autocommit) | 185 q/s | Full SQL pipeline, read-only |
| INSERT (autocommit, 1 fsync/stmt) | 58 q/s | Full SQL pipeline + fsync for durability |
The 185 q/s SELECT result reflects a 3.3× improvement in Phase 5.14 over the prior 56 q/s baseline. Read-only transactions (SELECT, SHOW, etc.) no longer fsync the WAL — see Benchmarks → Phase 5.14 for the technical explanation.
Remaining bottlenecks:
- INSERT (single connection): one fdatasync per autocommit statement; enable Group Commit for concurrent workloads (see below)
Primary-Key Lookups After 6.16
Phase 6.16 removes the planner blind spot that still treated WHERE id = ...
as a scan on PK-only tables. The PRIMARY KEY B+Tree is now used for single-table
equality and range lookups.
Measured with python3 benches/comparison/local_bench.py --scenario select_pk --rows 5000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| SELECT * FROM bench_users WHERE id = literal | 12.7K lookups/s | 13.4K lookups/s | 11.1K lookups/s |
The old debt was “planner never reaches the PK B+Tree”. That is now closed. The remaining gap is smaller and sits after planning: row materialization and MySQL packet serialization still cost more than MariaDB/MySQL on this path.
DELETE WHERE / UPDATE After 5.20
Phase 5.19 removed the old-key delete bottleneck for DELETE ... WHERE and the
old-key half of UPDATE. Phase 5.20 finishes the real UPDATE fix for the
benchmark schema by preserving the heap RecordId when the new row fits in the
same slot, which makes selective index skipping correct.
Measured with python3 benches/comparison/local_bench.py --scenario all --rows 50000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 |
|---|---|---|---|---|
| DELETE WHERE id > 25000 | 652K rows/s | 662K rows/s | 1.13M rows/s | 3.76M rows/s |
| UPDATE ... WHERE active = TRUE | 662K rows/s | 404K rows/s | 648K rows/s | 270K rows/s |
Compared to the 4.6K rows/s pre-5.19 DELETE-WHERE baseline that originally
triggered this work, AxiomDB now stays in the same order of magnitude as MySQL
and MariaDB on the same local benchmark. More importantly, compared to the
52.9K rows/s post-5.19 / pre-5.20 UPDATE baseline, the stable-RID path
raises AxiomDB UPDATE throughput to 648K rows/s on the same 50K-row benchmark.
The main remaining write-path bottleneck is now INSERT, not UPDATE.
Indexed UPDATE ... WHERE After 6.20
Phase 6.17 removed the old full-scan candidate discovery path for indexed
UPDATE predicates. Phase 6.20 then removed the dominant apply-side costs on
the default PK-range benchmark: candidate heap reads are batched by page,
no-op rows skip physical mutation, stable-RID rewrites batch their WAL append,
and index maintenance only runs when a key, predicate membership, or RID really
changes.
Measured with python3 benches/comparison/local_bench.py --scenario update_range --rows 5000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| UPDATE bench_users SET score = score + 1 WHERE id BETWEEN ... | 618K rows/s | 291K rows/s | 369.9K rows/s |
Compared to the 6.17 result (85.2K rows/s), the 6.20 apply fast path is a
4.3x improvement on the same benchmark and now exceeds the documented local
MySQL result. The remaining gap is specifically MariaDB’s tighter clustered-row
update path, not AxiomDB’s old discovery-side O(n) scan.
INSERT in Explicit Transactions After 5.21
Phase 5.21 adds transactional INSERT staging for consecutive
INSERT ... VALUES statements inside one explicit transaction. Instead of
writing heap + WAL + index roots per statement, AxiomDB now buffers eligible
rows and flushes them together on COMMIT or the next barrier statement.
Measured with python3 benches/comparison/local_bench.py --scenario insert --rows 50000 --table
on the same machine:
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| 50K single-row INSERTs in 1 explicit txn | 28.0K rows/s | 26.7K rows/s | 23.9K rows/s |
PostgreSQL's heap_multi_insert() and DuckDB's appender both separate row
production from physical write. AxiomDB adapts that idea to SQL-visible transactions:
the connection keeps staged INSERT rows in memory, then flushes them in one grouped
heap/index pass when SQL semantics require visibility.
This path targets one specific workload: many separate INSERT statements inside
BEGIN ... COMMIT. Autocommit throughput remains a different problem and
depends on the server-side fsync path.
Multi-row INSERT on Indexed Tables After 6.18
Phase 6.18 fixes the immediate multi-row VALUES path for indexed tables. A
statement such as:
INSERT INTO bench_users VALUES
(1, 'u1', 18, TRUE, 100.0, 'u1@b.local'),
(2, 'u2', 19, FALSE, 100.1, 'u2@b.local'),
(3, 'u3', 20, TRUE, 100.2, 'u3@b.local');
now uses grouped heap/index apply even when the target table has a PRIMARY KEY
or secondary indexes. Before 6.18, that path still fell back to per-row
maintenance on indexed tables.
Measured with python3 benches/comparison/local_bench.py --scenario insert_multi_values --rows 5000 --table
on the benchmark schema with PRIMARY KEY (id):
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| insert_multi_values on PK table | 160,581 rows/s | 259,854 rows/s | 321,002 rows/s |
Prefer a single multi-row INSERT ... VALUES (...), (...) statement over many one-row INSERTs. This now benefits indexed tables too, while still rejecting duplicate PRIMARY KEY / UNIQUE values inside the same statement.
Prepared Statement Plan Cache (Phase 5.13)
COM_STMT_PREPARE compiles the SQL once (parse + analyze). Every subsequent
COM_STMT_EXECUTE reuses the compiled plan — no re-parsing, no catalog scan:
| Path | Per-execute cost |
|---|---|
| COM_QUERY (plain string) | parse + analyze + execute (~5 ms) |
| COM_STMT_EXECUTE — plan valid | substitute params + execute (~0.1 ms) — 50× faster |
| COM_STMT_EXECUTE — after DDL | re-analyze once, then fast path resumes |
Schema invalidation (correctness guarantee): after ALTER TABLE, DROP TABLE,
CREATE INDEX, etc., the cached plan is re-analyzed automatically on the next execute.
The schema_version counter in Database increments on every successful DDL; each
connection polls it lock-free (Arc<AtomicU64>) before each execute.
LRU eviction: each connection caches up to max_prepared_stmts_per_connection
(default 1024) compiled plans. The least-recently-used plan is evicted silently when
the limit is reached. Configurable in axiomdb.toml.
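The described behavior, an LRU bound plus a schema-version check, fits in a few lines. A sketch with assumed names:

```python
from collections import OrderedDict

class PlanCache:
    """Per-connection plan cache sketch: LRU eviction + schema_version invalidation."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._plans = OrderedDict()  # sql -> (schema_version, compiled_plan)

    def get(self, sql, schema_version, compile_fn):
        entry = self._plans.get(sql)
        if entry is not None and entry[0] == schema_version:
            self._plans.move_to_end(sql)       # fast path: reuse the compiled plan
            return entry[1]
        plan = compile_fn(sql)                 # first use, or stale after DDL
        self._plans[sql] = (schema_version, plan)
        self._plans.move_to_end(sql)
        if len(self._plans) > self.capacity:
            self._plans.popitem(last=False)    # evict the least-recently-used plan
        return plan

compiles = []
cache = PlanCache(capacity=2)
plan = lambda sql: compiles.append(sql) or f"plan({sql})"
cache.get("SELECT 1", 7, plan)  # compiled
cache.get("SELECT 1", 7, plan)  # cached, no recompile
cache.get("SELECT 1", 8, plan)  # schema changed, recompiled once
print(len(compiles))  # 2
```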
WAL Fsync Pipeline (6.19, closed with a documented gap)
Phase 6.19 replaced the old timer-based CommitCoordinator with an always-on
leader-based WAL fsync pipeline. The runtime behavior changed, but the key
single-connection autocommit benchmark remains a documented gap.
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_autocommit --rows 1000 --table --engines axiomdb
Current result:
| Benchmark | AxiomDB | Target | Status |
|---|---|---|---|
| insert_autocommit | 224 ops/s | >= 5,000 ops/s | ❌ |
The group_commit_lock design inspired the leader-based pipeline, and it does remove the old timer window. But under a strict MySQL request/response client, the server still waits for durability before sending OK, so the next statement cannot arrive while the fsync is in flight. The batching primitive is therefore correct, but it does not solve the sequential single-client benchmark by itself.
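The batching idea, minus threads, looks like this deliberately single-threaded sketch of the leader's role (names are mine):

```python
class FsyncPipeline:
    """Leader-based group commit sketch: committers queue; whoever leads the
    next flush pays one fsync for every transaction queued behind it."""

    def __init__(self):
        self.queue = []
        self.fsync_count = 0

    def submit(self, txn_id):
        self.queue.append(txn_id)

    def leader_flush(self):
        """One durable fsync acknowledges the whole batch."""
        if not self.queue:
            return []
        self.fsync_count += 1  # a single fdatasync covers the batch
        batch, self.queue = self.queue, []
        return batch

p = FsyncPipeline()
for t in ("t1", "t2", "t3"):
    p.submit(t)
print(p.leader_flush(), p.fsync_count)  # ['t1', 't2', 't3'] 1
```

A single synchronous client never queues more than one transaction at a time, which is why the sequential autocommit benchmark sees no benefit from batching.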
End-to-End INSERT Throughput
Full pipeline: parse → analyze → execute → WAL → MmapStorage. Measured with
executor_e2e benchmark (MmapStorage + real WAL, release build, Apple M2 Pro NVMe).
| Configuration | AxiomDB | MariaDB ~ | Status |
|---|---|---|---|
| INSERT 10K rows / N separate SQL strings / 1 txn | 35K rows/s | 140K rows/s | ⚠️ |
| INSERT 10K rows / 1 multi-row SQL string | 211K rows/s | 140K rows/s | ✅ 1.5× faster |
| INSERT autocommit (1 visible commit/stmt, wire protocol) | 224 q/s | — | ⚠️ (closed subphase, open perf gap) |
With a single multi-row statement, INSERT INTO t VALUES (r1),(r2),...,(rN), AxiomDB reaches 211K rows/s
vs MariaDB's ~140K rows/s — 1.5× faster on bulk inserts. The gap comes
from three combined optimizations: O(P) heap writes via HeapChain::insert_batch,
O(1) WAL writes via record_insert_batch (Phase 3.17), and a single
parse+analyze pass for all N rows (Phase 4.16c). MariaDB pays a clustered B-Tree insert
per row plus UNDO log write before each page modification.
How to achieve this throughput in your application:
-- Fast: one SQL string with N value rows (211K rows/s)
INSERT INTO orders (user_id, amount) VALUES
(1, 49.99), (2, 12.50), (3, 99.00), -- ... up to thousands of rows
(1000, 7.99);
-- Slower: N separate INSERT strings (35K rows/s — parse+analyze per row)
INSERT INTO orders VALUES (1, 49.99);
INSERT INTO orders VALUES (2, 12.50);
-- ...
The difference between the two approaches is 6× in throughput. The bottleneck in the per-string case is parse + analyze overhead per SQL string (~20 µs/string), not the storage write.
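A loader can build the fast form programmatically. A sketch (real code should use parameter binding or proper SQL escaping; repr-quoting here is only for illustration):

```python
def multi_row_inserts(table, columns, rows, rows_per_stmt=1000):
    """Yield multi-row INSERT statements, chunked to bound statement size."""
    col_list = ", ".join(columns)
    for i in range(0, len(rows), rows_per_stmt):
        chunk = rows[i:i + rows_per_stmt]
        # NOTE: repr() is not safe SQL quoting; use placeholders in production.
        values = ", ".join("(" + ", ".join(map(repr, row)) + ")" for row in chunk)
        yield f"INSERT INTO {table} ({col_list}) VALUES {values};"

stmts = list(multi_row_inserts(
    "orders", ["user_id", "amount"],
    [(1, 49.99), (2, 12.5), (3, 99.0)], rows_per_stmt=2,
))
print(len(stmts))  # 2
print(stmts[0])    # INSERT INTO orders (user_id, amount) VALUES (1, 49.99), (2, 12.5);
```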
Four-Engine Native Benchmark (2026-03-24)
All four engines measured locally on Apple M2 Pro, same machine, no Docker overhead,
10,000-row table (id BIGINT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100),
value INT). Each engine was given equivalent hardware resources.
Engines tested:
- MariaDB 12.1 — port 3306
- MySQL 8.0 — port 3310
- PostgreSQL 16 — port 5433
- AxiomDB — port 3309
| Operation | MariaDB 12.1 | MySQL 8.0 | PostgreSQL 16 | AxiomDB |
|---|---|---|---|---|
| INSERT batch (10K rows, 1 stmt) | 558 ms · 18K r/s | 628 ms · 16K r/s | 786 ms · 13K r/s | 275 ms · 36K r/s |
| SELECT * (10K rows, full scan) | 62 ms · 162K r/s | 53 ms · 189K r/s | 4 ms · 2.3M r/s | 47 ms · 212K r/s |
| DELETE (no WHERE, 10K rows) | 31 ms · 323K r/s | 407 ms · 25K r/s | 47 ms · 212K r/s | 9.6 ms · 1M r/s |
INSERT batch — 2× faster than MariaDB
AxiomDB reaches 36K r/s vs MariaDB’s 18K r/s (2× faster) and MySQL’s 16K r/s
(2.25× faster). The gap comes from the same three optimizations described above:
HeapChain::insert_batch() (O(P) page writes), record_insert_batch() (O(1) WAL
write), and a single parse+analyze pass for all N rows.
SELECT * — on par with MySQL, 11× behind PostgreSQL
AxiomDB SELECT (212K r/s) is marginally faster than MySQL 8.0 (189K r/s) and on par with the full-pipeline expectation. PostgreSQL’s 2.3M r/s reflects its shared buffer pool: after the first scan, all 10K rows fit in PostgreSQL’s hot in-memory buffer and subsequent queries never touch disk. AxiomDB’s mmap approach relies on the OS page cache for the same effect — the gap closes when pages are hot, but PostgreSQL’s buffer pool gives it an edge on repeated same-connection scans because it bypasses the OS cache layer entirely.
DELETE (no WHERE) — 3× faster than MariaDB, 40× faster than MySQL
AxiomDB deletes 10,000 rows in 9.6 ms (1M r/s). MariaDB takes 31 ms; MySQL 8.0 takes 407 ms. The AxiomDB advantage comes from two optimizations working together:
- `WalEntry::Truncate` — a single 51-byte WAL entry replaces 10,000 per-row `Delete` entries. MySQL InnoDB writes one undo log record per row before marking it deleted — for 10K rows this is 10K undo writes plus 10K page modifications.
- `HeapChain::delete_batch()` — groups deletions by page, reads each page once, marks all slots dead, writes back once. 10K rows across 50 pages = 100 page operations instead of 30,000.

Together, a full-table DELETE emits one `WalEntry::Truncate` and processes all deletions in O(P) page I/O, where P = number of pages ≈ 50 for 10K rows.
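The page-grouping step can be sketched as follows; the types and function are illustrative, not AxiomDB's actual API:

```rust
use std::collections::BTreeMap;

// Sketch of the page-grouping idea behind a batch delete: group record IDs
// by page so each page is read and written once, regardless of how many
// slots on it are being deleted.
type PageId = u64;
type SlotId = u16;

fn group_by_page(rids: &[(PageId, SlotId)]) -> BTreeMap<PageId, Vec<SlotId>> {
    let mut groups: BTreeMap<PageId, Vec<SlotId>> = BTreeMap::new();
    for &(page, slot) in rids {
        groups.entry(page).or_default().push(slot);
    }
    groups
}

fn main() {
    // 10,000 rows spread over 50 pages → 50 page groups,
    // i.e. 50 reads + 50 writes instead of 10,000 of each.
    let rids: Vec<(PageId, SlotId)> = (0..10_000)
        .map(|i| ((i / 200) as PageId, (i % 200) as SlotId))
        .collect();
    let groups = group_by_page(&rids);
    assert_eq!(groups.len(), 50);
    assert!(groups.values().all(|slots| slots.len() == 200));
}
```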
Row Codec Throughput
| Operation | Throughput | Notes |
|---|---|---|
| Encode row | 33M rows/s | 5-column row, mixed types |
| Decode row | 28M rows/s | Same row layout |
| encoded_len() | O(n) no alloc | Only computes the size, no buffer |
Row encoding is fast because:
- The codec iterates values once with a fixed dispatch per type.
- The null bitmap is written as bytes with bit shifts — no per-column branch on NULL.
- Variable-length types (Text, Bytes) use a 3-byte length prefix that avoids the 4-byte overhead of a full u32.
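The second and third points can be sketched in a few lines; the helper names are hypothetical:

```rust
// Sketch of the two encoding tricks described above: a null bitmap packed
// with bit shifts (no per-column branch), and a 3-byte (u24 little-endian)
// length prefix for variable-length values.
fn encode_null_bitmap(nulls: &[bool]) -> Vec<u8> {
    let mut bitmap = vec![0u8; (nulls.len() + 7) / 8];
    for (i, &is_null) in nulls.iter().enumerate() {
        // branch-free: the bool is cast and shifted into position
        bitmap[i / 8] |= (is_null as u8) << (i % 8);
    }
    bitmap
}

fn encode_u24_len(len: usize) -> [u8; 3] {
    assert!(len < 1 << 24);
    [len as u8, (len >> 8) as u8, (len >> 16) as u8] // little-endian u24
}

fn main() {
    // Columns 0 and 3 NULL out of 5 → one bitmap byte 0b0000_1001.
    assert_eq!(
        encode_null_bitmap(&[true, false, false, true, false]),
        vec![0b0000_1001]
    );
    // A 300-byte Text value: 3-byte prefix instead of a full 4-byte u32.
    assert_eq!(encode_u24_len(300), [0x2C, 0x01, 0x00]);
}
```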
Why AxiomDB Is Fast — Architecture Reasons
1. No Double-Buffering
MySQL InnoDB maintains its own Buffer Pool in addition to the OS page cache. The same data lives in RAM twice.
MySQL: Disk → OS page cache → InnoDB Buffer Pool → Query
(copy 1) (copy 2)
AxiomDB: Disk → OS page cache → Query
(mmap — single copy)
AxiomDB uses mmap to map the .db file directly. The OS page cache IS the
buffer. When a page is hot, it is served from L2/L3 cache with zero copies.
2. No Double-Write Buffer
MySQL writes each 16 KB page to a special “doublewrite buffer” area on disk before writing it to its actual location. This prevents torn-page corruption but costs two disk writes per page.
AxiomDB uses a WAL + per-page CRC32c checksum. The WAL record is small (tens of bytes for the changed key-value pair). On recovery, AxiomDB replays the WAL to reconstruct any page that has a checksum mismatch. No doublewrite buffer needed.
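The checksum side of this scheme can be illustrated with a minimal bitwise CRC-32C; this is a sketch of the torn-page detection idea, not AxiomDB's implementation:

```rust
// Minimal bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78) —
// the checksum family the per-page checksums use. Real implementations use
// table-driven or hardware (SSE4.2 crc32) variants for speed.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

fn main() {
    // Standard CRC-32C check value for the ASCII string "123456789".
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);

    // Recovery sketch: a page whose stored checksum no longer matches its
    // body is treated as torn and reconstructed by replaying the WAL.
    let stored = crc32c(&[0u8; 64]);
    let body_after_torn_write = [1u8; 64];
    assert_ne!(stored, crc32c(&body_after_torn_write)); // mismatch → replay WAL
}
```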
3. Lock-Free Concurrent Reads
The Copy-on-Write B+ Tree uses an AtomicU64 to store the root page ID. Readers
load the root pointer with Acquire semantics and traverse the tree without acquiring
any lock. Writers swap the root pointer with Release semantics after finishing the
copy chain.
A running SELECT does not stall any INSERT or UPDATE. Both proceed in parallel.
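The root-swap protocol can be sketched with plain std atomics (the type and method names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the lock-free root-pointer protocol described above: readers
// load the root page ID with Acquire and never take a lock; a writer
// publishes a new root with Release after finishing its copy chain.
struct CowTree {
    root: AtomicU64, // root page ID
}

impl CowTree {
    fn read_root(&self) -> u64 {
        // Acquire pairs with the writer's Release: everything the writer
        // wrote into the new copy chain is visible once we see the new root.
        self.root.load(Ordering::Acquire)
    }

    fn publish_root(&self, new_root: u64) {
        // Called only after all copied pages are fully written.
        self.root.store(new_root, Ordering::Release)
    }
}

fn main() {
    let tree = CowTree { root: AtomicU64::new(1) };
    assert_eq!(tree.read_root(), 1);
    tree.publish_root(2); // writer swaps in the new copy chain
    assert_eq!(tree.read_root(), 2);
}
```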
4. Async I/O with Tokio
The server mode uses Tokio async I/O. 1,000 concurrent connections run on approximately 8 OS threads. MySQL’s thread-per-connection model requires 1,000 OS threads for 1,000 connections, consuming ~8 GB in stack space alone.
Performance Budget
The following table defines the minimum acceptable performance for each critical operation. Benchmarks that fall below the “acceptable maximum” column are treated as blockers before any phase is closed.
| Operation | Target | Acceptable maximum |
|---|---|---|
| Point lookup (PK) | 800K ops/s | 600K ops/s |
| Range scan 10K rows | 45 ms | 60 ms |
| B+ Tree INSERT with WAL (storage only) | 180K ops/s | 150K ops/s |
| INSERT end-to-end 10K batch (Phase 8) | 180K ops/s | 150K ops/s |
| SELECT via wire protocol (autocommit) | — | — |
| INSERT via wire protocol (autocommit) | — | — |
| Sequential scan 1M rows | 0.8 s | 1.2 s |
| Concurrent reads ×16 | linear | <2× degradation |
| Parser (simple SELECT) | 600 ns | 1 µs |
| Parser (complex SELECT) | 3 µs | 6 µs |
Index Usage Guide
Rules of Thumb
- Every foreign key column needs an index — AxiomDB does not auto-index FK columns. Without an index, every FK check during DELETE/UPDATE scans the child table linearly.
- Put the most selective column first in composite indexes — a query filtering `WHERE user_id = 42 AND status = 'paid'` benefits most from `(user_id, status)` if `user_id` is more selective (fewer distinct values match).
- Covering indexes eliminate heap lookups — if all columns in a SELECT are in the index, AxiomDB returns results directly from the index without touching heap pages.
- Partial indexes reduce size — `CREATE INDEX ... WHERE deleted_at IS NULL` indexes only active rows. If 90% of rows are soft-deleted, the partial index is 10× smaller than a full index.
- BIGINT AUTO_INCREMENT beats UUID v4 for PK — UUID v4 inserts at random positions in the B+ Tree, causing ~40% more page splits than sequential integers. Use UUID v7 if you need UUIDs (time-sortable prefix).
Query Patterns to Avoid
Unindexed range scans on large tables
```sql
-- Slow: scans every row in orders (no index on placed_at)
SELECT * FROM orders WHERE placed_at > '2026-01-01';

-- Fix: create the index
CREATE INDEX idx_orders_date ON orders (placed_at);
```
Leading wildcard LIKE
```sql
-- Slow: cannot use index on 'name' (leading %)
SELECT * FROM users WHERE name LIKE '%smith%';

-- Better: full-text search index (planned Phase 8)
-- Acceptable workaround for small tables: use LOWER() + LIKE on indexed column
```
SELECT * with wide rows
```sql
-- Fetches all columns including large TEXT blobs for every row
SELECT * FROM documents WHERE category_id = 5;

-- Better: select only what the UI needs
SELECT id, title, created_at FROM documents WHERE category_id = 5;
```
NOT IN with nullable subquery
```sql
-- Returns 0 rows if the subquery contains a single NULL
SELECT * FROM orders WHERE user_id NOT IN (SELECT id FROM banned_users);

-- Fix: filter NULLs explicitly
SELECT * FROM orders WHERE user_id NOT IN (
    SELECT id FROM banned_users WHERE id IS NOT NULL
);
```
Measuring Performance
EXPLAIN (planned)
```sql
EXPLAIN SELECT * FROM orders WHERE user_id = 42 ORDER BY placed_at DESC;
```
Running the Built-in Benchmarks
```shell
# B+ Tree benchmarks
cargo bench --bench btree -p axiomdb-index

# Storage engine benchmarks
cargo bench --bench storage -p axiomdb-storage

# Compare before/after an optimization
cargo bench -- --save-baseline before
# ... make change ...
cargo bench -- --baseline before
```
Benchmarks use Criterion.rs and report mean, standard deviation, and throughput
in a format compatible with critcmp for historical comparison.
Optimization Results — All-Visible Flag + Prefetch (2026-03-24)
Two storage-level optimizations implemented on branch research/pg-internals-comparison,
inspired by PostgreSQL internals analysis:
All-Visible Page Flag (optim-A)
After the first sequential scan on a stable table (all rows committed, none deleted),
AxiomDB sets bit 0 of PageHeader.flags. Subsequent scans skip per-slot MVCC
visibility tracking for those pages — 1 flag check per page instead of N per-slot
comparisons.
Impact on DELETE: scan_rids_visible() (used before batch delete) goes faster
because most pages are all-visible after INSERT → COMMIT. Measured improvement on
10K-row DELETE: 10 ms → 7 ms (30% faster).
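The fast path can be sketched as follows (field and flag names are illustrative):

```rust
// Sketch of the all-visible fast path (optim-A): if bit 0 of the page's
// flags is set, every slot on the page is visible to every snapshot, so
// the scan can skip per-slot MVCC checks entirely.
const ALL_VISIBLE: u8 = 1 << 0;

struct PageHeader {
    flags: u8,
}

fn count_visible(header: &PageHeader, slot_xmins: &[u64], snapshot_xid: u64) -> usize {
    if header.flags & ALL_VISIBLE != 0 {
        // 1 flag check per page instead of N per-slot comparisons.
        return slot_xmins.len();
    }
    // Simplified per-slot visibility: row is visible if its creating
    // transaction committed before the snapshot.
    slot_xmins.iter().filter(|&&xmin| xmin <= snapshot_xid).count()
}

fn main() {
    let hot = PageHeader { flags: ALL_VISIBLE };
    let cold = PageHeader { flags: 0 };
    let xmins = [5, 10, 15, 20];
    assert_eq!(count_visible(&hot, &xmins, 12), 4); // flag short-circuits
    assert_eq!(count_visible(&cold, &xmins, 12), 2); // per-slot check: 5, 10
}
```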
Sequential Scan Prefetch Hint (optim-C)
MmapStorage now calls madvise(MADV_SEQUENTIAL) before every sequential heap
scan. The OS kernel begins async read-ahead for following pages, overlapping I/O
with processing of the current page.
Impact: Measurable on cold-cache workloads (pages not in OS page cache). No regression on warm cache.
Benchmark after both optimizations (wire protocol, Apple M2 Pro)
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 (warm) |
|---|---|---|---|---|
| INSERT batch 10K | 150ms · 67K r/s | 301ms · 33K r/s | 278ms · 36K r/s | 737ms · 14K r/s |
| SELECT * 10K | 53ms · 188K r/s | 48ms · 208K r/s | 49ms · 206K r/s | 5ms · 2.1M r/s |
| DELETE 10K (no WHERE) | 13ms · 779K r/s | 102ms · 98K r/s | 7ms · 1.4M r/s | 6ms · 1.6M r/s |
The combination of `WalEntry::Truncate` (1 WAL entry instead of N) and the
all-visible flag (skipping MVCC scan overhead) eliminates the two main costs
in full-table deletion.
Architecture Overview
AxiomDB is organized as a Cargo workspace of purpose-built crates. Each crate has a single responsibility and depends only on crates below it in the stack. The layering prevents circular dependencies and makes each component independently testable.
Layer Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ ENTRY POINTS │
│ │
│ axiomdb-server axiomdb-embedded │
│ (TCP daemon, (Rust API + C FFI, │
│ MySQL wire protocol) in-process library) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ NETWORK LAYER │
│ │
│ axiomdb-network │
│ └── mysql/ │
│ ├── codec.rs (MySqlCodec — 4-byte packet framing) │
│ ├── packets.rs (HandshakeV10, HandshakeResponse41, OK, ERR) │
│ ├── auth.rs (mysql_native_password SHA1 + caching_sha2_password)│
│ ├── charset.rs (charset/collation registry, encode_text/decode_text)│
│ ├── session.rs (ConnectionState — typed charset fields, │
│ │ prepared stmt cache, pending long data) │
│ ├── handler.rs (handle_connection — async task per TCP conn) │
│ ├── result.rs (QueryResult → result-set packets, charset-aware)│
│ ├── error.rs (DbError → MySQL error code + SQLSTATE) │
│ └── database.rs (Arc<RwLock<Database>> wrapper) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ axiomdb-sql │
│ ├── lexer (logos DFA, zero-copy tokens) │
│ ├── parser (recursive descent, LL(1)/LL(2)) │
│ ├── ast (Stmt, Expr, SelectStmt, InsertStmt, ...) │
│ ├── analyzer (BindContext, col_idx resolution, catalog lookup) │
│ ├── eval (expression evaluator, three-valued NULL logic, │
│ │ CASE WHEN searched + simple form, short-circuit) │
│ ├── result (QueryResult, ColumnMeta, Row — executor return type)│
│ ├── table (TableEngine — heap DML; clustered guard rails today)│
│ ├── index_integrity (startup index-vs-heap verifier; skips clustered)│
│ └── executor/ (mod.rs facade + select/insert/update/delete/ddl/ │
│ join/aggregate/shared modules; same execute() API; │
│ GROUP BY + HAVING + ORDER BY + LIMIT/OFFSET + │
│ INSERT … SELECT) │
│ │
│ [query planner, optimizer — Phase 6] │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ TRANSACTION LAYER │
│ │
│ axiomdb-mvcc (TxnManager, snapshot isolation, SSI) │
│ axiomdb-wal (WalWriter, WalReader, crash recovery) │
│ axiomdb-catalog (CatalogBootstrap, CatalogReader, schema) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ INDEX LAYER │
│ │
│ axiomdb-index (BTree CoW, RangeIter, prefix compression) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ STORAGE LAYER │
│ │
│ axiomdb-storage (StorageEngine trait, MmapStorage, │
│ MemoryStorage, FreeList, heap pages) │
└──────────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────────┐
│ TYPE FOUNDATION │
│ │
│ axiomdb-types (Value, DataType, row codec) │
│ axiomdb-core (DbError, RecordId, TransactionSnapshot, │
│ PageId, LsnId, common types) │
└─────────────────────────────────────────────────────────────────────┘
│
┌──────────▼────────┐
│ axiomdb.db │ ← mmap pages (16 KB each)
│ axiomdb.wal │ ← WAL append-only log
└───────────────────┘
Crate Responsibilities
axiomdb-core
The dependency-free foundation. Contains:
- `DbError` — the single error enum used by all other crates, built on `thiserror`
- `dsn` — shared DSN parser and typed normalized output: `ParsedDsn`, `WireEndpointDsn`, `LocalPathDsn`
- `RecordId` — physical location of a row: `(page_id: u64, slot_id: u16)`, 10 bytes
- `TransactionSnapshot` — snapshot ID and visibility predicate for MVCC
- `PageId`, `LsnId` — type aliases that document intent

axiomdb-core sits at the bottom of the stack and depends on no other crate in the workspace. Keeping the `dsn` module in axiomdb-core lets each consumer validate only the subset it actually supports. This avoids duplicating URI logic in both axiomdb-server and axiomdb-embedded.
axiomdb-types
SQL value representation and binary serialization:
- `Value` — the in-memory enum (`Null`, `Bool`, `Int`, `BigInt`, `Real`, `Decimal`, `Text`, `Bytes`, `Date`, `Timestamp`, `Uuid`)
- `DataType` — schema descriptor for a column's type (mirrors `axiomdb-core::DataType` but with the full type system, including parameterized types)
- `encode_row` / `decode_row` — binary codec from `&[Value]` to `&[u8]` and back
- `encoded_len` — O(n) size computation without allocation
axiomdb-storage
The raw page I/O layer:
- `StorageEngine` trait — `read_page`, `write_page`, `alloc_page`, `free_page`, `flush`
- `MmapStorage` — maps the `.db` file with `memmap2`; pages are directly accessible as `&Page` references into the mapped region
- `MemoryStorage` — `Vec<Page>` in RAM for tests and in-memory databases
- `FreeList` — bitmap tracking free pages; scans left-to-right for the first free bit
- `Page` — 16 KB struct with 64-byte header (magic, type, checksum, page_id, LSN, free_start, free_end) and 16,320-byte body
- Heap page format — slotted page with null bitmap and tuples growing from the end toward the beginning
- Same-slot tuple rewrite helpers — used by the stable-RID UPDATE path to overwrite a row in place when the new encoded row still fits inside the existing slot
axiomdb-index
The Copy-on-Write B+ Tree:
- `BTree` — the public tree type; wraps a `StorageEngine` and an `AtomicU64` root
- `RangeIter` — lazy iterator for range scans; traverses the tree to cross leaf boundaries
- `InternalNodePage` / `LeafNodePage` — `#[repr(C)]` structs with `bytemuck::Pod` for zero-copy serialization
- `prefix` module — `CompressedNode` for in-memory prefix compression of internal keys
axiomdb-wal
Append-only Write-Ahead Log:
- `WalWriter` — appends `WalEntry` records with CRC32c checksums; manages the file header
- `WalReader` — stateless; opens a file handle per scan; supports both forward and backward iteration (backward scan uses `entry_len_2` at the tail of each record)
- `WalEntry` — binary-serializable record with LSN, txn_id, entry type, table_id, key, old_value, new_value, and checksum
- `EntryType::UpdateInPlace` — stable-RID same-slot UPDATE record used by rollback and crash recovery to restore the old tuple image at the same `(page_id, slot_id)`
- Crash recovery state machine — `CRASHED → RECOVERING → REPLAYING_WAL → VERIFYING → READY`
axiomdb-catalog
Schema persistence and lookup:
- `CatalogBootstrap` — creates the three system tables (`axiom_tables`, `axiom_columns`, `axiom_indexes`) in the meta page on first open
- `CatalogReader` — reads schema from the system tables for use by the analyzer and executor; uses a `TransactionSnapshot` for MVCC-consistent reads
- Schema types: `TableDef`, `ColumnDef`, `IndexDef`
- `TableDef` now carries `root_page_id` plus `TableStorageLayout::{Heap, Clustered}`
- `CatalogWriter::create_table_with_layout(...)` allocates either a heap or clustered table root
axiomdb-mvcc
Transaction management and snapshot isolation:
- `TxnManager` — assigns transaction IDs, tracks active transactions, assigns snapshots on `BEGIN`
- `RowHeader` — embedded in each heap row: `(xmin, xmax, deleted)` for visibility
- MVCC visibility function — determines whether a row version is visible to a snapshot
axiomdb-sql
The SQL processing pipeline:
- `lexer` — logos-based DFA; ~85 tokens; zero-copy `&'src str` identifiers
- `ast` — all statement types: `SelectStmt`, `InsertStmt`, `UpdateStmt`, `DeleteStmt`, `CreateTableStmt`, `CreateIndexStmt`, `DropTableStmt`, `DropIndexStmt`, `AlterTableStmt`
- `expr` — `Expr` enum for the expression tree: `BinaryOp`, `UnaryOp`, `Column`, `Literal`, `IsNull`, `Between`, `Like`, `In`, `Case`, `Function`, `Param { idx: usize }` (positional `?` placeholder resolved at execute time)
- `parser` — recursive descent; expression sub-parser with full operator precedence; parses `GROUP BY`, `HAVING`, `ORDER BY` with `NULLS FIRST/LAST`, `LIMIT/OFFSET`, `SELECT DISTINCT`, `INSERT … SELECT`, and both forms of `CASE WHEN`
- `analyzer` — `BindContext` / `BoundTable`; resolves `col_idx` for JOINs
- `eval/` — directory module rooted at `eval/mod.rs`; exports the same evaluator API as before, but splits internals into `context.rs` (collation and subquery runners), `core.rs` (recursive `Expr` evaluation), `ops.rs` (comparisons, boolean logic, `IN`, `LIKE`), and `functions/` (scalar built-ins by family)
- `result` — `QueryResult` enum (`Rows`/`Affected`/`Empty`), `ColumnMeta` (name, data_type, nullable, table_name), `Row = Vec<Value>`; the contract between the executor and all callers (embedded API, wire protocol, CLI)
- `table` — `TableEngine` — heap DML; clustered guard rails today
- `index_integrity` — startup-time verification that compares every catalog-visible index against heap-visible rows after WAL recovery and rebuilds readable divergent indexes before open returns; clustered tables are currently skipped because their PRIMARY KEY metadata reuses the clustered root
- `executor/` — directory module rooted at `executor/mod.rs`; the facade still exports `execute`, `execute_with_ctx`, and `last_insert_id_value`, but the implementation is now split into `shared.rs`, `select.rs`, `joins.rs`, `aggregate.rs`, `insert.rs`, `update.rs`, `delete.rs`, `bulk_empty.rs`, `ddl.rs`, and `staging.rs`. Capabilities remain the same: `GROUP BY` with hash-based aggregation (`COUNT(*)`, `COUNT(col)`, `SUM`, `MIN`, `MAX`, `AVG` with proper NULL exclusion), `HAVING` post-filter, `ORDER BY` with multi-column sort keys and per-column `NULLS FIRST/LAST` control, `LIMIT n OFFSET m` for pagination, `SELECT DISTINCT` with NULL-equality dedup (two NULL values are considered equal for deduplication), and `INSERT … SELECT` for bulk copy and aggregate materialization
- Clustered tables now enter the catalog through `CREATE TABLE ... PRIMARY KEY ...`:
  - 39.14 adds a dedicated clustered `INSERT` branch in `executor/insert.rs`
  - 39.15 adds a dedicated clustered `SELECT` branch in `executor/select.rs`
  - 39.16 adds a dedicated clustered `UPDATE` branch in `executor/update.rs`
  - 39.17 adds a dedicated clustered `DELETE` branch in `executor/delete.rs`
  - 39.18 adds clustered `VACUUM` maintenance in `axiomdb-sql/src/vacuum.rs`
  - 39.19 adds legacy heap→clustered rebuild in `executor/ddl.rs`
- Stable-RID UPDATE fast path — same-slot heap rewrite that preserves `RecordId` when the new encoded row fits and makes untouched-index skipping safe
- UPDATE apply fast path — indexed UPDATE now batches candidate heap reads, filters no-op rows before heap mutation, batches `UpdateInPlace` WAL append, and groups per-index delete+insert/root persistence on the remaining rows
- Transactional INSERT staging — explicit transactions can buffer consecutive `INSERT ... VALUES` rows in `SessionContext`, then flush them through one grouped heap/index pass at the next barrier statement or `COMMIT`
- Indexed multi-row INSERT batch path — the immediate `INSERT ... VALUES (...), (...)` path now reuses the same grouped physical apply helpers as staged flushes even when the table has PRIMARY KEY or secondary indexes; the immediate path keeps strict same-statement UNIQUE checking and therefore does not reuse the staged `committed_empty` shortcut
- Clustered INSERT branch — explicit-PK tables now bypass heap staging entirely, derive PK bytes from clustered primary-index metadata, write directly through `clustered_tree`, maintain clustered secondary bookmarks, and make rollback delete undo keys from the current catalog root instead of trusting stale pre-split roots
- Clustered rebuild branch — legacy heap+PRIMARY KEY tables now rebuild into a fresh clustered root, rebuild secondaries as PK-bookmark indexes, flush those new roots, then swap catalog metadata and defer old-page free until commit

PostgreSQL's `heap_multi_insert()` and DuckDB's appender both inspired the shared grouped-write layer. AxiomDB adapts that physical apply pattern, but rejects reusing the staged bulk-load shortcut on immediate multi-row INSERT because duplicate keys inside one SQL statement must still fail atomically and before any partial batch becomes visible.
axiomdb-network
The MySQL wire protocol implementation. Lives in crates/axiomdb-network/src/mysql/:
| Module | Responsibility |
|---|---|
| codec.rs | MySqlCodec — tokio_util framing codec; reads/writes the 4-byte header (u24 LE payload length + u8 sequence ID) |
| packets.rs | Builders for HandshakeV10, HandshakeResponse41, OK, ERR, EOF; length-encoded integer/string helpers |
| auth.rs | gen_challenge (20-byte CSPRNG), verify_native_password (SHA1-XOR), is_allowed_user allowlist |
| charset.rs | Static charset/collation registry; decode_text/encode_text using encoding_rs; supports utf8mb4, utf8mb3, latin1 (cp1252), binary |
| session.rs | ConnectionState — typed client_charset, connection_collation, results_collation fields; SET NAMES; decode_client_text/encode_result_text |
| handler.rs | handle_connection — async task per TCP connection; explicit CONNECTED → AUTH → IDLE → EXECUTING → CLOSING lifecycle |
| result.rs | serialize_query_result — QueryResult → column_count + column_defs + EOF + rows + EOF packets; charset-aware row encoding |
| error.rs | dberror_to_mysql — maps every DbError variant to a MySQL error code + SQLSTATE |
| database.rs | Database wrapper — owns storage + txn, runs WAL recovery and startup index verification, exposes execute_query |
Connection lifecycle
TCP accept
│
▼ (seq 0)
Server → HandshakeV10
│ 20-byte random challenge, capabilities, server version
│ auth_plugin_name = "caching_sha2_password"
│
▼ (seq 1)
Client → HandshakeResponse41
│ username, auth_response (SHA1-XOR token or caching_sha2 token),
│ capabilities, auth_plugin_name
│
▼ (seq 2) — two paths depending on the plugin negotiated:
│
│ mysql_native_password path:
│ └── Server → OK (permissive mode: username in allowlist → accepted)
│
│ caching_sha2_password path (MySQL 8.0+ default):
│ ├── Server → AuthMoreData(0x03) ← fast_auth_success indicator
│ ├── Client → empty ack packet ← pymysql sends this automatically
│ └── Server → OK
│
▼ COMMAND LOOP
│
├── COM_QUERY (0x03) → parse SQL → intercept? → execute → result packets
├── COM_PING (0x0e) → OK
├── COM_INIT_DB (0x02) → updates current_database in ConnectionState + OK
├── COM_RESET_CONNECTION (0x1f) → resets ConnectionState, preserves transport lifecycle metadata + OK
├── COM_STMT_PREPARE (0x16) → parse SQL with ? placeholders → stmt_ok packet
├── COM_STMT_SEND_LONG_DATA (0x18) → append raw bytes to stmt-local buffers, no reply
├── COM_STMT_EXECUTE (0x17) → merge long data + decode params → substitute → execute → result packets
├── COM_STMT_RESET (0x1a) → clear stmt-local long-data state → OK
├── COM_STMT_CLOSE (0x19) → remove from cache, no response
└── COM_QUIT (0x01) → close
Explicit lifecycle state machine (5.11c)
5.11c moved transport/runtime concerns out of ConnectionState into
mysql/lifecycle.rs. ConnectionState still owns SQL session variables,
prepared statements, warnings, and session counters. ConnectionLifecycle
owns only:
- current transport phase
- client capability flags relevant to lifecycle policy
- timeout policy per phase
- socket-level configuration (`TCP_NODELAY`, `SO_KEEPALIVE`)
| Phase | Entered when | Timeout policy |
|---|---|---|
| CONNECTED | socket accepted, before first packet | no read yet; greeting write uses auth timeout |
| AUTH | handshake/auth exchange starts | fixed 10s auth timeout for reads/writes |
| IDLE | between commands | interactive_timeout if CLIENT_INTERACTIVE, otherwise wait_timeout |
| EXECUTING | after a command packet is accepted | packet writes use net_write_timeout; any future in-flight reads use net_read_timeout |
| CLOSING | COM_QUIT, EOF, timeout, or transport error | terminal state before handler return |
COM_RESET_CONNECTION recreates ConnectionState::new() and resets session timeout
variables to their defaults, but it does not recreate ConnectionLifecycle. That
means the connection remains interactive or non-interactive according to the
original handshake, even after reset.
Prepared statements (prepared.rs)
Prepared statements allow a client to send SQL once and execute it many times with different parameters, avoiding repeated parsing and enabling binary parameter encoding that is more efficient than string escaping.
Protocol flow:
Client → COM_STMT_PREPARE (SQL with ? placeholders)
│
Server reads the SQL, counts ? placeholders, assigns a stmt_id.
│
Server → Statement OK packet
│ stmt_id: u32
│ num_columns: u16 (columns in the result set, or 0 for DML)
│ num_params: u16 (number of ? placeholders)
│ followed by num_params parameter-definition packets + EOF
│ followed by num_columns column-definition packets + EOF
│
Client → COM_STMT_SEND_LONG_DATA (optional, repeatable)
│ stmt_id: u32
│ param_id: u16
│ raw chunk bytes
│
Server appends raw bytes to stmt-local state, sends no response.
│
Client → COM_STMT_EXECUTE
│ stmt_id: u32
│ flags: u8 (0 = CURSOR_TYPE_NO_CURSOR)
│ iteration_count: u32 (always 1)
│ null_bitmap: ceil(num_params / 8) bytes (one bit per param)
│ new_params_bound_flag: u8 (1 = type list follows)
│ param_types: [u8; num_params * 2] (type byte + unsigned flag)
│ param_values: binary-encoded values for non-NULL params
│
Server → result set packets (same text-protocol format as COM_QUERY)
│
Client → COM_STMT_CLOSE (stmt_id) — no response expected
Binary parameter decoding (decode_binary_value):
Each parameter is decoded according to its MySQL type byte:
| MySQL type byte | Type name | Decoded as |
|---|---|---|
| 0x01 | TINY | i8 → Value::Int |
| 0x02 | SHORT | i16 → Value::Int |
| 0x03 | LONG | i32 → Value::Int |
| 0x08 | LONGLONG | i64 → Value::BigInt |
| 0x04 | FLOAT | f32 → Value::Real |
| 0x05 | DOUBLE | f64 → Value::Real |
| 0x0a | DATE | 4-byte packed date → Value::Date |
| 0x07 / 0x0c | TIMESTAMP / DATETIME | 7-byte packed datetime → Value::Timestamp |
| 0xfd / 0xfe / 0x0f | VAR_STRING / STRING / VARCHAR | lenenc bytes → Value::Text |
| 0xf9 / 0xfa / 0xfb / 0xfc | TINY_BLOB / MEDIUM_BLOB / LONG_BLOB / BLOB | lenenc bytes → Value::Bytes |
NULL parameters are identified by the null-bitmap before the type list is read;
they produce Value::Null without consuming any bytes from the value region.
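A sketch of this decoding order, covering just two of the type bytes above (the enum and function are illustrative, not the real decoder):

```rust
// Sketch of COM_STMT_EXECUTE parameter decoding: the null bitmap is
// consulted first (bit i of byte i/8), and NULL params consume no bytes
// from the value region. Only LONG (0x03) and LONGLONG (0x08) are shown.
#[derive(Debug, PartialEq)]
enum Value {
    Null,
    Int(i64),
    BigInt(i64),
}

fn decode_params(null_bitmap: &[u8], types: &[u8], mut values: &[u8]) -> Vec<Value> {
    let mut out = Vec::new();
    for (i, &ty) in types.iter().enumerate() {
        if null_bitmap[i / 8] & (1 << (i % 8)) != 0 {
            out.push(Value::Null); // NULL: no value bytes consumed
            continue;
        }
        match ty {
            0x03 => { // LONG: i32 little-endian
                let (head, rest) = values.split_at(4);
                values = rest;
                out.push(Value::Int(i32::from_le_bytes(head.try_into().unwrap()) as i64));
            }
            0x08 => { // LONGLONG: i64 little-endian
                let (head, rest) = values.split_at(8);
                values = rest;
                out.push(Value::BigInt(i64::from_le_bytes(head.try_into().unwrap())));
            }
            _ => unimplemented!("sketch covers LONG and LONGLONG only"),
        }
    }
    out
}

fn main() {
    // Three params: LONG=7, NULL (bit 1 of the bitmap set), LONGLONG=9.
    let bitmap = [0b0000_0010u8];
    let types = [0x03, 0x08, 0x08];
    let mut values = 7i32.to_le_bytes().to_vec();
    values.extend_from_slice(&9i64.to_le_bytes());
    assert_eq!(
        decode_params(&bitmap, &types, &values),
        vec![Value::Int(7), Value::Null, Value::BigInt(9)]
    );
}
```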
Long-data buffering (COM_STMT_SEND_LONG_DATA):
PreparedStatement owns stmt-local pending buffers:
```rust
pub struct PreparedStatement {
    // ...
    pub pending_long_data: Vec<Option<Vec<u8>>>,
    pub pending_long_data_error: Option<String>,
}
```
Rules:
- chunks are appended as raw bytes in `handler.rs`
- `COM_STMT_SEND_LONG_DATA` never takes the `Database` mutex
- the next `COM_STMT_EXECUTE` consumes pending long data before inline values
- long data wins over both the inline execute payload and the null bitmap
- state is cleared immediately after every execute attempt
- `COM_STMT_RESET` clears only this long-data state, not the cached plan
AxiomDB follows MariaDB’s COM_STMT_SEND_LONG_DATA model here: accumulate raw
bytes per placeholder and decode them only at execute time. That keeps chunked
multibyte text correct without dragging the command through the engine path.
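The merge rule at execute time can be sketched as follows (types and names illustrative, not the real handler code):

```rust
// Sketch of the execute-time merge described above: a parameter that
// received COM_STMT_SEND_LONG_DATA chunks uses the accumulated bytes and
// wins over both the inline value and the null bitmap; the pending state
// is cleared after every execute attempt.
#[derive(Debug, PartialEq)]
enum Param {
    Null,
    Bytes(Vec<u8>),
}

fn merge_params(
    pending_long_data: &mut Vec<Option<Vec<u8>>>,
    inline: Vec<Param>,
) -> Vec<Param> {
    let merged = inline
        .into_iter()
        .enumerate()
        .map(|(i, inline_val)| match pending_long_data.get_mut(i).and_then(Option::take) {
            Some(bytes) => Param::Bytes(bytes), // long data wins
            None => inline_val,
        })
        .collect();
    // State is cleared after every execute attempt.
    pending_long_data.iter_mut().for_each(|slot| *slot = None);
    merged
}

fn main() {
    // Param 0 got chunks via SEND_LONG_DATA; param 1 arrived inline.
    let mut pending = vec![Some(b"hello world".to_vec()), None];
    let merged = merge_params(&mut pending, vec![Param::Null, Param::Bytes(b"42".to_vec())]);
    assert_eq!(merged[0], Param::Bytes(b"hello world".to_vec()));
    assert_eq!(merged[1], Param::Bytes(b"42".to_vec()));
    assert!(pending.iter().all(Option::is_none)); // cleared after execute
}
```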
Parameter substitution — AST-level plan cache (substitute_params_in_ast):
COM_STMT_PREPARE runs parse + analyze once and stores the resulting Stmt in
PreparedStatement.analyzed_stmt. On each COM_STMT_EXECUTE, substitute_params_in_ast
walks the cached AST and replaces every Expr::Param { idx } node with
Expr::Literal(params[idx]) in a single O(n) tree walk (~1 µs), then calls
execute_stmt() directly — bypassing parse and analyze entirely.
The ? token is recognized by the lexer as Token::Question and emitted by the parser
as Expr::Param { idx: N } (0-based position). The semantic analyzer passes Expr::Param
through unchanged because the type is not yet known; type resolution happens at execute
time once the binary-encoded parameter values are decoded from the COM_STMT_EXECUTE
packet.
value_to_sql_literal converts each decoded Value to the appropriate Expr::Literal
variant:
- `Value::Null` → `Expr::Literal(Value::Null)`
- `Value::Int` / `BigInt` / `Real` → numeric literal node
- `Value::Text` → text literal node (single-quote escaping preserved at the protocol boundary, not needed in the AST)
- `Value::Date` / `Timestamp` → date/timestamp literal node
An earlier implementation substituted the decoded values textually into the
? markers in the original SQL text and then ran the full parse + analyze
pipeline on each COM_STMT_EXECUTE call (~1.5 µs per execution). Phase 5.13
replaces this with an AST-level plan cache: parse + analyze run once at
COM_STMT_PREPARE time; each execute performs only a tree walk to splice in
the decoded parameter values (~1 µs). MySQL and PostgreSQL use the same strategy —
parsing and planning are separated from execution precisely so that repeated executions
avoid repeated parse overhead.
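The splice itself can be sketched against a toy AST; the real `Expr` enum is much richer:

```rust
// Sketch of the AST-level parameter splice described above: replace every
// Param { idx } node with the decoded literal in one recursive walk. A toy
// Expr enum stands in for axiomdb-sql's real AST.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Param { idx: usize },
    BinaryOp(Box<Expr>, Box<Expr>), // operator elided for brevity
}

fn substitute_params(expr: &mut Expr, params: &[i64]) {
    if let Expr::Param { idx } = expr {
        let i = *idx;
        *expr = Expr::Literal(params[i]);
        return;
    }
    if let Expr::BinaryOp(lhs, rhs) = expr {
        substitute_params(lhs, params);
        substitute_params(rhs, params);
    }
}

fn main() {
    // WHERE user_id = ?  with params = [42]
    let mut expr = Expr::BinaryOp(
        Box::new(Expr::Literal(0)), // stands in for Column("user_id")
        Box::new(Expr::Param { idx: 0 }),
    );
    substitute_params(&mut expr, &[42]);
    assert_eq!(
        expr,
        Expr::BinaryOp(Box::new(Expr::Literal(0)), Box::new(Expr::Literal(42)))
    );
}
```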
COM_STMT_EXECUTE responses use the same text-protocol result-set format as
COM_QUERY (column defs + EOF + text-encoded rows + EOF), not the MySQL
binary result-set format. The binary result-set format requires a separate
CLIENT_PS_MULTI_RESULTS serialization path for every column type and adds
substantial protocol complexity with marginal benefit for typical workloads. The
text-protocol response is fully accepted by PyMySQL, SQLAlchemy, and the mysql
CLI. Binary result-set serialization is deferred to subphase 5.5a when a concrete
performance need arises.
ConnectionState — per-connection session state:
```rust
pub struct ConnectionState {
    pub current_database: String,
    pub autocommit: bool,
    // Typed charset state — negotiated at handshake, updated by SET NAMES
    client_charset: &'static CharsetDef,
    connection_collation: &'static CollationDef,
    results_collation: &'static CollationDef,
    pub variables: HashMap<String, String>,
    pub prepared_statements: HashMap<u32, PreparedStatement>,
    pub next_stmt_id: u32,
}
```
The three charset fields are typed references into the static charset.rs registry.
from_handshake_collation_id(id: u8) initializes all three from the collation id the
client sends in the HandshakeResponse41 packet. Unsupported ids are rejected before
auth with ERR 1115 (ER_UNKNOWN_CHARACTER_SET). SET NAMES <charset> updates all three;
individual SET character_set_client = … updates only the relevant field.
decode_client_text(&[u8]) -> Result<String, DbError> decodes inbound SQL/identifiers.
encode_result_text(&str) -> Result<Vec<u8>, DbError> encodes outbound text columns.
Both are non-lossy — they return DbError::InvalidValue rather than replacement characters.
This mirrors PostgreSQL's client_encoding / server-encoding split, but without the
per-column collation complexity that PostgreSQL adds. All AxiomDB storage is UTF-8;
charset negotiation is purely a wire-layer concern.
```rust
pub struct PreparedStatement {
    pub stmt_id: u32,
    pub sql_template: String,        // original SQL with ? placeholders
    pub param_count: u16,
    pub analyzed_stmt: Option<Stmt>, // cached parse+analyze result (plan cache)
    pub compiled_at_version: u64,    // global schema_version at compile time
    pub deps: PlanDeps,              // per-table OID dependencies (Phase 40.2)
    pub generation: u32,             // incremented on each re-analysis
    pub last_used_seq: u64,
    pub pending_long_data: Vec<Option<Vec<u8>>>,
    pub pending_long_data_error: Option<String>,
}
```
analyzed_stmt is populated by COM_STMT_PREPARE after parse + analyze succeed. On
COM_STMT_EXECUTE, if analyzed_stmt is Some, the handler calls
substitute_params_in_ast on the cached Stmt and invokes execute_stmt() directly,
skipping the parse and analyze steps entirely. If analyzed_stmt is None (should not
occur in normal operation), the handler falls back to the full parse + analyze path.
OID-based staleness check (Phase 40.2):
COM_STMT_EXECUTE uses a two-level check:
- Fast (O(1) atomic compare) — if `compiled_at_version == current_global_schema_version`, no DDL has occurred since compile → skip the catalog scan entirely (zero I/O).
- Slow (O(t) catalog reads, t = tables in deps) — only when the global version has advanced. `PlanDeps::is_stale()` reads each table's current `schema_version` from the catalog heap and compares it to the cached snapshot. If all match → the DDL was on a different table → stamp the new global version and skip re-analysis.
This avoids re-analyzing prepared statements when CREATE INDEX ON other_table runs —
only statements that actually reference the DDL-modified table are re-compiled. PostgreSQL
uses the same approach via RelationOids in CachedPlanSource.
Each connection maintains its own HashMap<u32, PreparedStatement>. Statement IDs are
assigned by incrementing next_stmt_id (starting at 1) and are local to the connection
— the same ID on two connections refers to two different statements. COM_STMT_CLOSE
removes the entry; subsequent COM_STMT_EXECUTE calls for the closed ID return an
Unknown prepared statement error. COM_STMT_RESET leaves the entry in place and
clears only the stmt-local long-data buffers plus any deferred long-data error.
Packet framing and size enforcement (codec.rs — subphase 5.4a)
Every MySQL message in both directions — client to server and server to client — uses the same 4-byte envelope:
[payload_length: u24 LE] [sequence_id: u8] [payload: payload_length bytes]
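The envelope above can be sketched as a pair of helpers (names illustrative, not the codec's actual API):

```rust
// MySQL packet envelope: 3-byte little-endian payload length (u24)
// followed by a 1-byte sequence id.
fn encode_header(payload_len: u32, seq_id: u8) -> [u8; 4] {
    let b = payload_len.to_le_bytes(); // u24: only the low three bytes are used
    [b[0], b[1], b[2], seq_id]
}

fn decode_header(h: [u8; 4]) -> (u32, u8) {
    let payload_len = u32::from_le_bytes([h[0], h[1], h[2], 0]);
    (payload_len, h[3])
}

fn main() {
    // A 261-byte payload with sequence id 3.
    let h = encode_header(0x105, 3);
    assert_eq!(h, [0x05, 0x01, 0x00, 3]);
    assert_eq!(decode_header(h), (0x105, 3));
}
```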
MySqlCodec implements tokio_util::codec::{Decoder, Encoder}. It holds a
configurable max_payload_len (default 64 MiB) that matches the session variable
@@max_allowed_packet.
Two-phase decoder algorithm:
- Scan phase — walk physical packet headers without consuming bytes, accumulating `total_payload`. If `total_payload > max_payload_len`, return `MySqlCodecError::PacketTooLarge { actual, max }` before any buffer allocation. If any fragment is missing, return `Ok(None)` (backpressure).
- Consume phase — advance the buffer and return `(seq_id, Bytes)`. For a single physical fragment this is a zero-copy `split_to` into the existing `BytesMut`. For multi-fragment logical packets one contiguous `BytesMut` is allocated with `capacity = total_payload` to avoid per-fragment copies.
Multi-packet reassembly. MySQL splits commands larger than 16,777,215 bytes
(0xFF_FFFF) across multiple physical packets. A fragment with
payload_length = 0xFF_FFFF signals continuation; the final fragment has
payload_length < 0xFF_FFFF. The limit applies to the reassembled logical payload,
not to each individual fragment.
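The fragmentation rule can be sketched as follows. Note the protocol edge case: a payload that is an exact multiple of `0xFF_FFFF` must be terminated by an empty fragment, because a full-size fragment always signals continuation:

```rust
// MySQL splits a logical payload into physical fragments of at most
// 0xFF_FFFF bytes. A fragment of exactly 0xFF_FFFF signals "more follows",
// so the final fragment is always strictly smaller.
const MAX_FRAGMENT: usize = 0xFF_FFFF;

fn fragment_sizes(total: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut rest = total;
    loop {
        let take = rest.min(MAX_FRAGMENT);
        out.push(take);
        rest -= take;
        if take < MAX_FRAGMENT {
            break; // final fragment is < 0xFF_FFFF (possibly empty)
        }
    }
    out
}

fn main() {
    assert_eq!(fragment_sizes(100), vec![100]);
    // Exact multiple: a trailing empty fragment terminates the sequence.
    assert_eq!(fragment_sizes(MAX_FRAGMENT), vec![MAX_FRAGMENT, 0]);
    assert_eq!(fragment_sizes(MAX_FRAGMENT + 1), vec![MAX_FRAGMENT, 1]);
}
```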
Live per-connection limit. handle_connection calls
reader.decoder_mut().set_max_payload_len(n):
- After auth (from `conn_state.max_allowed_packet_bytes()`)
- After a valid `SET max_allowed_packet = N`
- After `COM_RESET_CONNECTION` (restores `DEFAULT_MAX_ALLOWED_PACKET`)
Oversize behavior. On PacketTooLarge, the handler sends MySQL ERR
1153 / SQLSTATE 08S01 (“Got a packet bigger than ‘max_allowed_packet’ bytes”) and
breaks the connection loop. The stream is never re-used — re-synchronisation after an
oversize packet is unsafe.
The size check runs inside `MySqlCodec::decode()`, before the payload reaches
UTF-8 decoding, SQL parsing, or binary-protocol decoding. MySQL 8 and MariaDB enforce
max_allowed_packet at the network I/O layer for the same reason: a SQL
parser that receives an oversized payload has already spent the memory allocating it.
Rejecting at the codec boundary means zero heap allocation for oversized inputs.
Result set serialization (result.rs — subphase 5.5a)
AxiomDB has two result serializers sharing the same column_count + column_defs + EOF
framing but differing in row encoding:
| Serializer | Used for | Row format |
|---|---|---|
| `serialize_query_result` | COM_QUERY | Text protocol — NULL = `0xfb`, values as lenenc ASCII strings |
| `serialize_query_result_binary` | COM_STMT_EXECUTE | Binary protocol — null bitmap + fixed-width/lenenc values |
Both paths produce the same packet sequence shape:
column_count (lenenc integer)
column_def_1 (lenenc strings: catalog, schema, table, org_table, name, org_name
+ 12-byte fixed section: charset, display_len, type_byte, flags, decimals)
…
column_def_N
EOF
row_1
…
row_M
EOF
Binary row packet layout:
0x00 row header (always)
null_bitmap[ceil((N+2)/8)] MySQL offset-2 null bitmap: column i → bit (i+2)
value_0 ... value_k non-null values in column order (no per-cell headers)
The null bitmap uses MySQL’s prepared-row offset of 2 — bits 0 and 1 are reserved. Column 0 → bit 2, column 1 → bit 3, and so on.
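The offset-2 bitmap described above can be sketched directly:

```rust
// Binary-protocol null bitmap with MySQL's result-row offset of 2:
// column i maps to bit (i + 2); bits 0 and 1 are reserved.
fn null_bitmap(nulls: &[bool]) -> Vec<u8> {
    let n = nulls.len();
    let mut bitmap = vec![0u8; (n + 2 + 7) / 8]; // ceil((N + 2) / 8) bytes
    for (i, &is_null) in nulls.iter().enumerate() {
        if is_null {
            let bit = i + 2;
            bitmap[bit / 8] |= 1 << (bit % 8);
        }
    }
    bitmap
}

fn main() {
    // 3 columns, column 1 is NULL → bit 3 set → 0b0000_1000.
    assert_eq!(null_bitmap(&[false, true, false]), vec![0b0000_1000]);
    // 7 columns need ceil(9 / 8) = 2 bytes.
    assert_eq!(null_bitmap(&[false; 7]).len(), 2);
}
```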
Binary cell encoding per type:
| AxiomDB type | Encoding |
|---|---|
| `Bool` | 1 byte: `0x00` or `0x01` |
| `Int` | 4-byte signed LE |
| `BigInt` | 8-byte signed LE |
| `Real` | 8-byte IEEE-754 LE (`f64`) |
| `Decimal` | lenenc ASCII decimal string (exact, no float rounding) |
| `Text` | lenenc UTF-8 bytes |
| `Bytes` | lenenc raw bytes (no UTF-8 conversion) |
| `Date` | `[4][year u16 LE][month u8][day u8]` |
| `Timestamp` | `[7][year u16 LE][month][day][h][m][s]` or `[11][...][micros u32 LE]` |
| `Uuid` | lenenc canonical UUID string |
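As a worked instance of the table above, the binary `Date` cell is a 1-byte length (4) followed by the fixed-width fields (helper name illustrative):

```rust
// Binary-protocol DATE cell: [4][year u16 LE][month u8][day u8].
fn encode_binary_date(year: u16, month: u8, day: u8) -> Vec<u8> {
    let y = year.to_le_bytes();
    vec![4, y[0], y[1], month, day]
}

fn main() {
    // 2024-03-15 → [4, 0xE8, 0x07, 3, 15], since 2024 = 0x07E8.
    assert_eq!(encode_binary_date(2024, 3, 15), vec![4, 0xE8, 0x07, 3, 15]);
}
```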
Column type codes (shared between both serializers):
| AxiomDB type | MySQL type byte | MySQL name |
|---|---|---|
| `Int` | 0x03 | LONG |
| `BigInt` | 0x08 | LONGLONG |
| `Real` | 0x05 | DOUBLE |
| `Decimal` | 0xf6 | NEWDECIMAL |
| `Text` | 0xfd | VAR_STRING |
| `Bytes` | 0xfc | BLOB |
| `Bool` | 0x01 | TINY |
| `Date` | 0x0a | DATE |
| `Timestamp` | 0x07 | TIMESTAMP |
| `Uuid` | 0xfd | VAR_STRING |
Both serializers share a single `build_column_def()` function
and one `datatype_to_mysql_type()` mapping. This guarantees that the type
byte in column metadata always agrees with the wire encoding of the row values.
A divergence (e.g., advertising LONGLONG but sending ASCII digits) would
cause silent data corruption on the client — a class of bug that is impossible when
there is only one mapping.
COM_QUERY OID-based plan cache (plan_cache.rs — Phase 40.2)
Repeated ad-hoc queries like SELECT * FROM users WHERE id = 42 arrive with different
literal values on each call. The plan cache normalizes literals to ? placeholders,
hashes the result, and caches the fully analyzed AST. Subsequent queries with the same
structure (e.g., id = 99) skip parse + analyze (~5 µs) and reuse the cached Stmt.
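A toy sketch of the normalization step: fold standalone integer literals into `?` so structurally identical queries produce the same cache key. The real normalizer operates on the token stream and also handles strings, floats, and negative numbers; this simplified character-level version only folds digit runs that do not extend an identifier:

```rust
// Simplified literal normalization for the plan-cache key.
fn normalize(sql: &str) -> String {
    let mut out = String::new();
    let mut prev_ident = false; // previous char could extend an identifier
    let mut chars = sql.chars().peekable();
    while let Some(c) = chars.next() {
        if c.is_ascii_digit() && !prev_ident {
            // Consume the whole digit run and emit one placeholder.
            while chars.peek().is_some_and(|d| d.is_ascii_digit()) {
                chars.next();
            }
            out.push('?');
            prev_ident = false;
        } else {
            prev_ident = c.is_ascii_alphanumeric() || c == '_';
            out.push(c);
        }
    }
    out
}

fn main() {
    // Same structure, different literals → same cache key.
    assert_eq!(
        normalize("SELECT * FROM users WHERE id = 42"),
        normalize("SELECT * FROM users WHERE id = 99"),
    );
    // Digits inside identifiers are untouched.
    assert_eq!(normalize("SELECT a1 FROM t2"), "SELECT a1 FROM t2");
}
```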
Entry structure (CachedPlanSource):
```rust
struct CachedPlanSource {
    stmt: Stmt,           // fully analyzed AST
    deps: PlanDeps,       // (table_id, schema_version) per referenced table
    param_count: usize,   // expected literal count for structural match
    generation: u32,      // incremented on each re-store after stale eviction
    exec_count: u64,      // lifetime hit counter
    last_used_seq: u64,   // LRU clock value
    last_validated_global_version: u64, // fast pre-check stamp
}
```
Two-level staleness check:
- Fast (`O(1)`): if `global_schema_version == last_validated_global_version`, no DDL has occurred since last validation → cache hit with zero catalog I/O.
- Slow (`O(t)` catalog reads): called only when the global version advanced. `PlanDeps::is_stale()` reads each table's current `schema_version` from the catalog heap and compares to the cached snapshot. If any dep mismatches → evict. If all match → stamp the new global version (future lookups hit the fast path again).
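The two-level check can be sketched with simplified in-memory types (the struct and function here are illustrative stand-ins for `CachedPlanSource` and `PlanDeps::is_stale()`):

```rust
use std::collections::HashMap;

struct CachedPlan {
    deps: Vec<(u64, u64)>, // (table_id, schema_version at compile time)
    last_validated_global_version: u64,
}

/// Returns true if the cached plan is still usable.
fn validate(plan: &mut CachedPlan, global: u64, catalog: &HashMap<u64, u64>) -> bool {
    if plan.last_validated_global_version == global {
        return true; // fast path: no DDL anywhere since last validation
    }
    // Slow path: O(t) catalog reads over this plan's dependencies.
    let fresh = plan
        .deps
        .iter()
        .all(|&(tid, ver)| catalog.get(&tid) == Some(&ver));
    if fresh {
        plan.last_validated_global_version = global; // re-stamp
    }
    fresh
}

fn main() {
    let mut catalog = HashMap::from([(1u64, 7u64), (2, 3)]);
    let mut plan = CachedPlan { deps: vec![(1, 7)], last_validated_global_version: 10 };
    assert!(validate(&mut plan, 10, &catalog)); // fast path hit
    catalog.insert(2, 4); // DDL on an unrelated table advances global to 11
    assert!(validate(&mut plan, 11, &catalog)); // slow path, deps still match
    catalog.insert(1, 8); // DDL on table 1
    assert!(!validate(&mut plan, 12, &catalog)); // stale → evict
}
```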
Belt-and-suspenders invalidation:
- Lazy (primary): `is_stale()` at lookup time catches cross-connection DDL.
- Eager (secondary): `invalidate_table(table_id)`, called immediately after same-connection DDL, removes all entries whose `deps` include `table_id`. DDL functions in `executor/ddl.rs` also call `bump_table_schema_version(table_id)` via `CatalogWriter` so the per-table counter advances regardless of which connection holds the plan.
OID dependency extraction (plan_deps.rs):
extract_table_deps(stmt, catalog_reader, database) walks the analyzed Stmt and
resolves every table reference to its (TableId, schema_version) at compile time:
- `SELECT` — FROM, JOINs, scalar subqueries in WHERE/HAVING/columns/ORDER BY/GROUP BY
- `INSERT … SELECT` — target table + all tables in the SELECT
- `UPDATE`, `DELETE` — target table + subqueries in WHERE
- `EXPLAIN` — recurses into the wrapped statement
- DDL statements — return an empty `PlanDeps` (never cached)
LRU eviction: when max_entries (512) is reached, the entry with the lowest
last_used_seq is evicted. O(n) scan over ≤512 entries — called only on capacity overflow,
never on the hot lookup path.
PostgreSQL's plancache.c uses per-entry RelationOids to limit invalidation
to plans that reference the modified table. AxiomDB mirrors that approach:
a CREATE INDEX ON users(email) evicts only plans that reference users
— plans on orders, products, and other tables survive untouched.
ORM query interception (handler.rs)
MySQL drivers and ORMs send several queries automatically before any user SQL:
SET NAMES, SET autocommit, SELECT @@version, SELECT @@version_comment,
SELECT DATABASE(), SELECT @@sql_mode, SELECT @@lower_case_table_names,
SELECT @@max_allowed_packet, SHOW WARNINGS, SHOW DATABASES.
intercept_special_query matches these by prefix/content and returns pre-built
packet sequences without touching the engine. Without this interception, most clients
fail to connect because they receive ERR packets for mandatory queries.
ON_ERROR session behavior (executor.rs, database.rs, subphase 5.2c)
ON_ERROR is implemented as one typed session enum shared by both layers that
own statement execution:
| Layer | State owner | Responsibility |
|---|---|---|
| SQL executor | SessionContext.on_error | Controls rollback policy for executor-time failures |
| Wire/session layer | ConnectionState.on_error | Exposes SET on_error, @@on_error, SHOW VARIABLES, and reset semantics |
This split is required by the current AxiomDB architecture. handler.rs
intercepts SET and SELECT @@var before the engine, but database.rs owns
the full parse -> analyze -> execute_with_ctx pipeline. A wire-only flag would
leave embedded execution inconsistent; an executor-only flag would make the MySQL
session variables lie.
Execution modes:
| Mode | Active transaction error | First failing DML with autocommit=0 | Parse/analyze failure |
|---|---|---|---|
| `rollback_statement` | rollback to statement boundary, txn stays open | full rollback, txn closes | return ERR, txn state unchanged |
| `rollback_transaction` | eager full rollback, txn closes | eager full rollback, txn closes | eager full rollback if txn active |
| `savepoint` | same as `rollback_statement` | keep implicit txn open after rolling back the failing DML | return ERR, txn state unchanged |
| `ignore` | ignorable SQL errors -> warning + continue; non-ignorable runtime errors -> eager full rollback + ERR | ignorable SQL errors -> warning + continue; non-ignorable runtime errors -> eager full rollback + ERR | same split as active txn |
ignore reuses the existing SHOW WARNINGS path. For ignorable SQL/user
errors, database.rs maps the original DbError to the corresponding MySQL
warning code/message and returns QueryResult::Empty, which the serializer
turns into an OK packet with warning_count > 0. For non-ignorable errors
(DiskFull, WAL failures, storage/runtime corruption), the error still
surfaces as ERR and the transaction is eagerly rolled back if one is active.
SHOW STATUS — server and session counters (status.rs, subphase 5.9c)
MySQL clients, ORMs, and monitoring tools (PMM, Datadog MySQL integration, ProxySQL)
call SHOW STATUS on connect or periodically to query server health. Returning an
error or empty result breaks these integrations.
Counter architecture:
Two independent counter stores keep telemetry decoupled from correctness:
| Store | Type | Scope | Reset policy |
|---|---|---|---|
| `StatusRegistry` | `Arc<StatusRegistry>` with `AtomicU64` fields | Server-wide, shared across all connections | Only on server restart |
| `SessionStatus` | Plain `u64` fields inside `ConnectionState` | Per-connection | On `COM_RESET_CONNECTION` (which recreates `ConnectionState`) |
Database owns an Arc<StatusRegistry>. Each handle_connection task clones
the Arc once at connect time — the same pattern used by schema_version. The
SHOW STATUS intercept never acquires the Database mutex; it reads directly
from the cloned Arc<StatusRegistry> and the local SessionStatus. This means
the query cannot block other connections.
RAII guards:
```rust
// Increments threads_connected +1 after auth; drops −1 on disconnect (even on error).
let _connected_guard = ConnectedGuard::new(Arc::clone(&status));

// Increments threads_running +1 for the duration of COM_QUERY / COM_STMT_EXECUTE.
let _running = RunningGuard::new(&status);
```
threads_connected and threads_running are always accurate with no manual bookkeeping
because Rust’s drop guarantees run on early returns and panics.
Counters tracked:
| Variable name | Scope | Description |
|---|---|---|
| `Bytes_received` | Session + Global | Bytes received from client (payload + 4-byte header) |
| `Bytes_sent` | Session + Global | Bytes sent to client |
| `Com_insert` | Session + Global | INSERT statement count |
| `Com_select` | Session + Global | SELECT statement count |
| `Innodb_buffer_pool_read_requests` | Global | Best-effort mmap access counter |
| `Innodb_buffer_pool_reads` | Global | Physical page reads (compatibility alias) |
| `Questions` | Session + Global | All statements executed (any command type) |
| `Threads_connected` | Global | Active authenticated connections |
| `Threads_running` | Session + Global | Connections actively executing a command |
| `Uptime` | Global | Seconds since server start |
SHOW STATUS syntax:
All four MySQL-compatible forms are intercepted before hitting the engine:
```sql
SHOW STATUS
SHOW SESSION STATUS
SHOW LOCAL STATUS
SHOW GLOBAL STATUS

-- Any of the above with LIKE filter:
SHOW STATUS LIKE 'Com_%'
SHOW GLOBAL STATUS LIKE 'Threads%'
```
LIKE filtering reuses like_match from axiomdb-sql (proper % / _ wildcard
semantics, case-insensitive against variable names). Results are always returned in
ascending alphabetical order.
SHOW STATUS reads AtomicU64 counters directly from a cloned
Arc — it never acquires the Database mutex. MySQL InnoDB
reads status from the engine layer, which requires acquiring internal mutexes under
high concurrency. AxiomDB's design means monitoring queries cannot interfere with
query execution at any load level.
DB lock strategy
The MySQL handler stores the opened engine in Arc<tokio::sync::RwLock<Database>>.
- read-only statements acquire `db.read()`
- mutating statements and transaction control acquire `db.write()`
- multiple reads run concurrently
- all writes are still serialized at whole-database granularity
This is the current runtime model. It is more advanced than the old Phase 5
Mutex<Database> design because read-only queries can now overlap, but it is
still below MySQL/InnoDB and PostgreSQL for write concurrency because row-level
locking is not implemented yet.
AxiomDB implements the full mysql_native_password SHA1 challenge-response
handshake (the same algorithm used by MySQL 5.x clients) but ignores the password
result for users in the allowlist (root, axiomdb, admin).
This lets any MySQL-compatible client connect during development without credential
management. The verify_native_password function is fully correct — it is
called and its result logged — but the decision to accept or reject is based solely
on the username allowlist until Phase 13 (Security) adds stored credentials and real
enforcement.
caching_sha2_password (MySQL 8.0+)
MySQL 8.0 changed the default authentication plugin from mysql_native_password to
caching_sha2_password. When a client using the new default (e.g., PyMySQL ≥ 1.0,
MySQL Connector/Python, mysql2 for Ruby) connects, the server must complete a 5-packet
handshake instead of the 3-packet one:
| Seq | Direction | Packet | Notes |
|---|---|---|---|
| 0 | S → C | HandshakeV10 | includes 20-byte challenge |
| 1 | C → S | HandshakeResponse41 | auth_plugin_name = "caching_sha2_password" |
| 2 | S → C | AuthMoreData(0x03) | fast_auth_success — byte 0x03 signals that password verification is skipped in permissive mode |
| 3 | C → S | empty ack | client acknowledges the fast-auth signal before expecting OK |
| 4 | S → C | OK | connection established |
The critical implementation detail is that the ack packet at seq=3 must be read
before sending OK. If the server sends OK at seq=2 instead, the client has already
queued the empty ack in response to AuthMoreData(fast_auth_success); that buffered
ack then arrives in the command loop, where it is misread as a command (command byte
0x00 = COM_SLEEP, or simply an unknown command), and the connection closes silently —
no error is reported to the application. The fix is one extra read_packet() call
before writing OK.
axiomdb-server
Entry point for server mode. Parses CLI flags (--data-dir, --port), opens the
axiomdb-network::Database, starts a Tokio TCP listener, and spawns one
handle_connection task per accepted connection, passing each task a clone of the
Arc<RwLock<Database>>.
axiomdb-embedded
Entry point for embedded mode. Exposes:
- A safe Rust API (`Database::open`, `Database::execute`, `Database::transaction`)
- A C FFI (`axiomdb_open`, `axiomdb_execute`, `axiomdb_close`, `axiomdb_free_string`)
Query Lifecycle — From Wire to Storage
1. TCP bytes arrive on the socket
│
2. axiomdb-network::mysql::codec::MySqlCodec decodes the 4-byte header
→ (sequence_id, payload)
│
3. handler.rs inspects payload[0] (command byte)
├── 0x01 COM_QUIT → close
├── 0x02 COM_INIT_DB → OK
├── 0x0e COM_PING → OK
├── 0x16 COM_STMT_PREPARE → parse + analyze → store in PreparedStatement.analyzed_stmt → stmt_ok
├── 0x17 COM_STMT_EXECUTE → substitute_params_in_ast(cached_stmt, params) → execute_stmt() ↓ (step 9)
└── 0x03 COM_QUERY → continue ↓
│
4. intercept_special_query(sql) — ORM/driver stubs
├── match → return pre-built packet sequence (no engine call)
└── no match → continue ↓
│
5. db.read() / db.write() → execute_query(sql, &mut session)
│
6. axiomdb-sql::tokenize(sql)
→ Vec<SpannedToken> (logos DFA, zero-copy)
│
7. axiomdb-sql::parse(tokens)
→ Stmt (recursive descent; all col_idx = placeholder 0)
│
8. axiomdb-sql::analyze(stmt, storage, snapshot)
→ Stmt (col_idx resolved against catalog; names validated)
│
9. Executor interprets the analyzed Stmt
→ reads from axiomdb-index (BTree lookups / range scans)
→ calls axiomdb-types::decode_row on heap page bytes
→ builds Vec<Vec<Value>> result rows
│
10. WAL write (for INSERT / UPDATE / DELETE)
→ axiomdb-wal::WalWriter::append(WalEntry)
│
11. Heap page write (for INSERT / UPDATE / DELETE)
→ axiomdb-storage::StorageEngine::write_page
│
12. db read/write guard released
│
13. result::serialize_query_result(QueryResult, seq=1)
→ column_count + column_defs + EOF + rows + EOF (Rows)
→ OK packet with affected_rows + last_insert_id (Affected)
│
14. MySqlCodec encodes each packet with 4-byte header → TCP send
For embedded mode, steps 1–4 and 12–14 are replaced by a direct Rust function call
that returns a QueryResult struct.
Key Architectural Decisions
mmap over a custom buffer pool
AxiomDB maps the .db file with mmap. The OS page cache manages eviction (LRU) and
readahead automatically. InnoDB maintains a separate buffer pool on top of the OS page
cache, causing the same data to live in RAM twice. mmap eliminates the second copy.
Trade-off: we give up fine-grained control over eviction policy. The OS uses LRU, which is good for most database workloads. Custom eviction (e.g., clock-sweep with hot/cold separation) will be optional in a future phase.
Copy-on-Write B+ Tree
CoW means a write operation never modifies an existing page in place. Instead, it creates new pages for every node on the path from root to the modified leaf, then atomically swaps the root pointer. Readers who loaded the old root before the swap continue accessing a fully consistent old version with no locking.
Trade-off: writes amplify — modifying one leaf requires copying O(log n) pages. For a tree of depth 4 (enough for hundreds of millions of rows), this is 4 page copies per write. At 16 KB per page, that is 64 KB of write amplification per key insert.
WAL without double-write
The WAL records logical changes (key, old_value, new_value) rather than full page images. Each WAL record has a CRC32c checksum. On recovery, AxiomDB reads the WAL forward, identifies committed transactions, and replays their mutations. Pages with incorrect checksums are rebuilt from WAL records.
This eliminates MySQL’s doublewrite buffer (which writes each page twice to protect against torn writes) at the cost of a slightly more complex recovery algorithm.
logos for lexing, not nom
logos generates a compiled DFA from the token patterns at build time. The generated lexer runs in O(n) time with a fixed, small constant (typically 1–3 CPU instructions per byte). nom builds parser combinators at runtime with dynamic dispatch overhead. For a lexer processing millions of SQL statements per second, the constant factor matters: logos achieves 9–17× throughput over sqlparser-rs’s nom-based lexer.
Storage Engine
The storage engine is the lowest user-accessible layer in AxiomDB. It manages raw 16-kilobyte pages on disk or in memory, provides a freelist for page allocation, and exposes a simple trait that all higher layers depend on.
The StorageEngine Trait
```rust
pub trait StorageEngine: Send + Sync {
    fn read_page(&self, page_id: u64) -> Result<PageRef, DbError>;
    fn write_page(&self, page_id: u64, page: &Page) -> Result<(), DbError>;
    fn alloc_page(&self, page_type: PageType) -> Result<u64, DbError>;
    fn free_page(&self, page_id: u64) -> Result<(), DbError>;
    fn flush(&self) -> Result<(), DbError>;
    fn page_count(&self) -> u64;
    fn prefetch_hint(&self, start_page_id: u64, count: u64) { ... }
    fn set_current_snapshot(&self, snapshot_id: u64) { ... }
    fn deferred_free_count(&self) -> usize { ... }
}
```
All methods take &self — there is no &mut self anywhere in the trait. Mutable state
is managed entirely through interior mutability:
- `write_page`: acquires a per-page exclusive `RwLock` (from `PageLockTable`) for the duration of the `pwrite(2)` call. Two transactions writing different pages proceed in full parallelism with zero contention.
- `alloc_page`: acquires `Mutex<FreeList>` only during the bitmap scan (microseconds), then acquires the page lock to initialise the new page.
- `free_page`: acquires `Mutex<FreeList>` briefly to add the page to the free bitmap.
- `flush`: acquires `Mutex<FreeList>` to persist the freelist, then calls `fdatasync`.
This design mirrors InnoDB (buf_page_get_gen with per-page block_lock, no &mut on
the buffer pool) and PostgreSQL (per-buffer atomic state field, MarkBufferDirty is
&self-equivalent).
Both engines converged on the same shape: a `&self`-equivalent buffer pool with per-page locks.
AxiomDB follows the same pattern: a sharded PageLockTable (64 shards, one
RwLock<HashMap<u64, Arc<RwLock<()>>>> per shard) eliminates the global
&mut self bottleneck and is the architectural unlock for concurrent writer support
in phases 40.4–40.12.
read_page returns an owned PageRef — a heap-allocated copy of the 16 KB page data.
This is a deliberate change from the original &Page borrow: owned pages survive mmap
remaps (during grow()) and page reuse (after free_page), which is essential for
concurrent read/write access. The copy cost is ~0.5 µs from L2/L3 cache — the same cost
PostgreSQL pays when copying a page from the buffer pool into backend-local memory.
Page Format
Every page is exactly PAGE_SIZE = 16,384 bytes (16 KB). The first HEADER_SIZE = 64
bytes are the page header; the remaining PAGE_BODY_SIZE = 16,320 bytes are the body.
Page Header — 64 bytes
Offset Size Field Description
──────── ────── ──────────────── ──────────────────────────────────────
0 8 magic `PAGE_MAGIC` — identifies valid pages
8 1 page_type PageType enum (see below)
9 1 flags page flags (`PAGE_FLAG_ALL_VISIBLE`, future bits)
10 2 item_count item/slot count for the page-local format
12 4 checksum CRC32c of body bytes `[HEADER_SIZE..PAGE_SIZE]`
16 8 page_id This page's own ID (self-identifying)
24 8 lsn Log Sequence Number of last write
32 2 free_start First free byte offset in the body (format-specific)
34 2 free_end Last free byte offset in the body (format-specific)
36 28 _reserved Future use
Total: 64 bytes
The CRC32c checksum covers only the page body [HEADER_SIZE..PAGE_SIZE], not the
header itself. On every read_page, AxiomDB verifies the checksum and returns
DbError::ChecksumMismatch if it fails.
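CRC32c (the Castagnoli polynomial, reflected form) can be computed bitwise as a reference sketch. A real implementation would use a lookup table or the SSE4.2 `crc32` instruction; this version favors clarity:

```rust
// Bitwise CRC-32C: reflected polynomial 0x82F63B78,
// initial value 0xFFFF_FFFF, final XOR 0xFFFF_FFFF.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask = 0xFFFF_FFFF if the low bit is set, else 0.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

fn main() {
    // Standard CRC-32C check value for the ASCII string "123456789".
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);
    assert_eq!(crc32c(b""), 0);
}
```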
Page Types
```rust
pub enum PageType {
    Meta = 0,              // page 0: database header + catalog roots
    Data = 1,              // heap pages holding table rows
    Index = 2,             // current fixed-slot B+ Tree internal and leaf nodes
    Overflow = 3,          // continuation pages for large values
    Free = 4,              // freelist / unused pages
    ClusteredLeaf = 5,     // slotted clustered leaf: full PK row inline
    ClusteredInternal = 6, // slotted clustered internal: varlen separators
}
```
Clustered Page Primitives (Phase 39.1 / 39.2 / 39.3)
The clustered index rewrite is landing in the storage layer first. Two new page types now exist even though the SQL executor still uses the classic heap + secondary-index path:
- `ClusteredLeaf` — slotted page with variable-size cells storing: `key_len`, `row_len`, inline `RowHeader`, primary-key bytes, row payload bytes
- `ClusteredInternal` — slotted page with variable-size separator cells storing: `right_child`, `key_len`, separator key bytes
ClusteredInternal keeps one extra child pointer in the header as
leftmost_child, so logical child access still follows the classical B-tree
rule n keys -> n + 1 children.
ClusteredInternal body:
[16B header: is_leaf | num_cells | cell_content_start | freeblock_offset | leftmost_child]
[cell pointer array]
[free gap]
[cells: right_child | key_len | key_bytes]
That design keeps the storage primitive compatible with the current traversal contract:
- `find_child_idx(search_key)` returns the first separator strictly greater than the key
- `child_at(0)` reads `leftmost_child`
- `child_at(i > 0)` reads the `right_child` of separator cell `i - 1`
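The traversal contract can be sketched with simplified in-memory types (the struct here is illustrative, not the actual page layout):

```rust
// n separator keys give n + 1 children: child 0 is stored separately as
// leftmost_child, and child i > 0 is the right_child of cell i - 1.
struct Internal {
    leftmost_child: u64,
    cells: Vec<(Vec<u8>, u64)>, // (separator_key, right_child), sorted by key
}

impl Internal {
    /// Index of the child to descend into for `key`:
    /// position of the first separator strictly greater than the key.
    fn find_child_idx(&self, key: &[u8]) -> usize {
        self.cells
            .iter()
            .position(|(sep, _)| sep.as_slice() > key)
            .unwrap_or(self.cells.len())
    }

    fn child_at(&self, i: usize) -> u64 {
        if i == 0 { self.leftmost_child } else { self.cells[i - 1].1 }
    }
}

fn main() {
    let node = Internal {
        leftmost_child: 10,
        cells: vec![(b"m".to_vec(), 11), (b"t".to_vec(), 12)],
    };
    assert_eq!(node.child_at(node.find_child_idx(b"a")), 10); // key < "m"
    assert_eq!(node.child_at(node.find_child_idx(b"m")), 11); // "m" <= key < "t"
    assert_eq!(node.child_at(node.find_child_idx(b"z")), 12); // key >= "t"
}
```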
SQL-Visible Clustered DDL + INSERT Boundary (Phases 39.13 / 39.14)
The storage rewrite is no longer purely internal. CREATE TABLE now uses the
clustered root when the SQL definition contains an explicit PRIMARY KEY:
- `TableDef.root_page_id` is the generic primary row-store root
- `TableDef.storage_layout` tells higher layers whether that root is heap or clustered
- heap tables still allocate `PageType::Data`
- clustered tables now allocate `PageType::ClusteredLeaf`
- logical PRIMARY KEY metadata on clustered tables points at that same clustered root
The first SQL-visible clustered write paths now exist too:
- `INSERT` on explicit-`PRIMARY KEY` tables routes directly into `clustered_tree::insert(...)` or `restore_exact_row_image(...)`
- clustered `AUTO_INCREMENT` bootstraps from clustered rows instead of heap scans
- non-primary clustered indexes are maintained as PK bookmarks through `axiomdb-sql::clustered_secondary`
- `SELECT` on clustered tables now routes through `clustered_tree::lookup(...)` / `range(...)` and decodes clustered secondary bookmarks back into PK probes
- `UPDATE` on clustered tables now routes through clustered candidate discovery plus `update_in_place(...)` / `update_with_relocation(...)`
- `DELETE` on clustered tables now routes through clustered candidate discovery plus `delete_mark(...)` and exact-row-image WAL
- pending heap batches flush before the clustered statement boundary so the new clustered branch does not inherit heap staging semantics accidentally
SQL-visible clustered maintenance is now partially live:
- clustered `VACUUM` now physically purges safe dead rows and overflow chains
- `ALTER TABLE ... REBUILD` now migrates legacy heap+`PRIMARY KEY` tables into a fresh clustered root and rebuilt clustered-secondary bookmark roots
- clustered standalone `CREATE INDEX` / `ANALYZE` remain later Phase 39 work
Clustered maintenance now includes the first purge path:
- `VACUUM` walks the clustered leaf chain from the leftmost leaf
- safe delete-marked cells are physically removed from clustered leaves
- overflow chains are freed during that purge
- secondary bookmark cleanup uses clustered physical existence after leaf purge, not caller-snapshot visibility
- any secondary root rotation caused by `delete_many_in(...)` is persisted back to the catalog in the same transaction
- clustered rebuild flushes the newly built clustered / secondary roots before the catalog swap and defers old heap/index page reclamation until commit
This follows established precedent: SQLite WITHOUT ROWID inserts target the PK B-tree directly, and InnoDB treats the clustered key as the row identity. AxiomDB now does the same for SQL-visible clustered INSERT instead of manufacturing a heap row plus a compatibility index entry.
Clustered Tree Insert Controller (Phase 39.3)
axiomdb-storage::clustered_tree now builds the first tree-level write path on
top of these page primitives. The public entry point is:
```rust
pub fn insert(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    row_header: &RowHeader,
    row_data: &[u8],
) -> Result<u64, DbError>
```
The controller is still storage-first:
- Bootstrap an empty tree into a `ClusteredLeaf` root when `root_pid` is `None`.
- Descend through `ClusteredInternal` pages with `find_child_idx()`.
- Materialize a clustered leaf descriptor:
  - small rows stay fully inline
  - large rows keep a local prefix inline and spill the tail bytes to overflow pages
- Insert that descriptor into the target leaf in sorted key order.
- If the descriptor does not fit, defragment once and retry before splitting.
- Split leaves by cumulative cell byte volume, not by cell count.
- Propagate `(separator_key, right_child_pid)` upward.
- Split internal pages by cumulative separator byte volume and create a new root if the old root overflows.
Split behavior deliberately keeps the old page ID as the left half and allocates only the new right sibling. That matches the current no-concurrent-clustered-writer reality and keeps parent maintenance minimal until the later MVCC/WAL phases wire clustered pages into the full engine.
Since 39.10, rows above the local inline budget are no longer rejected. The
leaf keeps the primary key and RowHeader inline, stores only a bounded local
row prefix on-page, and spills the remaining tail bytes to a dedicated
PageType::Overflow chain.
Clustered Point Lookup (Phase 39.4)
axiomdb-storage::clustered_tree::lookup(...) is now the first read path over
the clustered tree:
```rust
pub fn lookup(
    storage: &dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    snapshot: &TransactionSnapshot,
) -> Result<Option<ClusteredRow>, DbError>
```
Lookup flow:
- Return `None` immediately when the tree has no root.
- Descend clustered internal pages with `find_child_idx()` and `child_at()`.
- Run exact-key binary search on the target clustered leaf.
- Read the leaf descriptor `(key, RowHeader, total_row_len, local_prefix, overflow_ptr?)`.
- Apply `RowHeader::is_visible(snapshot)`.
- If the row is overflow-backed, reconstruct the logical row bytes by reading the overflow-page chain.
- Return an owned `ClusteredRow` on a visible hit.
In 39.4, lookup is intentionally conservative about invisible rows: when the
current inline version fails MVCC visibility, it returns None instead of
trying to synthesize an older version. Clustered undo/version-chain traversal
for arbitrary snapshots still does not exist; 39.11 adds rollback/savepoint
restore for clustered writes, but not older-version reconstruction on reads.
Clustered Range Scan (Phase 39.5)
axiomdb-storage::clustered_tree::range(...) is now the first ordered multi-row
read path over clustered pages:
```rust
pub fn range<'a>(
    storage: &'a dyn StorageEngine,
    root_pid: Option<u64>,
    from: Bound<Vec<u8>>,
    to: Bound<Vec<u8>>,
    snapshot: &TransactionSnapshot,
) -> Result<ClusteredRangeIter<'a>, DbError>
```
Range flow:
- Return an empty iterator when the tree is empty or the bound interval is empty.
- For bounded scans, descend to the first relevant leaf with the same clustered internal-page search path used by point lookup.
- For unbounded scans, descend to the leftmost leaf.
- Start at the first in-range slot within that leaf.
- Yield owned `ClusteredRow` values in primary-key order.
- Skip current inline versions that are invisible to the supplied snapshot.
- Follow `next_leaf` to continue the scan across leaves.
- Stop immediately when the first key above the upper bound is seen.
The iterator stays lazy: it keeps only the current leaf page id, slot index, bound copies, and snapshot. It does not materialize the whole range into a temporary vector.
When the iterator advances to another leaf, it calls
StorageEngine::prefetch_hint(next_leaf_pid, 4). The 4-page window is
intentionally conservative: large enough to overlap sequential leaf reads, but
small enough not to flood the page cache while clustered scans are still an
internal storage primitive.
Like 39.4, this subphase is still honest about missing older-version
reconstruction. If a row’s current inline version is invisible, 39.5 skips
it; the new 39.11 rollback support does not change read semantics yet.
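The stop condition in the range flow is just a comparison of the current key against the upper Bound. A minimal sketch with a hypothetical helper name:

```rust
use std::ops::Bound;

/// True while `key` is still inside the upper bound of the scan.
/// Hypothetical helper mirroring the iterator's stop condition.
fn within_upper(key: &[u8], to: &Bound<Vec<u8>>) -> bool {
    match to {
        Bound::Unbounded => true,
        Bound::Included(hi) => key <= hi.as_slice(),
        Bound::Excluded(hi) => key < hi.as_slice(),
    }
}

fn main() {
    let to = Bound::Excluded(b"m".to_vec());
    assert!(within_upper(b"a", &to));
    assert!(!within_upper(b"m", &to)); // first key at/above the bound stops the scan
    assert!(within_upper(b"z", &Bound::Unbounded));
}
```

Because leaf keys are visited in order, the first key that fails this check ends the whole scan; no later leaf can contain an in-range key.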
Zero-Allocation Full Scan (scan_all_callback, Phase 39.21)
ClusteredRangeIter::next() allocates two heap buffers per row:
cell.key.to_vec() (primary key copy) and reconstruct_row_data (row bytes
copy). For a full-table scan that only needs to decode the row bytes into
Vec<Value>, both allocations are unnecessary.
scan_all_callback bypasses the iterator entirely:
pub fn scan_all_callback<F>(
    storage: &dyn StorageEngine,
    root_pid: Option<u64>,
    snapshot: &TransactionSnapshot,
    mut f: F,
) -> Result<(), DbError>
where
    F: FnMut(&[u8], Option<(u64, usize)>) -> Result<(), DbError>,
The callback receives (inline_data: &[u8], overflow):
- inline_data: a borrow of cell.row_data directly from the leaf page memory — no copy.
- overflow: Some((first_overflow_page, tail_len)) for rows that spill to overflow pages; None for rows that fit inline (the common case for most tables).
For inline rows the callback can decode inline_data in place. The caller
allocates only one Vec<Value> per visible row — the decoded output — compared
to three allocations with the iterator path.
A GROUP BY age, AVG(score) query on 50K rows of a clustered table dropped from 57 ms to 4.0 ms (a 14.25× improvement) after switching from ClusteredRangeIter to scan_all_callback. The bottleneck was ~150K heap allocations per scan (key copy + row copy + Vec<Value>). The callback path eliminates the first two, leaving only the Vec<Value> per row. AxiomDB now runs this query 1.6× faster than MariaDB (6.5 ms) and 2.2× faster than MySQL (8.9 ms) on the same hardware.
Clustered Overflow Pages (Phase 39.10)
Phase 39.10 adds the first overflow-page primitive dedicated to clustered
rows:
Leaf cell:
[key_len: u16]
[total_row_len: u32]
[RowHeader: 24B]
[key bytes]
[local row prefix]
[overflow_first_page?: u64]
Overflow page body:
[next_overflow_page: u64]
[payload bytes...]
The contract is intentionally physical:
- Keep the primary key and RowHeader inline in the clustered leaf.
- Keep only a bounded local row prefix inline.
- Spill the remaining logical row tail to PageType::Overflow pages.
- Reconstruct the full logical row only on read paths (lookup, range) or update paths that need the logical bytes.
- Let split / merge / rebalance move the physical descriptor without rewriting the overflow payload.
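The spill decision can be modeled as a pure split of the logical row into an inline prefix plus fixed-capacity overflow chunks. A sketch under assumed capacities (the real leaf and overflow budgets are page-format details this sketch does not know):

```rust
/// Split a logical row into (inline prefix, overflow chunks).
/// `prefix_cap` and `chunk_cap` stand in for the real leaf/overflow page budgets.
fn split_row(row: &[u8], prefix_cap: usize, chunk_cap: usize) -> (&[u8], Vec<&[u8]>) {
    if row.len() <= prefix_cap {
        return (row, Vec::new()); // fits inline, no overflow chain
    }
    let (prefix, mut tail) = row.split_at(prefix_cap);
    let mut chunks = Vec::new();
    while !tail.is_empty() {
        let take = tail.len().min(chunk_cap);
        let (chunk, rest) = tail.split_at(take);
        chunks.push(chunk); // one PageType::Overflow page per chunk
        tail = rest;
    }
    (prefix, chunks)
}

fn main() {
    let row = [0u8; 10];
    let (prefix, chunks) = split_row(&row, 4, 3);
    assert_eq!((prefix.len(), chunks.len()), (4, 2)); // 6-byte tail in 3-byte chunks
    let (p2, c2) = split_row(&row, 16, 3);
    assert_eq!((p2.len(), c2.len()), (10, 0)); // inline case
}
```

The chunks correspond to the next_overflow_page-linked chain in the layout above; reads reconstruct the row by concatenating prefix and chunks in order.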
Phase 39.10 itself intentionally did not introduce generic TOAST
references, compression, or crash recovery for overflow chains. 39.11 now
adds in-process clustered WAL/rollback over those row images, but clustered
crash recovery still stays in later phases.
Clustered WAL and Rollback (Phase 39.11)
Phase 39.11 adds the first WAL contract that understands clustered rows:
key = primary-key bytes
old_value = ClusteredRowImage? // exact old row image
new_value = ClusteredRowImage? // exact new row image
Where ClusteredRowImage carries:
- the latest clustered root_pid
- the exact inline RowHeader
- the exact logical row bytes, regardless of whether the row is inline or overflow-backed on page
TxnManager now tracks the latest clustered root per table_id during the
active transaction. Rollback and savepoint undo use that root plus two storage
helpers:
- delete_physical_by_key(...) to undo a clustered insert
- restore_exact_row_image(...) to undo a clustered delete-mark or update
The restore invariant is logical row state, not exact page topology. Split,
merge, or relocate-update may still leave a different physical tree shape after
rollback as long as the old primary key, RowHeader, and row bytes are back.
Phase 39.12 now extends that same contract into clustered crash recovery:
open_with_recovery() undoes in-progress clustered writes by PK + exact row
image, and open() rebuilds committed clustered roots from surviving WAL
history on a clean reopen.
Clustered Update In Place (Phase 39.6)
axiomdb-storage::clustered_tree::update_in_place(...) is now the first
clustered-row write path after insert:
pub fn update_in_place(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    new_row_data: &[u8],
    txn_id: u64,
    snapshot: &TransactionSnapshot,
) -> Result<bool, DbError>
Update flow:
- Return false when the tree is empty, the key is absent, or the current inline version is not visible to the supplied snapshot.
- Descend to the owning clustered leaf by primary key.
- Build a new inline RowHeader with txn_id_created = txn_id, txn_id_deleted = 0, and row_version = old.row_version + 1.
- Materialize a replacement descriptor: either an inline row or a local-prefix + overflow chain.
- Ask the leaf primitive to rewrite that exact cell while preserving key order.
- Persist the leaf if the rewrite stays inside the same page.
- Free the obsolete overflow chain only after a successful physical rewrite.
- Return HeapPageFull when the replacement row would require leaving the current leaf.
The leaf primitive has two rewrite modes:
- overwrite fast path when the replacement encoded cell fits the existing cell budget
- same-leaf rebuild fallback when the row grows, but the leaf can still be rebuilt compactly with the replacement row in place
Neither path changes the primary key, pointer-array order, parent separators, or
next_leaf.
This keeps the subphase honest about what now exists:
- clustered insert
- clustered point lookup
- clustered range scan
- clustered same-leaf update
- clustered delete-mark
And what still does not:
- clustered older-version reconstruction/version chains
- clustered root persistence beyond WAL checkpoint/rotation
- clustered physical purge
- clustered SQL executor integration
Clustered Delete Mark (Phase 39.7)
axiomdb-storage::clustered_tree::delete_mark(...) now adds the first logical
delete path over clustered pages:
pub fn delete_mark(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    txn_id: u64,
    snapshot: &TransactionSnapshot,
) -> Result<bool, DbError>
Delete flow:
- Return false when the tree is empty, the key is absent, or the current inline version is not visible to the supplied snapshot.
- Descend to the owning clustered leaf by primary key.
- Build a replacement RowHeader that preserves txn_id_created, row_version, and _flags, and stamps txn_id_deleted = txn_id.
- Rewrite the exact clustered cell in place while preserving key bytes and row payload bytes.
- Persist the leaf page without changing next_leaf or parent separators.
The important semantic boundary is that clustered delete is currently a
header-state transition, not space reclamation. The physical cell stays on
the leaf page so snapshots older than the delete can still observe it through
the existing RowHeader::is_visible(...) rule.
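The header-state transition can be sketched with a reduced header and a simplified visibility rule (field set and rule reduced for the sketch; the real 24-byte RowHeader and RowHeader::is_visible(...) live in the storage crate):

```rust
/// Reduced model of the clustered row header and a snapshot.
struct Header { txn_id_created: u64, txn_id_deleted: u64 }
struct Snapshot { max_committed: u64 }

/// Simplified visibility: creator committed before the snapshot,
/// and any deleter committed after it (or no deleter at all).
fn is_visible(h: &Header, s: &Snapshot) -> bool {
    h.txn_id_created <= s.max_committed
        && (h.txn_id_deleted == 0 || h.txn_id_deleted > s.max_committed)
}

fn main() {
    let mut row = Header { txn_id_created: 5, txn_id_deleted: 0 };
    let old_snap = Snapshot { max_committed: 7 };
    assert!(is_visible(&row, &old_snap));
    // delete_mark: a header-state transition only; the cell stays on the page
    row.txn_id_deleted = 9;
    assert!(is_visible(&row, &old_snap)); // the older snapshot still sees the row
    assert!(!is_visible(&row, &Snapshot { max_committed: 9 }));
}
```

This is exactly why the physical cell must stay on the leaf: the older snapshot's visibility depends on the stamped header, not on the cell's presence in a freelist.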
Clustered Structural Rebalance (Phase 39.8)
axiomdb-storage::clustered_tree::update_with_relocation(...) adds the first
clustered structural-maintenance path:
pub fn update_with_relocation(
    storage: &mut dyn StorageEngine,
    root_pid: Option<u64>,
    key: &[u8],
    new_row_data: &[u8],
    txn_id: u64,
    snapshot: &TransactionSnapshot,
) -> Result<Option<u64>, DbError>
Control flow:
- Validate that the replacement row still fits inline on a clustered leaf.
- Try update_in_place(...) first.
- If the same-leaf rewrite returns HeapPageFull, reload the visible current row and enter the structural path.
- Physically remove the exact clustered cell from the tree.
- Bubble underfull and min_changed upward:
  - repair the parent separator when a non-leftmost child changes its minimum key
  - redistribute or merge clustered leaf siblings by encoded byte volume
  - redistribute or merge clustered internal siblings while preserving n keys -> n + 1 children
- Collapse an empty internal root to its only child.
- Reinsert the replacement row with a bumped row_version.
The key design boundary is that 39.8 introduces private structural delete
only for relocate-update. Public clustered delete is still delete_mark(...),
so snapshot-safe purge remains a later concern.
Current limitations:
- delete_mark(...) still keeps dead clustered cells inline; 39.8 does not expose purge to SQL or storage callers yet.
- relocate-update still rewrites only the current inline version.
- parent separator repair currently assumes the repaired separator still fits in the existing internal page budget; split-on-separator-repair is deferred.
Clustered Secondary Bookmarks (Phase 39.9)
Phase 39.9 adds the first clustered-first secondary-index layout in
axiomdb-sql/src/clustered_secondary.rs.
The physical key is:
secondary_logical_key ++ missing_primary_key_columns
Where:
- secondary_logical_key is the ordered value vector of the secondary index columns.
- missing_primary_key_columns are only the PK columns that are not already present in the secondary key.
That means the physical secondary entry now carries enough information to
recover the owning clustered row by primary key without depending on a heap
RecordId.
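The key composition can be sketched as a pure function over named columns (names and raw concatenation are illustrative only; the real helpers use an order-preserving encoding):

```rust
/// Compose a physical secondary key: the logical secondary columns,
/// followed only by PK columns not already covered by the index.
/// Columns are modeled as (name, bytes) pairs for the sketch.
fn bookmark_key(secondary: &[(&str, &[u8])], primary: &[(&str, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    for (_, v) in secondary {
        out.extend_from_slice(v);
    }
    for (name, v) in primary {
        // append only PK columns the secondary key does not already contain
        if !secondary.iter().any(|(n, _)| n == name) {
            out.extend_from_slice(v);
        }
    }
    out
}

fn main() {
    // index on (email), PK (id): physical key = email ++ id
    let k = bookmark_key(&[("email", b"a@x".as_slice())], &[("id", b"\x01".as_slice())]);
    assert_eq!(k, b"a@x\x01");
    // index on (id, email), PK (id): id is already present, nothing is appended
    let k2 = bookmark_key(
        &[("id", b"\x01".as_slice()), ("email", b"a@x".as_slice())],
        &[("id", b"\x01".as_slice())],
    );
    assert_eq!(k2, b"\x01a@x");
}
```

Either way, the primary key is fully recoverable from the physical entry, which is what removes the dependence on a heap RecordId.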
The dedicated helpers now provide:
- layout derivation from (secondary_idx, primary_idx)
- encode/decode of bookmark-bearing secondary keys
- logical-prefix bounds without a fixed 10-byte RID suffix
- insert/delete/update maintenance where relocate-only updates become no-ops if the logical secondary key and primary key stay stable
Current boundary:
- this path is not wired into the heap-backed SQL executor yet
- FK enforcement and index-integrity rebuilds still use the old RecordId-based secondary path
- the legacy RecordId payload in axiomdb-index::BTree remains only a compatibility artifact for this path
MmapStorage — Memory-Mapped File
MmapStorage uses a hybrid I/O model inspired by SQLite: read-only mmap for reads,
pwrite() for writes. The mmap is opened with memmap2::Mmap (not MmapMut),
making it structurally impossible to write through the mapped region.
Physical file (axiomdb.db):
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │
│ (Meta) │ (Data) │ (Index) │ (Data) │ │
└──────────┴──────────┴──────────┴──────────┴──────────┘
↑ ↑ ↓
│ └── read_page(1): copy 16KB from mmap → owned PageRef
└── mmap (read-only, MAP_SHARED)
write_page(3): pwrite() to file descriptor
Read path: mmap + PageRef copy
read_page(page_id) computes mmap_ptr + page_id * 16384, copies 16 KB into a
heap-allocated PageRef, verifies the CRC32c checksum, and returns the owned copy.
The copy cost (~0.5 us from L2/L3 cache) is the same price PostgreSQL pays when
copying a buffer pool page into backend-local memory.
Write path: pwrite() to file descriptor
write_page(page_id, page) calls pwrite() on the underlying file descriptor at
offset page_id * 16384. The mmap (MAP_SHARED) automatically reflects the change
on subsequent reads. Note that a 16 KB pwrite() is not crash-atomic on 4 KB-block
filesystems — the Doublewrite Buffer protects against torn pages.
Flush: doublewrite + fsync
flush() follows a two-phase write protocol:
- Doublewrite phase: all dirty pages (plus pages 0 and 1) are serialized to a .dw file and fsynced. This creates a durable copy of the committed state.
- Main fsync: the freelist is pwritten (if modified) and the main .db file is fsynced. If this fsync is interrupted by a crash, the .dw file provides repair data on the next startup.
- Cleanup: the .dw file is deleted. If deletion fails, the next open() finds all pages valid and removes it.
Trade-offs:
- We cannot control which pages stay hot in memory (the OS uses LRU).
- On 32-bit systems, the address space limits the maximum database size. On 64-bit, the address space is effectively unlimited.
- PageRef copies add ~0.5 us per page read vs. direct pointer access, but this eliminates use-after-free risks from mmap remap and page reuse.
Deferred Page Free Queue
When free_page(page_id) is called, the page does not return to the freelist
immediately. Instead it enters an epoch-tagged queue: deferred_frees: Vec<(page_id, freed_at_snapshot)>. Each entry records the snapshot epoch at which the page became
unreachable. release_deferred_frees(oldest_active_snapshot) only releases pages
whose freed_at_snapshot <= oldest_active_snapshot — pages freed more recently remain
queued because a concurrent reader might still hold a snapshot that references them.
Under the current Arc<RwLock<Database>> architecture, flush() passes u64::MAX
(release all) because the writer holds exclusive access and no readers are active.
When snapshot slot tracking is added (Phase 7.8), the actual oldest active snapshot
will be used instead. The queue is capped at 4096 entries with a tracing warning
to detect snapshot leaks.
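The epoch gate can be sketched directly from the description above (names follow the text; the sketch ignores the freelist itself and the 4096-entry cap):

```rust
/// Deferred frees: pages that became unreachable at a snapshot epoch.
struct DeferredFrees {
    queue: Vec<(u64, u64)>, // (page_id, freed_at_snapshot)
}

impl DeferredFrees {
    fn free_page(&mut self, page_id: u64, epoch: u64) {
        self.queue.push((page_id, epoch));
    }

    /// Release only pages no active snapshot can still reference.
    fn release_deferred_frees(&mut self, oldest_active_snapshot: u64) -> Vec<u64> {
        let (ready, keep): (Vec<_>, Vec<_>) = self
            .queue
            .drain(..)
            .partition(|&(_, e)| e <= oldest_active_snapshot);
        self.queue = keep; // newer frees stay queued
        ready.into_iter().map(|(pid, _)| pid).collect()
    }
}

fn main() {
    let mut df = DeferredFrees { queue: Vec::new() };
    df.free_page(7, 10);
    df.free_page(8, 20);
    assert_eq!(df.release_deferred_frees(15), vec![7]); // epoch 20 stays queued
    assert_eq!(df.queue.len(), 1);
    assert_eq!(df.release_deferred_frees(u64::MAX), vec![8]); // flush(): release all
}
```

Passing u64::MAX models today's exclusive-writer flush; once snapshot slot tracking lands, the argument becomes the real oldest active snapshot.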
Doublewrite Buffer
A 16 KB pwrite() is not crash-atomic on any modern filesystem with 4 KB
internal blocks (APFS, ext4, XFS, ZFS). A power failure mid-write leaves a torn
page: the first N×4 KB contain new data, the remainder holds the previous state.
CRC32c detects this corruption on startup, but without a repair source the database
cannot open.
The doublewrite (DW) buffer solves this. Before every flush(), all dirty pages
are serialized to a .dw file alongside the main .db file:
database.db ← main data file
database.db.dw ← doublewrite buffer (transient, exists only during flush)
DW File Format
[Header: 16 bytes]
magic: "AXMDBLWR" (8 bytes)
version: u32 LE = 1
slot_count: u32 LE
[Slots: slot_count × 16,392 bytes each]
page_id: u64 LE
page_data: [u8; 16384]
[Footer: 8 bytes]
file_crc: CRC32c(header || all slots)
sentinel: 0xDEAD_BEEF
Flush Protocol
1. Collect dirty pages + pages 0 and 1 from the mmap
2. Write all to .dw file → single sequential write
3. fsync .dw file ← committed copy durable
4. pwrite freelist to main file
5. fsync main file ← main data durable
6. Delete .dw file ← cleanup (non-fatal on failure)
Startup Recovery
On MmapStorage::open(), if a .dw file exists:
- Validate the DW file (magic, version, size, CRC, sentinel)
- For each slot: read the corresponding page from the main file
- If CRC is invalid (torn page) → restore from DW copy
- fsync the main file → repairs durable
- Delete the .dw file
Recovery is idempotent: if interrupted, the DW file is still valid and the next startup reruns recovery. Pages already repaired have valid CRCs and are skipped.
MySQL 8.0.20 moved InnoDB's doublewrite buffer out of the system tablespace into dedicated #ib_*.dblwr files for better sequential I/O and zero impact on the tablespace format. AxiomDB follows this newer approach: the DW file is sequential-write-only, does not change the main file format, and requires no migration for existing databases.
Dirty Page Tracking and Targeted Flush
MmapStorage tracks every page written since the last flush() in a
PageDirtyTracker (an in-memory HashSet<u64>). On flush(), instead of
calling mmap.flush() (which issues msync over the entire file), AxiomDB
coalesces the dirty page IDs into contiguous runs and issues one flush_range
call per run.
Coalescing algorithm
PageDirtyTracker::contiguous_runs() sorts the dirty IDs and merges adjacent
IDs into (start_page, run_length) pairs:
// Dirty pages: {2, 3, 5, 6, 7} → runs: [(2, 2), (5, 3)]
// Byte ranges: [(2*16384, 32768), (5*16384, 49152)]
The merge is O(n log n) on the number of dirty pages and produces the minimum
number of msync syscalls for any given dirty set.
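A self-contained sketch of the coalescing step (the real PageDirtyTracker wraps a HashSet<u64>; only the sort-and-merge logic is shown):

```rust
/// Sort dirty page ids and merge adjacent ids into (start_page, run_length) pairs.
fn contiguous_runs(dirty: &mut Vec<u64>) -> Vec<(u64, u64)> {
    dirty.sort_unstable(); // the O(n log n) term
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for &pid in dirty.iter() {
        match runs.last_mut() {
            // extend the current run when the id is adjacent to its end
            Some((start, len)) if *start + *len == pid => *len += 1,
            _ => runs.push((pid, 1)),
        }
    }
    runs
}

fn main() {
    let mut dirty = vec![5, 2, 7, 3, 6];
    let runs = contiguous_runs(&mut dirty);
    assert_eq!(runs, vec![(2, 2), (5, 3)]);
    // one flush_range per run: (start * 16384, len * 16384) byte ranges
    let ranges: Vec<(u64, u64)> = runs.iter().map(|&(s, l)| (s * 16384, l * 16384)).collect();
    assert_eq!(ranges, vec![(32768, 32768), (81920, 49152)]);
}
```

Each run maps to exactly one msync-able byte range, so the syscall count equals the number of gaps in the dirty set plus one.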
Freelist integration
When the freelist changes (alloc_page, free_page), freelist_dirty is set.
On flush(), the freelist bitmap is serialized into page 1 first, and page 1 is
added to the effective flush set even if it was not already in the dirty tracker.
Only after all targeted flushes succeed are freelist_dirty and the dirty
tracker cleared. A partial failure leaves both intact so the next flush() can
retry safely.
Disk-full error classification
Every durable I/O call in flush() (and in create()/grow()) passes its
std::io::Error through classify_io() before returning:
// axiomdb-core/src/error.rs
pub fn classify_io(err: std::io::Error, operation: &'static str) -> DbError {
    // ENOSPC (28) and EDQUOT (69/122) → DbError::DiskFull { operation }
    // All other errors → DbError::Io(err)
}
When a DiskFull error propagates out of MmapStorage, the server runtime
transitions to read-only degraded mode — all subsequent mutating statements
are rejected immediately without re-entering the storage layer.
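A runnable sketch of the classification rule, with a reduced DbError standing in for the real enum in axiomdb-core:

```rust
use std::io;

/// Reduced error model for the sketch.
#[derive(Debug, PartialEq)]
enum DbError {
    DiskFull { operation: &'static str },
    Io(io::ErrorKind),
}

/// ENOSPC and EDQUOT become DiskFull; everything else stays Io.
fn classify_io(err: io::Error, operation: &'static str) -> DbError {
    match err.raw_os_error() {
        // 28 = ENOSPC; EDQUOT is 122 on Linux, 69 on macOS/BSD
        Some(28) | Some(122) | Some(69) => DbError::DiskFull { operation },
        _ => DbError::Io(err.kind()),
    }
}

fn main() {
    let enospc = io::Error::from_raw_os_error(28);
    assert_eq!(
        classify_io(enospc, "flush"),
        DbError::DiskFull { operation: "flush" }
    );
    let denied = io::Error::from_raw_os_error(13); // EACCES
    assert_eq!(classify_io(denied, "flush"), DbError::Io(io::ErrorKind::PermissionDenied));
}
```

Matching on raw_os_error (rather than ErrorKind) is what keeps the classification stable across platforms where the kind mapping differs.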
Invariants
- flush() returns Ok(()) only after all dirty pages are durable.
- Dirty tracking is cleared only on success — never on failure.
- The freelist page (page 1) is always included when freelist_dirty is set, regardless of whether it appears in the tracker.
- dirty_page_count() always reflects the count since the last successful flush.
- ENOSPC/EDQUOT errors are always surfaced as DbError::DiskFull, never silently wrapped in DbError::Io.
Verified Open — Corruption Detection at Startup
MmapStorage::open() validates every allocated page before making the
database available. The startup sequence is:
- Map the file and verify page 0 (meta) — magic, version, page count.
- Load the freelist from page 1 and verify its checksum.
- Scan pages 2..page_count, skipping any page the freelist marks as free. For each allocated page, call read_page_from_mmap(), which re-computes the CRC32c of the body and compares it to the stored header.checksum.
for page_id in 2..page_count {
    if !freelist.is_free(page_id) {
        Self::read_page_from_mmap(&mmap, page_id)?;
    }
}
If any page fails, open() returns DbError::ChecksumMismatch { page_id, expected, got }
immediately. No connection is accepted and no Db handle is returned.
Free pages are skipped because they are never written by the storage engine and therefore have no valid page header or checksum. Scanning them would produce false positives on a freshly created or partially filled database.
Recovery wiring
Both the network server (Database::open) and the embedded handle (Db::open)
route through TxnManager::open_with_recovery() on every reopen:
let (txn, _recovery) = TxnManager::open_with_recovery(&mut storage, &wal_path)?;
This ensures WAL replay runs before the first query is executed, even if the
only change in this subphase is the corruption scan. Bypassing
open_with_recovery() with the older TxnManager::open() was an oversight
that this subphase closes.
MemoryStorage — In-Memory for Tests
MemoryStorage stores pages in a Vec<Box<Page>>. It implements the same
StorageEngine trait as MmapStorage. All unit tests for the B+ Tree, WAL,
and catalog use MemoryStorage, so they run without touching the filesystem.
let mut storage = MemoryStorage::new();
let id = storage.alloc_page(PageType::Data)?;
let mut page = Page::new(PageType::Data, id);
page.body_mut()[0] = 0xAB;
page.update_checksum();
storage.write_page(id, &page)?;
let read_back = storage.read_page(id)?;
assert_eq!(read_back.body()[0], 0xAB);
FreeList — Page Allocation
The FreeList tracks which pages are free using a bitmap. The bitmap is stored in a
dedicated page (or pages, for large databases). Each bit corresponds to one page:
1 = free, 0 = in use.
Allocation
Scans left-to-right for the first 1 bit, clears it, and returns the page ID.
Bitmap: 1110 1101 ...
↑
First free: page 0 (bit 0 = 1)
After allocation: 0110 1101 ...
Deallocation
Sets the bit corresponding to page_id back to 1. Returns
DbError::DoubleFree if the bit was already 1 (guard against bugs in the
caller).
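Both operations fit in a small bitmap sketch, including the double-free guard. This is a minimal model (u8 words, LSB-first bit order within each byte, a &'static str standing in for DbError::DoubleFree); the diagram above prints bits left to right, which this sketch does not reproduce:

```rust
/// 1 = free, 0 = in use. Bit i of byte i/8 tracks page i.
struct FreeBitmap { bits: Vec<u8> }

impl FreeBitmap {
    /// First-fit scan: find the first set bit, clear it, return the page id.
    fn alloc(&mut self) -> Option<u64> {
        for (byte_idx, byte) in self.bits.iter_mut().enumerate() {
            if *byte != 0 {
                let bit = byte.trailing_zeros(); // lowest set bit
                *byte &= !(1u8 << bit);
                return Some(byte_idx as u64 * 8 + bit as u64);
            }
        }
        None
    }

    /// Set the bit back to 1; report a double free if it already was.
    fn free(&mut self, page_id: u64) -> Result<(), &'static str> {
        let (byte, bit) = ((page_id / 8) as usize, (page_id % 8) as u32);
        if self.bits[byte] & (1u8 << bit) != 0 {
            return Err("DoubleFree"); // stands in for DbError::DoubleFree
        }
        self.bits[byte] |= 1u8 << bit;
        Ok(())
    }
}

fn main() {
    let mut fl = FreeBitmap { bits: vec![0b0000_0101] }; // pages 0 and 2 free
    assert_eq!(fl.alloc(), Some(0));
    assert_eq!(fl.alloc(), Some(2));
    assert_eq!(fl.alloc(), None);
    assert_eq!(fl.free(2), Ok(()));
    assert_eq!(fl.free(2), Err("DoubleFree"));
}
```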
Invariants
- No page appears twice in the freelist.
- No page can be both allocated and in the freelist simultaneously.
- The freelist bitmap is itself stored in allocated pages (and tracked recursively during bootstrap).
Heap Pages — Slotted Format
Table rows (heap tuples) are stored in PageType::Data pages using a slotted page
layout. The slot array grows from the start of the body; tuples grow from the end
toward the center.
Body (16,320 bytes):
┌─────────────────────────────────────────────────────────────┐
│ Slot[0] │ Slot[1] │ ... │ free space │ ... │ Tuple[1] │ Tuple[0] │
└──────────────────────────────────────────────────────────────┘
↑ ↑ ↑
free_start free area free_end (decreases)
free_start points to the first unused byte after the last slot entry.
free_end points to the first byte of the last tuple written (counting from the
end of the body).
SlotEntry — 4 bytes
Offset Size Field
0 2 offset — byte offset of the tuple within the body (0 = empty slot)
2 2 length — total length of the tuple in bytes
A slot with offset = 0 and length = 0 is an empty (deleted) slot. Deleted slots
are reused when the page is compacted (VACUUM, planned Phase 9).
RowHeader — 24 bytes
Every heap tuple begins with a RowHeader that stores MVCC visibility metadata:
Offset Size Field
0 8 xmin — txn_id of the transaction that inserted this row
8 8 xmax — txn_id of the transaction that deleted/updated this row (0 = live)
16 1 deleted — 1 if this row has been logically deleted
17 7 _pad — alignment
Total: 24 bytes
After the RowHeader comes the null bitmap and the encoded column data (see Row Codec).
Null Bitmap in Heap Rows
The null bitmap is stored immediately after the RowHeader. It occupies
ceil(n_cols / 8) bytes. Bit i (zero-indexed) being 1 means column i is NULL.
5 columns → ceil(5/8) = 1 byte = 8 bits (bits 5-7 unused, always 0)
11 columns → ceil(11/8) = 2 bytes
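The bitmap arithmetic in runnable form (helper names hypothetical):

```rust
/// Bytes needed for the null bitmap of an n-column row.
fn null_bitmap_len(n_cols: usize) -> usize {
    (n_cols + 7) / 8 // ceil(n_cols / 8)
}

/// Bit i set means column i is NULL.
fn is_null(bitmap: &[u8], col: usize) -> bool {
    bitmap[col / 8] & (1u8 << (col % 8)) != 0
}

fn main() {
    assert_eq!(null_bitmap_len(5), 1);  // bits 5-7 unused, always 0
    assert_eq!(null_bitmap_len(11), 2);
    let bitmap = [0b0000_0100u8]; // column 2 is NULL
    assert!(is_null(&bitmap, 2));
    assert!(!is_null(&bitmap, 0));
}
```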
Page 0 — The Meta Page
Page 0 is the PageType::Meta page. It is written during database creation
(bootstrap) and read during open(). Its body contains:
Offset Size Field
0 8 format_version — AxiomDB file format version
8 8 catalog_root_page — Page ID of the catalog root (axiom_tables B+ Tree root)
16 8 freelist_root_page — Page ID of the freelist bitmap root
24 8 next_txn_id — Next transaction ID to assign
32 8 checkpoint_lsn — LSN of the last successful checkpoint
40 rest _reserved — Future extensions
On crash recovery, the checkpoint_lsn tells the WAL reader where to start replaying.
All WAL entries with LSN > checkpoint_lsn and belonging to committed transactions
are replayed.
Batch Delete Operations
AxiomDB implements three optimizations for DELETE workloads that dramatically reduce page I/O and CRC32c computation overhead.
HeapChain::delete_batch()
delete_batch() accepts a slice of (page_id, slot_id) pairs and groups them by
page_id before touching any page. For each unique page it reads the page once,
marks all targeted slots dead in a single pass, then writes the page back once.
Naive per-row delete path (before delete_batch):
for each of N rows:
read_page(page_id) ← 1 read
mark slot dead ← 1 mutation
update_checksum(page) ← 1 CRC32c over 16 KB
write_page(page_id, page) ← 1 write
Total: 3N page operations
Batch path (delete_batch):
group rows by page_id → P unique pages
for each page:
read_page(page_id) ← 1 read
mark all M slots dead ← M mutations (M rows on this page)
update_checksum(page) ← 1 CRC32c (once per page, not per row)
write_page(page_id, page) ← 1 write
Total: 2P page operations
At 200 rows/page, deleting 10,000 rows hits 50 pages. The naive path requires 30,000
page operations; delete_batch() requires 100.
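The arithmetic follows directly from grouping RIDs by page id, as delete_batch() does before touching any page. A sketch (BTreeMap chosen only for deterministic ordering):

```rust
use std::collections::BTreeMap;

/// Group (page_id, slot_id) pairs by page and return the resulting
/// page-operation count (1 read + 1 write per unique page).
fn batch_page_ops(rids: &[(u64, u16)]) -> (BTreeMap<u64, Vec<u16>>, usize) {
    let mut by_page: BTreeMap<u64, Vec<u16>> = BTreeMap::new();
    for &(page_id, slot_id) in rids {
        by_page.entry(page_id).or_default().push(slot_id);
    }
    let ops = 2 * by_page.len(); // read once + write once per page
    (by_page, ops)
}

fn main() {
    // 10,000 rows at 200 rows/page → 50 unique pages
    let rids: Vec<(u64, u16)> = (0u64..10_000).map(|i| (i / 200, (i % 200) as u16)).collect();
    let (by_page, ops) = batch_page_ops(&rids);
    assert_eq!(by_page.len(), 50);
    assert_eq!(ops, 100);            // batch path
    assert_eq!(3 * rids.len(), 30_000); // naive path from the text
}
```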
mark_deleted() vs delete_tuple() — Splitting Checksum Work
heap::mark_deleted() is an internal function that stamps the slot as dead without
recomputing the page checksum. delete_tuple() (the single-row public API) calls
mark_deleted() followed immediately by update_checksum() — behavior is unchanged
for callers.
The batch path calls mark_deleted() N times (once per slot on a given page), then
calls update_checksum() exactly once when all slots on that page are done.
// Single-row path (public, unchanged):
pub fn delete_tuple(page: &mut Page, slot_id: u16) -> Result<(), DbError> {
    mark_deleted(page, slot_id)?; // stamp dead
    page.update_checksum();       // 1 CRC32c
    Ok(())
}

// Batch path (called by delete_batch for each page):
for &slot_id in slots_on_this_page {
    mark_deleted(page, slot_id)?; // stamp dead, no checksum
}
page.update_checksum(); // 1 CRC32c for all N slots on this page
Splitting mark_deleted from update_checksum makes the cost O(P) in the number of pages, not O(N) in the number of rows. The same split was applied to insert_batch in Phase 3.17.
scan_rids_visible()
HeapChain::scan_rids_visible() is a variant of scan_visible() that returns only
(page_id, slot_id) pairs — no row data is decoded or copied.
pub fn scan_rids_visible(
    &self,
    storage: &dyn StorageEngine,
    snapshot: &TransactionSnapshot,
    self_txn_id: u64,
) -> Result<Vec<(u64, u16)>, DbError>
This is used by DELETE without a WHERE clause and TRUNCATE TABLE: both operations
need to locate every live slot but neither needs to decode the row’s column values.
Avoiding Vec<u8> allocation for each row’s payload cuts memory allocation to near
zero for full-table deletes.
HeapChain::clear_deletions_by_txn()
clear_deletions_by_txn(txn_id) is the undo helper for WalEntry::Truncate. It
scans the entire heap chain and, for every slot where txn_id_deleted == txn_id,
clears the deletion stamp (sets txn_id_deleted = 0, deleted = 0).
This is used during ROLLBACK and crash recovery when a WalEntry::Truncate must be
undone. The cost is O(P) page reads and writes for P pages in the chain — identical
to a full-table scan. Because recovery and rollback are infrequent relative to inserts
and deletes, this trade-off is acceptable (see WAL internals for the corresponding
WalEntry::Truncate design decision).
All-Visible Page Flag (Optimization A)
What it is
Bit 0 of PageHeader.flags (PAGE_FLAG_ALL_VISIBLE = 0x01). When set, it
asserts that every alive slot on the page was inserted by a committed transaction
and none have been deleted. Sequential scans can skip per-slot MVCC
txn_id_deleted tracking for those pages entirely.
Inspired by PostgreSQL’s all-visible map (src/backend/storage/heap/heapam.c:668),
but implemented as an in-page bit rather than a separate VM file — a single
cache-line read suffices.
API
pub const PAGE_FLAG_ALL_VISIBLE: u8 = 0x01;

impl Page {
    pub fn is_all_visible(&self) -> bool { ... }  // reads bit 0 of flags
    pub fn set_all_visible(&mut self) { ... }     // sets bit 0; caller updates checksum
    pub fn clear_all_visible(&mut self) { ... }   // clears bit 0; caller updates checksum
}
Lazy-set during scan
HeapChain::scan_visible() sets the flag after verifying that all alive slots
on a page satisfy:
- txn_id_created <= max_committed (committed transaction)
- txn_id_deleted == 0 (not deleted)
This is a one-time write per page per table lifetime. After the first slow-path scan, every subsequent scan takes the fast path and skips per-slot checks.
Clearing on delete
heap::mark_deleted() clears the flag unconditionally as its very first
mutation — before stamping txn_id_deleted. Both changes land in the same
update_checksum() + write_page() call. There is no window where the flag is
set while a slot is deleted.
Read-only variant for catalog scans
HeapChain::scan_visible_ro() takes &dyn StorageEngine (immutable) and never
sets the flag. Used by CatalogReader and other callers that hold only a shared
reference. Catalog tables are small (a few pages) and not hot enough to warrant
the lazy-set write.
Sequential Scan Prefetch Hint (Optimization C)
What it is
StorageEngine::prefetch_hint(start_page_id, count) — a hint method telling
the backend that pages starting at start_page_id will be read sequentially.
Implementations that do not support prefetch provide a default no-op.
Inspired by PostgreSQL’s read_stream.c adaptive lookahead.
API
// Default no-op in the trait — all existing backends compile unchanged
fn prefetch_hint(&self, start_page_id: u64, count: u64) {}
MmapStorage overrides this with madvise(MADV_SEQUENTIAL) on macOS and Linux:
#[cfg(any(target_os = "linux", target_os = "macos"))]
fn prefetch_hint(&self, start_page_id: u64, count: u64) {
    // SAFETY: ptr derived from live MmapMut, offset < mmap_len verified,
    // clamped_len <= mmap_len - offset. madvise is a pure hint.
    let _ = unsafe { libc::madvise(ptr, clamped_len, libc::MADV_SEQUENTIAL) };
}
count = 0 uses the backend default (PREFETCH_DEFAULT_PAGES = 64, 1 MB).
Call sites
HeapChain::scan_visible(), scan_rids_visible(), and delete_batch() each
call storage.prefetch_hint(root_page_id, 0) once before their scan loop. This
tells the OS kernel to begin async read-ahead for the pages that follow,
overlapping disk I/O with CPU processing of the current page.
When it helps
The hint has measurable impact on cold-cache workloads (data not in OS page
cache). On warm cache (mmap pages already faulted in), madvise is accepted
but the kernel takes no additional action — no performance regression.
Lazy Column Decode (Optimization B)
What it is
decode_row_masked(bytes, schema, mask) — a variant of decode_row that accepts
a boolean mask. When mask[i] == false, the column’s wire bytes are skipped
(cursor advanced, no allocation) and Value::Null is placed in the output slot.
Inspired by PostgreSQL’s selective column access in the executor.
API
pub fn decode_row_masked(
    bytes: &[u8],
    schema: &[DataType],
    mask: &[bool], // mask.len() must equal schema.len()
) -> Result<Vec<Value>, DbError>
For skipped columns:
- Fixed-length types (Bool=1B, Int/Date=4B, BigInt/Real/Timestamp=8B, Decimal=17B, Uuid=16B): ensure_bytes is called, then pos advances — no allocation.
- Variable-length types (Text, Bytes): the 3-byte length prefix is read to advance pos by 3 + len — the payload is never copied or parsed.
- NULL columns (bitmap bit set): no wire bytes, cursor unchanged regardless of mask.
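A reduced runnable model of the masked decode, with just two wire types (a 4-byte Int and a length-prefixed Text; this sketch uses a 2-byte length prefix and omits the null bitmap, unlike the real codec):

```rust
/// Reduced wire types for the sketch.
#[derive(Clone, Copy)]
enum Ty { Int, Text }

#[derive(Debug, PartialEq)]
enum Val { Null, Int(i32), Text(String) }

/// Decode a row, skipping (cursor-advance only) columns with mask[i] == false.
fn decode_row_masked(bytes: &[u8], schema: &[Ty], mask: &[bool]) -> Vec<Val> {
    let mut pos = 0;
    let mut out = Vec::with_capacity(schema.len());
    for (i, ty) in schema.iter().enumerate() {
        match ty {
            Ty::Int => {
                if mask[i] {
                    let v = i32::from_le_bytes(bytes[pos..pos + 4].try_into().unwrap());
                    out.push(Val::Int(v));
                } else {
                    out.push(Val::Null); // skipped: no parse, no allocation
                }
                pos += 4;
            }
            Ty::Text => {
                let len = u16::from_le_bytes(bytes[pos..pos + 2].try_into().unwrap()) as usize;
                pos += 2;
                if mask[i] {
                    out.push(Val::Text(String::from_utf8(bytes[pos..pos + len].to_vec()).unwrap()));
                } else {
                    out.push(Val::Null); // payload never copied or parsed
                }
                pos += len;
            }
        }
    }
    out
}

fn main() {
    // row: Int(7), Text("hi"), Int(9)
    let mut row = 7i32.to_le_bytes().to_vec();
    row.extend_from_slice(&2u16.to_le_bytes());
    row.extend_from_slice(b"hi");
    row.extend_from_slice(&9i32.to_le_bytes());
    let vals = decode_row_masked(&row, &[Ty::Int, Ty::Text, Ty::Int], &[true, false, true]);
    assert_eq!(vals, vec![Val::Int(7), Val::Null, Val::Int(9)]);
}
```

The point of the sketch is the cursor discipline: every column advances pos, but only masked-in columns pay for parsing or copying.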
Column mask computation
The executor computes the mask via collect_column_refs(expr, mask), which walks
the AST and marks every Expr::Column { col_idx } reference. It does not recurse
into subquery bodies (different row scope).
SELECT * (Wildcard/QualifiedWildcard) always produces None — decode_row()
is used directly with no overhead.
When all mask bits are true, scan_table also uses decode_row() directly.
Where it applies
- execute_select_ctx (single-table SELECT): mask covers SELECT list + WHERE + ORDER BY + GROUP BY + HAVING
- execute_delete_ctx (DELETE with WHERE): mask covers the WHERE clause only (the no-WHERE path uses scan_rids_visible — no decode at all)
Clustered Leaf Page-Buffer Mutation Primitives (Phase 39.22)
Three public primitives in crates/axiomdb-storage/src/clustered_leaf.rs enable
zero-allocation in-place UPDATE for fixed-size columns.
cell_row_data_abs_off
pub fn cell_row_data_abs_off(page: &Page, cell_idx: usize) -> Result<(usize, usize), DbError>
Computes the absolute byte offset of row_data within the page buffer for a
given cell index without decoding the cell. Returns (row_data_abs_off, key_len).
Formula:
row_data_abs_off = HEADER_SIZE + body_off + CELL_META_SIZE + ROW_HEADER_SIZE + key_len
Used by the UPDATE fast path to locate field bytes directly in the page buffer —
no cell.row_data.to_vec() required.
patch_field_in_place
pub fn patch_field_in_place(page: &mut Page, field_abs_off: usize, new_bytes: &[u8]) -> Result<(), DbError>
Overwrites new_bytes.len() bytes at field_abs_off within the page buffer.
Validates that field_abs_off + new_bytes.len() <= PAGE_SIZE. This is the
AxiomDB equivalent of InnoDB’s btr_cur_upd_rec_in_place().
btr_cur_upd_rec_in_place writes only changed bytes within the B-tree page buffer. AxiomDB implements the same technique with a pure-Rust zero-unsafe byte-write primitive. For UPDATE t SET score = score + 1 on a 25K-row clustered table, this reduces per-row work from ~469 bytes (full decode + encode + heap alloc) to ~28 bytes (read field + write field), cutting allocations from 5 per row to zero.
update_row_header_in_place
pub fn update_row_header_in_place(page: &mut Page, cell_idx: usize, new_header: &RowHeader) -> Result<(), DbError>
Overwrites the 24-byte RowHeader at the exact page offset for a given cell.
Used after patch_field_in_place to stamp the new txn_id_created and
incremented row_version without re-encoding the full cell.
Split-Phase Pattern (Rust Borrow Checker Compatibility)
The UPDATE fast path uses a split-phase read/write pattern to satisfy the Rust borrow checker — the immutable page borrow (read phase) must be fully dropped before the mutable borrow (write phase) begins:
// Read phase: immutable borrow — compute field locations, capture old bytes
let (row_data_abs_off, _) = cell_row_data_abs_off(&page, idx)?;
let (field_writes, any_change) = {
    let b = page.as_bytes();
    // ... compute loc, capture old_buf: [u8;8], encode new_buf: [u8;8]
    // MAYBE_NOP: if old_buf[..loc.size] == new_buf[..loc.size] { skip }
    (field_writes_vec, changed)
}; // immutable borrow dropped here
if !any_change { continue; }
// Write phase: mutable borrow — patch page buffer directly
for (field_abs, size, _, new_buf) in &field_writes {
    patch_field_in_place(&mut page, *field_abs, &new_buf[..*size])?;
}
update_row_header_in_place(&mut page, idx, &new_header)?;
This split-phase pattern avoids the per-row cell.row_data.to_vec() heap allocation by keeping field locations in a Vec<(usize, usize, [u8;8], [u8;8])> whose fixed-size byte buffers live inline in each entry — computed during the immutable phase and consumed during the mutable phase. This is the same invariant InnoDB enforces manually with pointer arithmetic.
WAL and Crash Recovery
The Write-Ahead Log (WAL) is AxiomDB’s durability mechanism. Before any change reaches the storage engine’s pages, a record of that change is appended to the WAL file. On crash recovery, the WAL is replayed to reconstruct any changes that were committed but not yet flushed to the data file.
WAL File Layout
The WAL file starts with a 32-byte file header followed by an unbounded sequence of WAL entries.
File Header — 32 bytes
Offset Size Field
0 4 magic — 0x57414C4E ("WALN") — identifies a valid WAL file
4 2 version — WAL format version (currently 1)
6 26 _reserved — Future use
WalReader::open verifies the magic and version before any scan. An incorrect
magic returns DbError::WalInvalidHeader.
Entry Binary Format
Each WAL entry is a self-delimiting binary record. The total entry length is stored both at the beginning and at the end to support both forward and backward scanning.
Offset Size Field
──────── ─────────── ─────────────────────────────────────────────────────
0 4 entry_len u32 LE — total entry length in bytes
4 8 lsn u64 LE — Log Sequence Number (globally monotonic)
12 8 txn_id u64 LE — Transaction ID (0 = autocommit)
20 1 entry_type u8 — EntryType (see below)
21 4 table_id u32 LE — table identifier (0 = system operations)
25 2 key_len u16 LE — key length in bytes (0 for BEGIN/COMMIT/ROLLBACK)
27 key_len key [u8] — mutation key bytes (heap RID or clustered PK)
?          4            old_val_len u32 LE — old value length (0 for INSERT, BEGIN, COMMIT, ROLLBACK)
?          old_val_len  old_value [u8] — old encoded row (empty on INSERT)
?          4            new_val_len u32 LE — new value length (0 for DELETE, BEGIN, COMMIT, ROLLBACK)
?          new_val_len  new_value [u8] — new encoded row (empty on DELETE)
? 4 crc32c u32 LE — CRC32c of all preceding bytes in this entry
? 4 entry_len_2 u32 LE — copy of entry_len for backward scan
Minimum size (no key, no values): 4+8+8+1+4+2 + 4+4+4+4 = 43 bytes
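The 43-byte minimum can be checked by serializing a payload-free transaction entry field by field, following the table above. In this sketch the CRC32c is stubbed to 0 (computing the real checksum is orthogonal to the layout):

```rust
/// Serialize a payload-free WAL entry (Begin/Commit/Rollback) following the
/// entry layout above. The CRC field is stubbed to 0 here; the real format
/// stores a CRC32c over all preceding bytes of the entry.
fn serialize_txn_entry(lsn: u64, txn_id: u64, entry_type: u8) -> Vec<u8> {
    const ENTRY_LEN: u32 = 43; // minimum size: no key, no values
    let mut buf = Vec::with_capacity(ENTRY_LEN as usize);
    buf.extend_from_slice(&ENTRY_LEN.to_le_bytes()); // entry_len      (4)
    buf.extend_from_slice(&lsn.to_le_bytes());       // lsn            (8)
    buf.extend_from_slice(&txn_id.to_le_bytes());    // txn_id         (8)
    buf.push(entry_type);                            // entry_type     (1)
    buf.extend_from_slice(&0u32.to_le_bytes());      // table_id       (4)
    buf.extend_from_slice(&0u16.to_le_bytes());      // key_len = 0    (2)
    buf.extend_from_slice(&0u32.to_le_bytes());      // old_val_len    (4)
    buf.extend_from_slice(&0u32.to_le_bytes());      // new_val_len    (4)
    buf.extend_from_slice(&0u32.to_le_bytes());      // crc32c (stub)  (4)
    buf.extend_from_slice(&ENTRY_LEN.to_le_bytes()); // entry_len_2    (4)
    buf
}
```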
Why entry_len_2 at the end
To traverse the WAL backward (during ROLLBACK or crash recovery), the reader needs to find the start of the previous entry given only the current position (end of entry).
entry_start = current_position - entry_len_2
The reader seeks to entry_start, reads entry_len, verifies it equals entry_len_2,
then reads the full entry. If the lengths do not match, the entry is corrupt.
entry_len at both ends of every entry enables backward scanning with a single seek per entry — no secondary index or reverse pointer table needed. The cost is 4 bytes per entry (overhead for a WAL with 10M entries: 40 MB, negligible relative to data payload).
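The backward step described above can be sketched as pure offset arithmetic over an in-memory WAL buffer (the real WalReader does the same with seeks on a File; the minimum-size constant mirrors the 43-byte floor above):

```rust
/// Given the absolute end offset of an entry in a WAL byte buffer, find the
/// start of that entry via the trailing entry_len_2 copy, and verify it
/// matches the leading entry_len — the corruption check described above.
fn prev_entry_start(wal: &[u8], entry_end: usize) -> Result<usize, &'static str> {
    const MIN_ENTRY: usize = 43;
    if entry_end > wal.len() || entry_end < MIN_ENTRY {
        return Err("position out of range");
    }
    let len2 =
        u32::from_le_bytes(wal[entry_end - 4..entry_end].try_into().unwrap()) as usize;
    if len2 < MIN_ENTRY {
        return Err("implausible entry length");
    }
    let start = entry_end.checked_sub(len2).ok_or("entry_len_2 past file start")?;
    let len = u32::from_le_bytes(wal[start..start + 4].try_into().unwrap()) as usize;
    if len != len2 {
        return Err("entry_len mismatch — corrupt entry");
    }
    Ok(start)
}
```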
Mutation Key Encoding
Heap and clustered mutations do not use the same key contract:
Heap INSERT / UPDATE / DELETE / UpdateInPlace:
key_len = 10
key[0..8] = page_id as u64 LE
key[8..10] = slot_id as u16 LE
ClusteredInsert / ClusteredDeleteMark / ClusteredUpdate (Phases 39.11 / 39.12):
key_len = primary_key_bytes.len()
key = encoded primary-key bytes
Heap mutations still record the exact page and slot where the row was written,
so redo can target the same physical location directly. Clustered mutations do
not: clustered pages defragment, split, merge, and relocate rows, so (page_id, slot_id) is not a stable undo key. Their payloads instead store the exact
logical row image and the latest clustered root_pid.
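The heap key contract above is small enough to show in full — a fixed 10-byte layout of page_id (u64 LE) followed by slot_id (u16 LE):

```rust
/// Encode the 10-byte heap mutation key: page_id (u64 LE) + slot_id (u16 LE).
fn encode_heap_key(page_id: u64, slot_id: u16) -> [u8; 10] {
    let mut key = [0u8; 10];
    key[0..8].copy_from_slice(&page_id.to_le_bytes());
    key[8..10].copy_from_slice(&slot_id.to_le_bytes());
    key
}

/// Decode a heap key back into its physical location. Clustered keys carry
/// variable-length primary-key bytes instead, so length 10 is not assumed
/// for them — a non-10-byte key simply isn't a heap key.
fn decode_heap_key(key: &[u8]) -> Option<(u64, u16)> {
    if key.len() != 10 {
        return None;
    }
    let page_id = u64::from_le_bytes(key[0..8].try_into().ok()?);
    let slot_id = u16::from_le_bytes(key[8..10].try_into().ok()?);
    Some((page_id, slot_id))
}
```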
Entry Types
pub enum EntryType {
    Begin = 1,           // START of an explicit transaction
    Commit = 2,          // COMMIT — all preceding entries for this txn_id are durable
    Rollback = 3,        // ROLLBACK — all preceding entries for this txn_id must be undone
    Insert = 4,          // INSERT: old_value is empty; new_value is the encoded new row
    Delete = 5,          // DELETE: old_value is the encoded row before deletion; new_value empty
    Update = 6,          // UPDATE: both old_value and new_value are present
    Checkpoint = 7,      // CHECKPOINT: marks the LSN up to which pages are flushed to disk
    Truncate = 8,        // Full-table delete (DELETE without WHERE, TRUNCATE TABLE)
    PageWrite = 9,       // Bulk insert page image + slot list
    UpdateInPlace = 10,  // Stable-RID same-slot update
    ClusteredInsert = 12,     // Clustered insert keyed by PK + exact new row image
    ClusteredDeleteMark = 13, // Clustered delete-mark keyed by PK + old/new row image
    ClusteredUpdate = 14,     // Clustered update keyed by PK + old/new row image
}
Transaction entries (Begin, Commit, Rollback) carry no key or value payload —
key_len = 0, old_val_len = 0, new_val_len = 0. The minimum entry size of 43 bytes
applies to these records.
PageWrite and UpdateInPlace are physical optimization records. They do not change
SQL-visible semantics; they only change how AxiomDB amortizes I/O for common write
patterns while preserving rollback and crash recovery guarantees.
WalEntry::Truncate — Full-Table Delete
WalEntry::Truncate (entry type 8) is emitted instead of N individual Delete
entries when a statement deletes every row in a table: DELETE FROM t without a
WHERE clause, and TRUNCATE TABLE t.
Binary Format
Field Value
─────────────── ────────────────────────────────────────────────────────
entry_type 8 (Truncate)
table_id the target table's ID (u32 LE)
key_len 8
key[0..8] root_page_id of the HeapChain as u64 LE
old_val_len 0 (empty — no per-row data stored)
new_val_len 0 (empty)
The key encodes the heap chain’s root page rather than a single slot, because the undo operation scans the entire chain.
Why One Entry Instead of N
For a 10,000-row table, the per-row path writes 10,000 Delete WAL entries. Each
entry carries at minimum 43 bytes of header plus the encoded row payload (old_value),
which may be hundreds of bytes. WalEntry::Truncate replaces all N entries with a
single 51-byte record (43-byte minimum + 8-byte key).
Per-row Delete path (N = 10,000 rows, avg 100-byte payload):
WAL entries: 10,000
WAL bytes written: 10,000 × (43 + 10 + 100) ≈ 1.5 MB
Truncate path:
WAL entries: 1
WAL bytes written: 51 bytes
InnoDB, by contrast, writes one undo record per row for DELETE FROM t without a WHERE clause. For a 10K-row table, InnoDB writes ~10,000 undo records; AxiomDB writes 1 WAL entry. This is the same optimization that MariaDB's storage engine API exposes via ha_delete_all_rows(), but AxiomDB applies it at the WAL level, not just the engine level.
Undo — Rollback and Crash Recovery
Because WalEntry::Truncate stores no per-row state, undo cannot simply replay
individual slot reverts from the WAL. Instead, undo calls
HeapChain::clear_deletions_by_txn(txn_id), which scans the heap chain and clears
the txn_id_deleted stamp on every slot that was deleted by this transaction:
Undo of WalEntry::Truncate for txn_id T:
for each page in the HeapChain:
read_page(page_id)
for each slot on the page:
if slot.txn_id_deleted == T:
slot.txn_id_deleted = 0
slot.deleted = 0
write_page(page_id, page)
The physical heap is fully restored: all rows that were alive before the DELETE become visible again to transactions with a snapshot predating txn_id T.
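The undo scan above can be sketched against an in-memory model of the heap — Slot here is a hypothetical reduction to the two fields the scan touches; the real code reads and rewrites whole heap pages:

```rust
/// In-memory sketch of truncate undo: clear the delete stamp on every slot
/// deleted by transaction `t`. The outer loop is O(P) in pages.
#[derive(Clone)]
struct Slot {
    txn_id_deleted: u64,
    deleted: u8,
}

fn clear_deletions_by_txn(pages: &mut Vec<Vec<Slot>>, t: u64) -> usize {
    let mut cleared = 0;
    for page in pages.iter_mut() {
        for slot in page.iter_mut() {
            if slot.txn_id_deleted == t {
                slot.txn_id_deleted = 0; // row is live again
                slot.deleted = 0;
                cleared += 1;
            }
        }
    }
    cleared
}
```

Slots deleted by other transactions are untouched, so a concurrent committed DELETE is never accidentally undone.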
An alternative design would store the affected slot list in the Truncate entry itself, enabling O(N) targeted undo without a full scan. We chose the scan approach because: (1) WAL writes are on the critical path of every DELETE; (2) undo (rollback and crash recovery) is rare relative to DELETE frequency; (3) the scan is O(P) in pages, not O(N) in rows, and P ≪ N at 200 rows/page. The trade-off mirrors MariaDB's ha_delete_all_rows() philosophy: optimize the common path (write), accept a bounded cost on the uncommon path (undo).
Crash Recovery Handling
During WAL replay, when the recovery engine encounters WalEntry::Truncate for a
committed transaction, it calls HeapChain::delete_batch() with all live slot IDs
found by scan_rids_visible() — re-applying the deletion to any pages that may not
have been flushed before the crash. If the transaction was not committed (no matching
Commit entry in the WAL), the entry is skipped: the heap still contains the
pre-delete state because the crash occurred before the commit was durable.
WalEntry::UpdateInPlace — Stable-RID UPDATE
WalEntry::UpdateInPlace (entry type 10) records a same-slot heap rewrite. It is
emitted when UPDATE can preserve the original (page_id, slot_id) because the new
encoded row still fits in the existing heap slot.
Since 6.20, the executor may emit many UpdateInPlace records through one
record_update_in_place_batch(...) call. The on-disk format does not change:
the optimization is only in how normal entries are serialized and appended
(reserve_lsns(...) + write_batch(...) once per statement instead of one append
call per row).
Binary Format
Field Value
─────────────── ───────────────────────────────────────────────────────────────
entry_type 10 (UpdateInPlace)
table_id target table ID
key logical row key carried by the caller
old_value [page_id:8][slot_id:2][old tuple image...]
new_value [page_id:8][slot_id:2][new tuple image...]
The tuple image is the full logical row image stored in the slot:
[RowHeader || encoded row bytes]
Undo and crash recovery decode the physical location from the first 10 bytes and then restore the old tuple image directly into the same slot.
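Decoding that payload is a fixed split at byte 10, per the format table above:

```rust
/// Decode the [page_id:8][slot_id:2][tuple image...] payload carried by
/// UpdateInPlace old_value / new_value.
fn decode_in_place_payload(val: &[u8]) -> Result<(u64, u16, &[u8]), &'static str> {
    if val.len() < 10 {
        return Err("payload shorter than location prefix");
    }
    let page_id = u64::from_le_bytes(val[0..8].try_into().unwrap());
    let slot_id = u16::from_le_bytes(val[8..10].try_into().unwrap());
    // Everything after the 10-byte prefix is the full tuple image:
    // [RowHeader || encoded row bytes]
    Ok((page_id, slot_id, &val[10..]))
}
```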
Why a New Entry Type Instead of Reusing Update
Classic Update in AxiomDB means logical delete+insert and therefore carries two
different physical locations. UpdateInPlace means “same physical location, bytes
changed in place”. Reusing Update would blur those two recovery contracts and make
undo logic branch on payload shape instead of entry type.
Undo and Recovery
Rollback and crash recovery treat UpdateInPlace as a direct restore:
read page(page_id)
restore old tuple image at slot_id
write page(page_id)
If the transaction committed, recovery leaves the rewritten bytes in place. If the
transaction did not commit, recovery restores old_value to the same slot.
Clustered Mutation Entries (Phases 39.11 / 39.12)
Phase 39.11 adds the first WAL contract for clustered rows, and Phase 39.12
extends it into clustered crash recovery:
key = encoded primary-key bytes
old_value = ClusteredRowImage? // absent on insert
new_value = ClusteredRowImage? // absent on pure delete undo payload
ClusteredRowImage:
[root_pid: u64]
[RowHeader: 24B]
[row_len: u32]
[row_data bytes]
TxnManager now tracks the latest clustered root_pid per table_id inside the
active transaction. Rollback and ROLLBACK TO SAVEPOINT use that tracked root
and clustered-tree helpers:
- undo clustered insert → delete_physical_by_key(...)
- undo clustered delete-mark / update → restore_exact_row_image(...)
Phases 39.14, 39.16, and 39.17 are the first SQL-visible executor users of that contract:
- a fresh clustered SQL insert records ClusteredInsert
- reusing a snapshot-invisible delete-marked clustered PK records ClusteredUpdate, because rollback must restore the old tombstone image, not simply delete the new row
- clustered SQL update now records the exact old clustered row image before the rewrite, even for same-leaf in-place updates and relocate-updates
- clustered SQL delete now records the exact old clustered row image before the delete-mark so rollback/savepoints can restore the prior txn_id_deleted = 0 state exactly
- clustered secondary bookmark entries still use the ordinary B+ Tree undo path, but 39.16 extends that undo to both halves of a rewritten secondary key: rollback can delete newly inserted bookmark entries and reinsert the old physical bookmark entry against the current index root
The invariant is intentionally logical: rollback restores the old primary-key row state, not the exact pre-change page topology. A relocate-update may split or merge the tree on the forward path, and rollback may restore the old row into a different physical leaf as long as the visible row state matches the original.
39.12 now uses the same payloads during crash recovery:
- reverse-undo in-progress clustered inserts by delete_physical_by_key(...)
- reverse-undo in-progress clustered delete-marks/updates by restore_exact_row_image(...)
- track the current clustered root per table while recovery undoes those writes
- seed TxnManager::open_with_recovery(...) with the final recovered root map
TxnManager::open(...) also reconstructs the latest committed clustered root
per table from surviving WAL history on a clean reopen.
Checkpoint Protocol — 5 Steps
A checkpoint ensures that all dirty pages below a given LSN are written to the .db
file so that WAL entries before that LSN can be safely truncated.
Step 1: Write a Checkpoint entry to the WAL with the current LSN.
This entry marks the start of the checkpoint.
Step 2: Call storage.flush() — ensures all dirty mmap pages are written
to disk via msync(). After this point, every page modification
with LSN ≤ checkpoint_lsn is on disk.
Step 3: Update the meta page (page 0) with the new checkpoint_lsn.
This is the commit point: if we crash after step 3, recovery
can skip all WAL entries with LSN ≤ checkpoint_lsn.
Step 4: Write the updated meta page to disk (flush again, just for page 0).
Step 5: Optionally truncate the WAL file, removing all entries with
        LSN ≤ checkpoint_lsn. (WAL rotation is planned — currently the
        WAL grows between checkpoints and is truncated wholesale at
        each checkpoint.)
If the process crashes between step 2 and step 3, the checkpoint LSN in the meta page still points to the previous checkpoint. Recovery replays from the old checkpoint LSN — this is safe because step 2 already flushed the pages.
Crash Recovery State Machine
AxiomDB tracks its recovery state through five well-defined phases. The state transitions are strictly sequential; no transition can be skipped.
CRASHED
│
│ detect: last shutdown was not clean (no clean-close marker)
▼
RECOVERING
│
│ open .db file: verify meta page checksum and format version
│ open .wal file: verify WAL header magic and version
▼
REPLAYING_WAL
│
│ scan WAL forward from checkpoint_lsn
│ for each entry with LSN > checkpoint_lsn:
│ if entry.txn_id is in the committed_set:
│ replay the mutation (redo)
│ else:
│ skip (uncommitted changes are discarded by ignoring)
│
│ committed_set = {txn_id for all txn_ids with a Commit entry in the WAL}
▼
VERIFYING
│
│ run heap structural check (all slot offsets within bounds,
│ no overlapping tuples, free_start < free_end)
│ run MVCC consistency check (xmin ≤ xmax for all live rows)
▼
READY
│
│ normal operation resumes
Why no UNDO pass
AxiomDB’s replay path is redo-only for the classic heap WAL entries that are
already replayable. Uncommitted transactions are simply ignored during the
forward scan. Because that heap WAL records physical locations (page_id, slot_id), the page that contained the uncommitted write is overwritten with the
committed state from the WAL. If the page has no committed mutations after the
checkpoint, it retains its pre-crash state (which was correct, because the
checkpoint flushed all committed changes up to checkpoint_lsn).
This avoids the dedicated UNDO pass required by ARIES-style redo/undo recovery (as in InnoDB), which must roll back uncommitted changes to B+ Tree pages in reverse order. Physical WAL with redo-only recovery is simpler and faster.
A physical WAL keyed by (page_id, slot_id) requires only one forward pass — uncommitted writes are simply overwritten by committed redo entries.
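The REPLAYING_WAL phase above reduces to two passes over the log. This sketch models an entry as only the fields recovery branches on (a hypothetical simplification — the real entries carry keys and payloads) and returns the LSNs that would be redone:

```rust
use std::collections::HashSet;

/// Simplified WAL entry: only the fields the recovery filter looks at.
struct Entry {
    lsn: u64,
    txn_id: u64,
    is_commit: bool,
    is_mutation: bool,
}

/// Redo-only replay filter: pass 1 builds committed_set, pass 2 keeps only
/// post-checkpoint mutations from committed transactions. Uncommitted
/// changes are discarded simply by being skipped.
fn replayable_lsns(wal: &[Entry], checkpoint_lsn: u64) -> Vec<u64> {
    let committed: HashSet<u64> =
        wal.iter().filter(|e| e.is_commit).map(|e| e.txn_id).collect();
    wal.iter()
        .filter(|e| e.lsn > checkpoint_lsn)
        .filter(|e| e.is_mutation)
        .filter(|e| committed.contains(&e.txn_id))
        .map(|e| e.lsn)
        .collect()
}
```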
For clustered entries, 39.12 adds the first recovery extension on top of that
model: unresolved clustered transactions are now undone by primary key and exact
row image instead of returning NotImplemented. The remaining gap is narrower:
clustered root persistence still depends on surviving WAL history and is not yet
checkpoint/rotation-stable.
WalReader Design
WalReader is stateless. It stores only the file path. Each scan call opens a new
File handle.
Forward scan (scan_forward): uses BufReader<File> to amortize syscall overhead
on sequential reads. Reads are sequential and predictable — the OS readahead prefetches
the next WAL sectors automatically.
Backward scan (scan_backward): uses a seekable File directly. BufReader
would be counterproductive here because seeks invalidate the read buffer. Each backward
step seeks to current_pos - 4 to read entry_len_2, then seeks back to
current_pos - entry_len_2 to read the full entry.
Corruption handling: both iterators return Result<WalEntry>. On the first corrupt
entry (truncated bytes, CRC mismatch, unknown entry type), the iterator yields an Err
and stops. The caller decides whether to propagate or recover gracefully.
WAL and Concurrency
ConcurrentWalWriter (Phase 40.4)
ConcurrentWalWriter replaces the single-threaded WalWriter inside TxnManager.
All public methods take &self — multiple transactions submit WAL entries without
serializing on a single exclusive lock.
Thread A Thread B
│ │
reserve_lsn()│ fetch_add(1,Relaxed) │reserve_lsn() ← lock-free ~2 ns
│ serialize entries │serialize entries ← fully parallel
│ │
Mutex<WriteQueue>::push() │ ← ~1 µs each
Mutex<WriteQueue>::push()
│ │
commit() │commit()
│ │
┌────▼──────────────────────▼────┐
│ Mutex<WriterState> (leader) │ ← one leader per fsync batch
│ drain_sorted() from queue │
│ write_entries() → BufWriter │
│ flush() → OS page cache │
│ fdatasync() → durable on disk │ ← one fsync covers all pending
│ flushed_lsn.fetch_max(...) │
└────────────────────────────────┘
Lock ordering (no deadlock):
- submit_entry: acquires queue_mutex only.
- flush_and_sync: acquires writer_mutex first, then queue_mutex briefly for drain.
- No function holds queue_mutex while waiting for writer_mutex.
Drop behavior: ConcurrentWalWriter::drop() calls flush_no_sync() — drains
the queue and flushes the BufWriter to the OS page cache without fsync. This
mirrors BufWriter<File>::drop and preserves crash-simulation semantics (durability
tests call drop(mgr) to simulate a process exit with OS cache flushed).
Single-Writer Model (pre-40.4)
Before Phase 40.4, WAL writes serialized through a single WalWriter inside TxnManager.
The server runtime uses Arc<tokio::sync::RwLock<Database>>: readers may overlap,
but mutating statements still serialize behind the write guard. This eliminates
write-write conflicts without record-level locking (Phase 13.7 will lift this
constraint).
WAL Fsync Pipeline (Phase 6.19)
The old timer-based CommitCoordinator from 3.19 is now superseded in the
server path by an always-on leader-based fsync pipeline inspired by
MariaDB’s group_commit_lock.
Connections still write Commit entries into the WAL BufWriter, but the
handoff after that changed:
- the connection calls pipeline.acquire(commit_lsn, txn_id)
- if another leader already flushed past commit_lsn → Expired
- if no leader is active → Acquired, this connection performs flush + fsync
- if a leader is active → Queued(rx), this connection releases the DB lock and awaits confirmation
Conn A → lock → DML → commit_deferred() → pipeline.acquire(42) → Acquired
flush+fsync → release_ok(42) → unlock → OK
Conn B → lock → DML → commit_deferred() → pipeline.acquire(43) → Queued(rx)
unlock → await rx ──────────────────────────────────────────────┐
Leader A fsync completes → flushed_lsn = 43 → wake B ─────────────────────────────┘
Conn C → lock → DML → commit_deferred() → pipeline.acquire(41) → Expired → OK
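The three-way acquire decision in that trace can be sketched as a small state machine — a hypothetical reduction of FsyncPipeline to the two fields the decision reads (the real struct also tracks pending_lsn and a waiter queue):

```rust
/// Outcome of a commit's attempt to join the fsync pipeline.
#[derive(Debug, PartialEq)]
enum AcquireResult {
    Expired,  // a previous leader's fsync already covered this commit_lsn
    Acquired, // this connection becomes the leader and performs flush + fsync
    Queued,   // a leader is active; wait for its confirmation
}

struct Pipeline {
    flushed_lsn: u64,
    leader_active: bool,
}

fn acquire(p: &mut Pipeline, commit_lsn: u64) -> AcquireResult {
    if p.flushed_lsn >= commit_lsn {
        AcquireResult::Expired
    } else if !p.leader_active {
        p.leader_active = true;
        AcquireResult::Acquired
    } else {
        AcquireResult::Queued
    }
}
```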
Durability Guarantee
A connection does not receive Ok until the fsync covering its Commit
entry completes. max_committed advances only after the leader confirms
durability. If the process crashes before that fsync, the transaction is lost
and no client received Ok. The durability guarantee is therefore identical to
inline fsync; only the scheduling changes.
Key Structures
| Component | Location | Role |
|---|---|---|
| FsyncPipeline | axiomdb-wal/src/fsync_pipeline.rs | Shared state: flushed_lsn, leader_active, pending_lsn, waiter queue |
| AcquireResult | same file | Expired / Acquired / Queued(rx) outcome for each commit |
| TxnManager::deferred_commit_mode | axiomdb-wal/src/txn.rs | Internal hook used by the server path to defer inline fsync until the pipeline leader runs |
| TxnManager::advance_committed() | same file | Advances max_committed to max(batch_txn_ids) after fsync |
| Database::take_commit_rx() | axiomdb-network/src/mysql/database.rs | Bridges SQL execution to pipeline acquire / leader fsync / follower await |
PageWrite Entry (Phase 3.18)
WalEntry::PageWrite (entry type 9) replaces N Insert entries with one entry per
heap page during bulk inserts. Instead of serializing one entry per row, the executor
groups rows by their target page and writes a single entry per page.
key: page_id as u64 LE (8 bytes)
old_value: empty
new_value: [page_bytes: PAGE_SIZE][num_slots: u16 LE][slot_id × N: u16 LE]
The page_bytes field contains the full post-modification page (16 KB for the default
page size). The embedded slot_ids let crash recovery undo uncommitted PageWrite
entries at slot granularity — identical in effect to undoing N individual Insert entries.
CPU cost comparison for 10K-row bulk insert (~42 pages at 16 KB):
Insert path (3.17): 10,000 × serialize_into() + 10,000 × CRC32c ← O(N rows)
PageWrite (3.18): 42 × serialize_into() + 42 × CRC32c ← O(P pages) — 238× less
WAL file size comparison for 10K rows:
Insert entries: 10,000 × ~100B = ~1 MB
PageWrite: 42 × ~16.9 KB = ~710 KB ← 30% smaller
Crash recovery for uncommitted PageWrite:
for each PageWrite entry in uncommitted txn:
page_id = entry.key[0..8] as u64 LE
num_slots = entry.new_value[PAGE_SIZE..+2] as u16 LE
for i in 0..num_slots:
slot_id = entry.new_value[PAGE_SIZE+2+i*2..+2] as u16 LE
mark_slot_dead(storage, page_id, slot_id) // same as undoing Insert
Batch WAL Append (Phase 3.17)
For bulk inserts (INSERT INTO t VALUES (r1),(r2),...) TxnManager::record_insert_batch()
writes all N Insert WAL entries in a single write_all call:
Per-row path (before 3.17):
for each of N rows: append_with_buf(entry, scratch) ← N × write_all to BufWriter
Batch path (3.17):
lsn_base = wal.reserve_lsns(N)
for each row: entry.serialize_into(&mut wal_scratch) ← accumulate in RAM
wal.write_batch(&wal_scratch) ← 1 × write_all
The entries written to disk are byte-for-byte identical to the per-row path — crash recovery reads them the same way. The improvement is purely in CPU and syscall overhead: O(1) BufWriter calls instead of O(N).
Combined with HeapChain::insert_batch() (O(P) page writes for P pages) and
a single parse+analyze pass for multi-row VALUES, the full bulk INSERT pipeline
is O(P) in both storage I/O and WAL I/O, where P = number of pages filled ≈ N/200.
The flush + fsync step still runs outside the WAL writer's internal append mutex, under the existing database write lock.
Compact PageWrite Format
The WalEntry::PageWrite entry was updated to eliminate the 16 KB page image:
Old format (per page):
new_value = [page_bytes: 16384 B][num_slots: u16 LE][slot_ids: u16 × N]
New compact format (per page):
new_value = [num_slots: u16 LE][slot_ids: u16 × N]
Crash recovery only needs slot IDs to mark inserted slots dead on undo — it never uses the stored page bytes. Eliminating them reduces WAL size from ~820 KB to ~20 KB per 10K-row batch (40× reduction).
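Decoding the compact payload is a length-prefixed u16 list, per the format above:

```rust
/// Decode the compact PageWrite payload: [num_slots: u16 LE][slot_id × N: u16 LE].
/// Returns the slot IDs that recovery marks dead when undoing an
/// uncommitted bulk insert.
fn decode_compact_page_write(new_value: &[u8]) -> Result<Vec<u16>, &'static str> {
    if new_value.len() < 2 {
        return Err("missing num_slots");
    }
    let n = u16::from_le_bytes(new_value[0..2].try_into().unwrap()) as usize;
    if new_value.len() < 2 + n * 2 {
        return Err("truncated slot list");
    }
    Ok((0..n)
        .map(|i| u16::from_le_bytes(new_value[2 + i * 2..4 + i * 2].try_into().unwrap()))
        .collect())
}
```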
MVCC and Transactions
Multi-Version Concurrency Control (MVCC) is AxiomDB’s mechanism for deciding which
row versions are visible to a given statement or transaction. This page documents
the current implementation: the RowHeader format, the actual
TransactionSnapshot type, the single-active-transaction TxnManager, and the
server’s Arc<RwLock<Database>> concurrency model.
Implementation status: current code implements snapshot visibility, READ COMMITTED and REPEATABLE READ semantics, rollback/savepoints, deferred page reclamation, and concurrent read-only queries. It does not yet implement row-level writer concurrency, deadlock detection,
SELECT ... FOR UPDATE, or full SSI. Those are planned in Phases 13.7, 13.8, and 13.8b.
Core Concepts
Transaction ID (TxnId)
Every explicit transaction receives a unique, monotonically increasing u64
identifier. The value 0 means “no active write transaction” and is used by
autocommit reads.
Transaction Snapshot
A snapshot is the compact visibility token used by the current runtime.
pub struct TransactionSnapshot {
    pub snapshot_id: u64,
    pub current_txn_id: u64,
}
Meaning:
- snapshot_id = max_committed + 1 at the moment the snapshot is taken
- current_txn_id = txn_id of the active transaction, or 0 for read-only / autocommit reads
A row version is visible when:
- txn_id_created == current_txn_id, or txn_id_created < snapshot_id
- and txn_id_deleted == 0, or txn_id_deleted >= snapshot_id and the delete was not performed by current_txn_id
RowHeader — Per-Row Versioning
Every heap tuple begins with a RowHeader:
Offset Size Field Description
──────── ────── ─────────────── ───────────────────────────────────────────────
0 8 txn_id_created transaction that inserted this row version
8 8 txn_id_deleted transaction that deleted this row (0 = live)
16 4 row_version incremented on UPDATE
20 4 _flags reserved for future use
Total: 24 bytes
The full lifecycle of a row version:
INSERT in txn T1:
RowHeader { txn_id_created: T1, txn_id_deleted: 0, row_version: 0 }
DELETE in txn T2:
RowHeader { txn_id_created: T1, txn_id_deleted: T2, row_version: 0 }
UPDATE in txn T2 (implemented as DELETE + INSERT):
Old version: RowHeader { txn_id_created: T1, txn_id_deleted: T2, row_version: N }
New version: RowHeader { txn_id_created: T2, txn_id_deleted: 0, row_version: N+1 }
Batch DELETE and Full-Table DELETE
When a DELETE has a WHERE clause, TableEngine::delete_rows_batch() collects all
matching (page_id, slot_id) pairs and calls HeapChain::delete_batch() with them.
Each affected slot receives xmax = txn_id and deleted = 1 in a single pass per
page. The WAL receives one WalEntry::Delete per matched row (for correct per-row
redo/undo).
When a DELETE has no WHERE clause or is a TRUNCATE TABLE, the executor takes a
different path:
1. HeapChain::scan_rids_visible() collects live (page_id, slot_id) pairs without decoding row data.
2. HeapChain::delete_batch() marks all slots dead in O(P) page I/O.
3. A single WalEntry::Truncate is appended to the WAL instead of N per-row Delete entries.
The MVCC visibility result is identical to the per-row path: every slot has
xmax = txn_id and deleted = 1, so any snapshot with xmax ≤ txn_id will see
the row as deleted after the transaction commits. Concurrent readers that took their
snapshot before this transaction began continue to see all rows as live throughout
the delete — standard snapshot isolation.
Visibility Function
fn is_visible(row: &RowHeader, snap: &TransactionSnapshot, self_txn_id: u64) -> bool {
    let created_visible =
        row.txn_id_created == self_txn_id || row.txn_id_created < snap.snapshot_id;
    let not_deleted =
        row.txn_id_deleted == 0
            || (row.txn_id_deleted >= snap.snapshot_id
                && row.txn_id_deleted != self_txn_id);
    created_visible && not_deleted
}
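The rule can be exercised against the row lifecycle from the RowHeader section. This is a self-contained copy for experimentation, with RowHeader and TransactionSnapshot reduced to the fields the check reads:

```rust
/// Reduced copies of the types, carrying only what the visibility rule reads.
struct RowHeader {
    txn_id_created: u64,
    txn_id_deleted: u64,
}

struct TransactionSnapshot {
    snapshot_id: u64,
}

/// Same visibility rule as above: a version is visible if we created it or it
/// was committed before our snapshot, and it is not deleted from our view.
fn is_visible(row: &RowHeader, snap: &TransactionSnapshot, self_txn_id: u64) -> bool {
    let created_visible =
        row.txn_id_created == self_txn_id || row.txn_id_created < snap.snapshot_id;
    let not_deleted = row.txn_id_deleted == 0
        || (row.txn_id_deleted >= snap.snapshot_id && row.txn_id_deleted != self_txn_id);
    created_visible && not_deleted
}
```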
TxnManager
The current TxnManager is a single-active-transaction coordinator. Read-only
operations access it via shared refs for snapshot creation; mutating operations
access it via &mut TxnManager for begin/commit/rollback.
pub struct TxnManager {
    wal: WalWriter,
    next_txn_id: u64,
    max_committed: u64,
    active: Option<ActiveTxn>,
}
This is the main reason the current server runtime is still single-writer for
mutating statements: there is only one ActiveTxn slot for the whole opened
database, not one write transaction owner per connection.
BEGIN
1. Verify `active.is_none()`
2. Assign `txn_id = next_txn_id`
3. Append `Begin` to the WAL
4. Set `active = Some(ActiveTxn { txn_id, snapshot_id_at_begin, ... })`
5. Increment `next_txn_id`
COMMIT
1. Append `Commit` to the WAL
2. Flush/fsync via the current durability policy or fsync pipeline
3. Advance `max_committed`
4. Clear `active`
ROLLBACK
1. Replay undo ops in reverse order
2. Append `Rollback` to the WAL
3. Clear `active`
Copy-on-Write B+ Tree and MVCC
The B+ Tree’s CoW semantics interact naturally with MVCC. When a writer creates a new page for an insert, concurrent readers continue accessing the old tree structure through the old root pointer they loaded at query start. The old pages are freed only when the writer’s root swap is complete AND all readers that loaded the old root have finished.
Since Phase 7.4, old pages enter the deferred free queue instead of being returned to the freelist immediately. This allows concurrent readers to continue accessing old tree structures through their snapshot while the writer has already swapped the root. Pages are released for reuse only when no active reader snapshot predates the free operation.
Current Server Lock Model (Phase 7.4 / 7.5)
The server wraps Database in Arc<RwLock<Database>>:
- SELECT, SHOW, and system variable queries acquire a read lock (db.read()). Multiple readers execute concurrently with zero coordination.
- INSERT, UPDATE, DELETE, DDL, and BEGIN/COMMIT/ROLLBACK acquire a write lock (db.write()). Only one writer at a time.
- A read that already started keeps its snapshot while a writer commits.
- New mutating statements queue behind the write lock at whole-database granularity.
- Row-level locking is not implemented yet. That work starts in Phase 13.7.
The read-only executor path (execute_read_only_with_ctx) takes &dyn StorageEngine
(shared ref) and &TxnManager (shared ref), ensuring it cannot mutate any state.
Isolation Levels — Implementation
READ COMMITTED
On every statement start within a transaction, a new snapshot is taken. The
TransactionSnapshot passed to the analyzer and executor is refreshed per statement.
REPEATABLE READ
The snapshot is taken once at BEGIN and held for the entire transaction’s lifetime.
All statements use the same snapshot.
The default isolation level is REPEATABLE READ (matching MySQL's default). Autocommit single-statement queries behave identically under either level, since a per-statement snapshot and a per-transaction snapshot coincide when the transaction contains exactly one statement.
INSERT … SELECT — Snapshot Isolation
INSERT INTO target SELECT ... FROM source executes the SELECT under the same
snapshot that was fixed at BEGIN. This is critical for correctness:
The Halloween problem is a classic database bug where an INSERT ... SELECT
on the same table re-reads rows it just inserted, causing an infinite loop (the
database inserts rows, those rows qualify the SELECT condition, they get inserted
again, ad infinitum).
AxiomDB prevents this automatically through MVCC snapshot semantics:
- The snapshot is fixed at BEGIN: snapshot_id = max_committed + 1
- Rows inserted by this statement get txn_id_created = current_txn_id
- The MVCC visibility rule: a row is visible only if txn_id_created < snapshot_id
- Since current_txn_id ≥ snapshot_id, newly inserted rows are never visible to the SELECT scan within the same transaction
Before BEGIN: source = {row_A (xmin=1), row_B (xmin=2)}
Snapshot taken: snapshot_id = 3
INSERT INTO source SELECT * FROM source:
SELECT sees: row_A (1 < 3 ✅), row_B (2 < 3 ✅) → 2 rows
Inserts: row_C (xmin=3), row_D (xmin=3) → 3 ≮ 3 ❌ not re-read
SELECT stops: only 2 original rows were seen
After COMMIT: source = {row_A, row_B, row_C, row_D} ← exactly 4 rows
This also means rows inserted by a concurrent transaction that commits after
this transaction’s BEGIN are not seen by the SELECT — consistent snapshot
throughout the entire INSERT operation.
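The worked example above can be replayed as a tiny simulation. This is an illustrative sketch of the visibility rule only, not the engine's scan code:

```rust
// A row is visible to the statement iff it was created before the snapshot.
fn visible(xmin: u64, snapshot_id: u64) -> bool {
    xmin < snapshot_id
}

fn main() {
    let snapshot_id = 3;            // fixed at BEGIN: max_committed + 1
    let current_txn_id = 3;         // current_txn_id >= snapshot_id
    let mut source = vec![1u64, 2]; // xmin of row_A, row_B

    // INSERT INTO source SELECT * FROM source:
    // the scan sees only rows visible under the fixed snapshot.
    let seen: Vec<u64> = source.iter().copied()
        .filter(|&xmin| visible(xmin, snapshot_id))
        .collect();
    for _ in &seen {
        source.push(current_txn_id); // new rows get xmin = current_txn_id
    }

    assert_eq!(seen.len(), 2);   // only row_A and row_B were scanned
    assert_eq!(source.len(), 4); // exactly 4 rows after COMMIT
}
```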
MVCC on Secondary Indexes (Phase 7.3b)
Secondary indexes store (key, RecordId) pairs — they do not contain transaction
IDs or version information. Visibility is always determined at the heap via the row’s
txn_id_created / txn_id_deleted fields.
Lazy Index Deletion
When a row is DELETEd, non-unique secondary index entries are not removed. The
heap row is marked deleted (txn_id_deleted = T), and the index entry becomes a
“dead” entry. Readers filter dead entries via is_slot_visible() during index scans.
Unique, primary key, and FK auto-indexes still have their entries deleted immediately because the B-Tree enforces key uniqueness internally.
UPDATE and Dead Entries
When an UPDATE changes an indexed column:
- Unique/PK/FK indexes: old entry deleted, new entry inserted (immediate)
- Non-unique indexes: old entry left in place (lazy), new entry inserted
Both old and new entries coexist in the B-Tree. The old entry points to a heap row
whose values no longer match the index key; is_slot_visible() filters it out.
Heap-Aware Uniqueness
When inserting into a unique index, if the key already exists, AxiomDB checks heap
visibility before raising a UniqueViolation. If the existing entry points to a dead
row (deleted or uncommitted), the insert proceeds — dead entries don’t block re-use
of the same key value.
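The heap check can be sketched as a predicate over the pointed-to heap row (types are hypothetical; the engine actually routes this through is_slot_visible()):

```rust
// Hypothetical heap-row state for the uniqueness check.
struct HeapRow {
    txn_id_deleted: u64, // 0 = not deleted
    delete_committed: bool,
}

// Does an existing unique-index entry block a new insert of the same key?
fn blocks_insert(existing: &HeapRow) -> bool {
    // A dead entry (committed delete) does not block key re-use.
    !(existing.txn_id_deleted != 0 && existing.delete_committed)
}

fn main() {
    let live = HeapRow { txn_id_deleted: 0, delete_committed: false };
    let dead = HeapRow { txn_id_deleted: 7, delete_committed: true };
    assert!(blocks_insert(&live));  // UniqueViolation is raised
    assert!(!blocks_insert(&dead)); // insert proceeds, key is re-used
}
```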
HOT Optimization
If an UPDATE does not change any column that participates in any secondary index, all index maintenance is skipped for that row — no B-Tree reads or writes. This is inspired by PostgreSQL’s Heap-Only Tuple (HOT) optimization.
ROLLBACK Support
Every new index entry (from INSERT or UPDATE) is recorded as
UndoOp::UndoIndexInsert in the transaction’s undo log. On ROLLBACK, these entries
are physically removed from the B-Tree. Old entries (from lazy delete) were never
removed, so they’re naturally restored.
Vacuum
Dead index entries accumulate until vacuum removes them. A dead entry is one where
is_slot_visible(entry.rid, oldest_active_snapshot) returns false — the pointed-to
heap row is deleted and no active snapshot can see it.
VACUUM — Dead Row and Index Cleanup (Phase 7.11)
The VACUUM command physically removes dead rows and dead index entries:
VACUUM orders; -- vacuum a specific table
VACUUM; -- vacuum all tables
Heap Vacuum
For each page in the heap chain, VACUUM finds slots where txn_id_deleted != 0
and txn_id_deleted < oldest_safe_txn (the deletion is committed and no active
snapshot can see it). These slots are zeroed via mark_slot_dead(), making them
invisible to read_tuple() without even reading the RowHeader.
Under the current Arc<RwLock<Database>> architecture, oldest_safe_txn = max_committed + 1 — all committed deletions are safe because no reader holds an
older snapshot.
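The slot-selection predicate can be sketched directly from the description above (Slot is a stand-in type; the real code works on page slot directories):

```rust
// Sketch of the heap-vacuum predicate: a slot is reclaimable when its
// deletion is committed and older than every active snapshot.
struct Slot { txn_id_deleted: u64 }

fn is_reclaimable(slot: &Slot, oldest_safe_txn: u64) -> bool {
    slot.txn_id_deleted != 0 && slot.txn_id_deleted < oldest_safe_txn
}

fn main() {
    let oldest_safe_txn = 10; // max_committed + 1 under the RwLock model
    let slots = [
        Slot { txn_id_deleted: 0 },  // live row
        Slot { txn_id_deleted: 4 },  // dead, committed long ago
        Slot { txn_id_deleted: 12 }, // deleted by a newer txn
    ];
    let dead: Vec<usize> = (0..slots.len())
        .filter(|&i| is_reclaimable(&slots[i], oldest_safe_txn))
        .collect();
    assert_eq!(dead, vec![1]); // only slot 1 is zeroed via mark_slot_dead()
}
```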
Index Vacuum
For each catalog-visible B-Tree index, VACUUM performs a full B-Tree scan and checks heap visibility for each entry. Dead entries (pointing to vacuumed or deleted heap slots) are batch-deleted from the B-Tree. If bulk delete rotates the root, the updated root page ID is persisted back to the catalog in the same transaction.
Clustered Vacuum
Clustered tables now use a different purge path:
- descend to the leftmost clustered leaf, then walk next_leaf
- purge leaf cells whose txn_id_deleted != 0 && txn_id_deleted < oldest_safe_txn
- free any overflow chain owned by the purged row
- conditionally defragment clustered leaves with high freeblock waste
- scan clustered secondary indexes and delete only entries whose PK bookmark no longer resolves to a physically present clustered row
This last rule is important: clustered secondary cleanup uses physical existence after purge, not snapshot visibility. An uncommitted clustered delete is invisible to the writer snapshot, but it is not safe to purge.
What VACUUM Does Not Do (Yet)
- VACUUM FULL / table rewrite: heap pages still do slot-level cleanup only, while clustered pages only do local defragmentation; there is no full-table rewrite pass yet.
- Automatic triggering: VACUUM must be invoked manually via SQL. Autovacuum with threshold-based triggering is planned.
For comparison, PostgreSQL's VACUUM marks reclaimed line pointers LP_UNUSED and
updates a free space map (FSM) so the space can be reused. AxiomDB keeps the
heap path simpler: dead slots are zeroed but heap pages are not compacted or
tracked through an FSM yet. Clustered VACUUM now does local leaf defragmentation
instead, while full-table rewrite remains a separate enhancement.
Epoch-Based Page Reclamation (Phase 7.8)
When a writer performs Copy-on-Write on a B-Tree node, old pages are deferred
(not immediately freed) because a concurrent reader might still reference them.
The SnapshotRegistry tracks which snapshots are active across all connections:
pub struct SnapshotRegistry {
    slots: Vec<AtomicU64>, // slot[conn_id] = snapshot_id or 0
}
- Register: a connection sets its slot before executing a read query
- Unregister: the connection clears its slot after the query completes
- oldest_active(): returns the minimum non-zero slot, or u64::MAX if idle
On flush(), the storage layer calls release_deferred_frees(oldest_active())
to return only pages freed before the oldest active snapshot to the freelist.
Pages freed after the oldest snapshot remain queued until all readers advance.
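A minimal, runnable sketch of the registry described above (fixed slot count; the 1024-slot production array and any per-connection plumbing are omitted):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SnapshotRegistry {
    slots: Vec<AtomicU64>, // slot[conn_id] = snapshot_id or 0
}

impl SnapshotRegistry {
    fn new(n: usize) -> Self {
        Self { slots: (0..n).map(|_| AtomicU64::new(0)).collect() }
    }
    fn register(&self, conn_id: usize, snapshot_id: u64) {
        self.slots[conn_id].store(snapshot_id, Ordering::Release);
    }
    fn unregister(&self, conn_id: usize) {
        self.slots[conn_id].store(0, Ordering::Release);
    }
    // O(N) lock-free scan; u64::MAX means "no active readers".
    fn oldest_active(&self) -> u64 {
        self.slots.iter()
            .map(|s| s.load(Ordering::Acquire))
            .filter(|&v| v != 0)
            .min()
            .unwrap_or(u64::MAX)
    }
}

fn main() {
    let reg = SnapshotRegistry::new(4);
    assert_eq!(reg.oldest_active(), u64::MAX); // idle: everything reclaimable
    reg.register(0, 7);
    reg.register(2, 5);
    assert_eq!(reg.oldest_active(), 5); // pages freed before snapshot 5 can go
    reg.unregister(2);
    assert_eq!(reg.oldest_active(), 7);
}
```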
Comparable engines track an equivalent lowest_active_start via an active
transaction list; InnoDB uses clone_oldest_view() to merge all active ReadViews.
AxiomDB uses a fixed-size atomic slot array (1024 slots) — O(N) scan without
locking. Under the current RwLock model all slots are 0 during flush (the writer
has exclusive access), so the behavior is identical to the previous
u64::MAX sentinel. The infrastructure is forward-compatible with
future concurrent reader+writer models.
Clustered UPDATE In-Place Undo (Phase 39.22)
UndoClusteredFieldPatch
When fused_clustered_scan_patch applies zero-allocation in-place UPDATE
(writing only the changed field bytes directly into the page buffer), the WAL
undo log records a UndoClusteredFieldPatch entry instead of the full-row-image
UndoClusteredRestore:
UndoClusteredFieldPatch {
    table_id: u32,
    key: Vec<u8>,                  // PK bytes for leaf descent
    old_header: RowHeader,         // txn_id_created, row_version to restore
    field_deltas: Vec<FieldDelta>, // each carries [u8;8] old_bytes
}
On ROLLBACK, the handler:
- Looks up the clustered root via clustered_roots
- Descends to the owning leaf via clustered_tree::descend_to_leaf_pub
- Searches for the cell by PK key via clustered_leaf::search
- For each FieldDelta, computes field_abs_off = row_data_abs_off + delta.offset and calls patch_field_in_place with delta.old_bytes[..delta.size]
- Restores the RowHeader via update_row_header_in_place
This is O(fields_changed × 1) per row, vs O(row_size) for a full UndoClusteredRestore.
FieldDelta Inline Arrays
FieldDelta stores field bytes as fixed-size [u8;8] arrays (field values for
fixed-size types such as BIGINT/REAL are at most 8 bytes):
// Before Phase 39.22
pub struct FieldDelta { pub offset: u16, pub size: u8, pub old_bytes: Vec<u8>, pub new_bytes: Vec<u8> }

// After Phase 39.22
pub struct FieldDelta { pub offset: u16, pub size: u8, pub old_bytes: [u8;8], pub new_bytes: [u8;8] }
WAL serialization writes only size bytes, so the on-disk format is identical.
Recovery code reads back (u16, u8, &[u8]) tuples and copies them into [u8;8]
arrays — no heap allocation during recovery either.
UndoClusteredFieldPatch stores them as inline [u8;8] arrays in the undo log entry — no heap allocation per field per row. For UPDATE t SET score = score + 1 across 25K rows, this eliminates ~50K allocations vs the old Vec<u8>-per-delta approach.
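The size-truncated encoding can be sketched as a round trip. Field names mirror the struct above; the exact wire layout (little-endian offset, then size, then the truncated byte runs) is an assumption for illustration:

```rust
struct FieldDelta { offset: u16, size: u8, old_bytes: [u8; 8], new_bytes: [u8; 8] }

// Write only `size` bytes of each inline array, matching the old Vec<u8> format.
fn serialize(d: &FieldDelta, out: &mut Vec<u8>) {
    out.extend_from_slice(&d.offset.to_le_bytes());
    out.push(d.size);
    out.extend_from_slice(&d.old_bytes[..d.size as usize]);
    out.extend_from_slice(&d.new_bytes[..d.size as usize]);
}

// Read back into fixed arrays: no heap allocation during recovery.
fn deserialize(buf: &[u8]) -> FieldDelta {
    let offset = u16::from_le_bytes([buf[0], buf[1]]);
    let size = buf[2] as usize;
    let (mut old_bytes, mut new_bytes) = ([0u8; 8], [0u8; 8]);
    old_bytes[..size].copy_from_slice(&buf[3..3 + size]);
    new_bytes[..size].copy_from_slice(&buf[3 + size..3 + 2 * size]);
    FieldDelta { offset, size: size as u8, old_bytes, new_bytes }
}

fn main() {
    let d = FieldDelta {
        offset: 40, size: 4,
        old_bytes: [1, 2, 3, 4, 0, 0, 0, 0],
        new_bytes: [5, 6, 7, 8, 0, 0, 0, 0],
    };
    let mut wal = Vec::new();
    serialize(&d, &mut wal);
    assert_eq!(wal.len(), 2 + 1 + 4 + 4); // only `size` bytes per array
    let back = deserialize(&wal);
    assert_eq!(back.old_bytes, d.old_bytes);
    assert_eq!(back.new_bytes, d.new_bytes);
}
```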
⚠️ Planned: Serializable Snapshot Isolation (Phase 7)
SSI detects read-write dependencies between concurrent transactions and aborts transactions that form a dangerous cycle. The implementation follows the algorithm from Cahill et al. (2008):
- Each transaction tracks its rw-antidependencies (read sets and write sets).
- At commit time, if the dependency graph contains a dangerous cycle (two transactions where each reads something the other wrote), one transaction is aborted with 40001 serialization_failure.
SSI provides true serializability (the strongest isolation level) with overhead proportional to the number of concurrent transactions and conflicts, not to the total number of rows.
B+ Tree — Hybrid Write Model
AxiomDB’s indexing layer is a persistent B+ Tree implemented over the StorageEngine
trait. Every index — including primary key and unique constraint indexes — is one such
tree.
Write model (Phase 5)
The tree uses a hybrid write model that minimizes page I/O on the hot path while keeping structural operations (splits, merges, rotations) on the safe allocate-new path:
| Operation | Write path | Alloc/free |
|---|---|---|
| Insert, no leaf split | In-place: same leaf page ID | 0 alloc / 0 free |
| Insert, child split absorbed by non-full parent | In-place: same parent page ID | 0 alloc / 0 free for the parent |
| Insert, leaf or internal split | Structural: alloc 2 new pages, free 1 | 2 alloc / 1 free |
| Delete, leaf stays ≥ MIN_KEYS_LEAF | In-place: same leaf page ID | 0 alloc / 0 free |
| Delete, parent pointer unchanged after child delete | Skip parent rewrite entirely | 0 alloc / 0 free for the parent |
| Delete, leaf underflows → rebalance | Structural: alloc new leaf | 1 alloc / 1 free |
| Batch delete, sorted exact keys | Page-local merge delete + one parent normalization pass | 0 alloc / 0 free on non-underfull pages; structural only where underflow happens |
This is the Phase 5 model for a serialized single writer (&mut self). Phase 7 will
reintroduce the full Copy-on-Write path to reconcile with lock-free readers and epoch
reclamation.
The tree currently requires &mut self on all mutations. Lock-free readers and
epoch-based reclamation (Phase 7) will determine how much of the in-place model
can be retained under concurrent read traffic.
Batch delete (delete_many_in) — sorted single-pass
Phase 5.19 adds a second delete mode to the tree:
BTree::delete_many_in(storage, &root_pid, &sorted_keys)
The contract is deliberately narrow:
- the caller already knows the exact encoded keys to delete
- keys are already sorted ascending
- the tree does no predicate evaluation and no SQL-layer reasoning
The algorithm is page-local and ordered:
- Leaf pages: merge the leaf’s sorted key array with the sorted delete slice and write one compacted survivor image.
- Internal pages: partition the delete slice by child range, recurse once per affected child, then normalize the parent once.
- Root collapse: run once at the very end of the batch.
This avoids the old N × delete_in(...) pattern where every key started from
the root and independently decided whether to rewrite or rebalance the same
pages.
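The leaf-level step can be sketched as a two-pointer merge over the leaf's sorted keys and the sorted delete slice, emitting the compacted survivor image in one pass (illustrative only; the real code works on fixed-layout page arrays):

```rust
// One pass: keys not matched by the sorted delete slice survive.
fn merge_delete(leaf_keys: &[&[u8]], sorted_deletes: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut survivors = Vec::new();
    let mut d = 0;
    for key in leaf_keys {
        // advance past delete keys smaller than the current leaf key
        while d < sorted_deletes.len() && sorted_deletes[d] < *key { d += 1; }
        if d < sorted_deletes.len() && sorted_deletes[d] == *key {
            d += 1; // exact match: drop this key
        } else {
            survivors.push(key.to_vec());
        }
    }
    survivors
}

fn main() {
    let leaf = [&b"a"[..], &b"c"[..], &b"e"[..], &b"g"[..]];
    let deletes = [&b"c"[..], &b"g"[..]];
    let out = merge_delete(&leaf, &deletes);
    assert_eq!(out, vec![b"a".to_vec(), b"e".to_vec()]);
}
```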
Page Capacity — Deriving ORDER_INTERNAL and ORDER_LEAF
Both node types must fit within PAGE_BODY_SIZE = 16,320 bytes (16 KB minus the
64-byte header). Each key occupies at most MAX_KEY_LEN = 64 bytes (zero-padded
on disk).
Internal Node Capacity
An internal node with n separator keys has n + 1 child pointers.
Header: 1 (is_leaf) + 1 (_pad) + 2 (num_keys) + 4 (_pad) = 8 bytes
key_lens: n × 1 = n bytes
children: (n + 1) × 8 = 8n + 8 bytes
keys: n × 64 = 64n bytes
Total = 8 + n + (8n + 8) + 64n = 16 + 73n
Solving 16 + 73n ≤ 16,320:
73n ≤ 16,304
n ≤ 223.3
ORDER_INTERNAL = 223 (largest integer satisfying the constraint).
Total size: 16 + 73 × 223 = 16 + 16,279 = 16,295 bytes ≤ 16,320 ✓
Leaf Node Capacity
A leaf node with n entries stores n keys and n record IDs. A RecordId
is 10 bytes: page_id (u64, 8 bytes) + slot_id (u16, 2 bytes).
Header: 1 (is_leaf) + 1 (_pad) + 2 (num_keys) + 4 (_pad) + 8 (next_leaf) = 16 bytes
key_lens: n × 1 = n bytes
rids: n × 10 = 10n bytes
keys: n × 64 = 64n bytes
Total = 16 + n + 10n + 64n = 16 + 75n
Solving 16 + 75n ≤ 16,320:
75n ≤ 16,304
n ≤ 217.4
ORDER_LEAF = 217 (largest integer satisfying the constraint).
Total size: 16 + 75 × 217 = 16 + 16,275 = 16,291 bytes ≤ 16,320 ✓
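Both derivations collapse to one integer division each, since every entry costs a fixed number of bytes beyond the 16 fixed header/pointer bytes. A sketch as compile-time constants (function names are illustrative, not the crate's):

```rust
const PAGE_BODY_SIZE: usize = 16_320;
const MAX_KEY_LEN: usize = 64;

// Internal node: 16 fixed bytes (header + extra child pointer), then
// 1 key_len + 8 child + 64 key = 73 bytes per separator key.
const fn order_internal() -> usize {
    (PAGE_BODY_SIZE - 16) / (1 + 8 + MAX_KEY_LEN)
}

// Leaf node: 16-byte header, then 1 key_len + 10 rid + 64 key = 75 bytes per entry.
const fn order_leaf() -> usize {
    (PAGE_BODY_SIZE - 16) / (1 + 10 + MAX_KEY_LEN)
}

fn main() {
    assert_eq!(order_internal(), 223); // 16_304 / 73
    assert_eq!(order_leaf(), 217);     // 16_304 / 75
}
```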
On-Disk Page Layout
Both node types use #[repr(C)] structs with all-u8-array fields so that
bytemuck::Pod (zero-copy cast) is safe without any implicit padding. All
multi-byte fields are stored little-endian.
Internal Node (InternalNodePage)
Offset Size Field Description
──────── ────── ─────────── ─────────────────────────────────────────────
0 1 is_leaf always 0
1 1 _pad0 alignment
2 2 num_keys number of separator keys (u16 LE)
4 4 _pad1 alignment
8 223 key_lens actual byte length of each key (0 = empty slot)
231 1,792 children 224 × [u8;8] — child page IDs (u64 LE each)
2,023 14,272 keys 223 × [u8;64] — separator keys, zero-padded
──────── ────── ─────────── ──────────────────────────────
Total: 16,295 bytes ≤ PAGE_BODY_SIZE ✓
This fixed-layout page is still the format used by the current production
axiomdb-index::BTree. Phase 39 does not mutate this structure in place.
Instead, the clustered rewrite is introducing separate storage-layer page
primitives for clustered leaves and clustered internal nodes.
Clustered Internal Primitive (Phase 39.2)
The new clustered internal page lives in axiomdb-storage, not in the current
axiomdb-index tree code. It uses a slotted variable-size layout:
[ClusteredInternalHeader: 16B]
is_leaf = 0
num_cells
cell_content_start
freeblock_offset
leftmost_child
[CellPtr array]
[Free gap]
[Cells: right_child | key_len | key_bytes]
The important compatibility rule is semantic, not structural:
- separator keys stay sorted
- find_child_idx(search_key) still returns the first separator strictly greater than the search key
- logical child 0 comes from leftmost_child
- logical child i > 0 comes from separator cell i - 1
That lets the clustered storage rewrite preserve B-tree navigation behavior
without reusing the old fixed-size MAX_KEY_LEN = 64 layout.
Clustered Insert Controller (Phase 39.3)
Phase 39.3 does not retrofit the current axiomdb-index::BTree into a
generic tree over fixed and clustered pages. Instead, axiomdb-storage
contains a dedicated controller in clustered_tree.rs that proves the first
full write path for clustered pages while the SQL executor still uses the
classic heap + index engine.
Algorithm shape:
- insert(storage, root_opt, ...) bootstraps a clustered leaf root if needed.
- Recursive descent chooses child pointers from ClusteredInternal.
- Leaf inserts stay in-place when the physical clustered-row descriptor fits.
- Large logical rows use local-prefix + overflow-page descriptors instead of an inline-only reject path.
- Fragmented leaves/internal pages call defragment() once before split.
- Leaf splits rebuild left/right pages by cumulative cell footprint.
- Internal splits rebuild left/right separator sets and promote one separator.
- Root overflow creates a fresh ClusteredInternal root.
Unlike the old structural Copy-on-Write tree, clustered 39.3 keeps the old
page ID as the left half on split and allocates only the new right sibling.
That is a conscious storage-first choice for the current single-writer runtime,
not the final concurrency model.
Clustered Point Lookup Controller (Phase 39.4)
Phase 39.4 extends that dedicated clustered controller with exact point
lookup:
- descend internal pages by separator key
- search the target leaf by exact key
- reconstruct the full logical row from the local leaf payload plus any overflow-page tail
- filter the hit through RowHeader::is_visible(snapshot)
The important scope cut is semantic rather than structural: the controller can
read the current inline row version, but it cannot yet chase older versions
because clustered older-version reconstruction is still future work. 39.11
adds rollback/savepoint restore for clustered writes, but not undo-aware read
traversal for arbitrary snapshots.
That means the current lookup(...) contract is:
- visible hit → Some(ClusteredRow)
- key absent → None
- current inline version invisible → None
This is a deliberate intermediate contract for the storage rewrite, not the final SQL-visible clustered read semantics.
Clustered Range Scan Controller (Phase 39.5)
Phase 39.5 adds the first ordered scan controller on top of the clustered
pages:
- determine the first relevant leaf for the lower bound
- determine the first relevant slot inside that leaf
- reconstruct and yield visible logical rows in ascending primary-key order
- follow next_leaf across the leaf chain
- stop as soon as the upper bound is exceeded
The controller is intentionally separate from the old fixed-layout
axiomdb-index::RangeIter. The two trees now have different physical layouts
and different row payload semantics:
- classic B+ Tree leaf: (key, RecordId)
- clustered leaf: (key, RowHeader, total_row_len, local_prefix, overflow_ptr?)
So the right reuse point is the iterator shape, not the implementation.
The semantic boundary remains the same as in 39.4: the range iterator can
return or skip only the current inline version. Undo-aware older-version
reconstruction is still future work.
Clustered Update Controller (Phase 39.6)
Phase 39.6 adds the first mutation path that rewrites an existing clustered
row in place:
- descend to the owning leaf by exact primary key
- check visibility of the current inline version
- build the replacement inline header with a bumped row_version
- materialize either an inline or overflow-backed replacement descriptor
- rewrite the exact leaf cell without changing the key
- keep the row in the same leaf or fail explicitly
This controller is intentionally narrower than a full B-tree update:
- it does not move the row to another leaf
- it does not split or merge the tree
- it does not touch parent separators
- it does not maintain secondary indexes yet
That makes 39.6 a true clustered-storage step, not a disguised merge of
39.6, 39.7, 39.8, and 39.9.
The page-local rewrite itself has two modes:
- overwrite the existing cell directly when the replacement encoded payload fits the current cell budget
- rebuild the same leaf compactly when the row grows but still fits on that page
If neither is possible, the controller returns HeapPageFull.
Clustered Delete Controller (Phase 39.7)
Phase 39.7 adds the first logical delete path over clustered rows:
- descend to the owning leaf by exact primary key
- check visibility of the current inline version
- preserve the existing key, row payload, txn_id_created, and row_version
- stamp txn_id_deleted = delete_txn_id
- persist the same leaf page without structural tree change
This controller is intentionally narrower than a full B-tree delete:
- it does not remove the physical cell
- it does not merge or rebalance leaves
- it does not change parent separators
- it does not maintain secondary indexes yet
That makes 39.7 the logical-delete companion to 39.6, not a disguised
merge of 39.7, 39.8, 39.11, and 39.18.
Clustered Structural Controller (Phase 39.8)
Phase 39.8 adds the first controller that can structurally shrink and
rebalance the clustered tree:
- call update_in_place(...) as the fast path
- on HeapPageFull, load the visible current row
- physically delete the exact clustered cell through a private tree path
- propagate two signals upward:
  - underfull — the child now needs sibling redistribute/merge
  - min_changed — the child's minimum key changed and the parent separator must be repaired
- rebalance clustered leaf and internal siblings by encoded byte volume
- collapse an empty internal root
- reinsert the replacement row through the clustered insert controller
That makes 39.8 the structural companion to 39.6 and 39.7, not a
shortcut around later purge / undo / secondary-index phases.
Current 39.8 limits remain explicit:
- relocate-update still rewrites only the current inline version
- public delete still does not purge dead clustered cells
- parent separator repair does not yet split the parent if the repaired key itself overflows the page budget
Clustered Secondary Bookmarks (Phase 39.9)
Phase 39.9 adds a dedicated bookmark-bearing secondary-key layout in
axiomdb-sql::clustered_secondary.
Instead of treating the BTree payload RecordId as the row locator, the
clustered path now encodes the physical secondary key as:
secondary_logical_key ++ missing_primary_key_columns
That gives the future clustered executor the exact secondary -> primary key
bridge it needs:
- scan the secondary B-tree by logical key prefix
- decode the appended PK bookmark from the physical secondary key
- probe the clustered tree by that primary key
This subphase is intentionally narrower than full executor integration:
- heap-visible SQL still uses RecordId-based secondaries
- clustered bookmark scans are a dedicated path, not a replacement for the old planner/executor yet
- unique clustered secondaries check logical-key conflicts before insert, even though the physical key contains a PK suffix for stable row identity
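The secondary-to-primary bridge can be sketched as an encode/decode pair. The length-suffix trailer used here is an assumed encoding for illustration; the real scheme lives in axiomdb-sql::clustered_secondary:

```rust
// Physical secondary key = logical secondary key ++ PK bookmark,
// plus (assumed) a 2-byte trailer recording the logical-key length.
fn encode_bookmark_key(secondary: &[u8], pk_suffix: &[u8]) -> Vec<u8> {
    let mut k = Vec::with_capacity(secondary.len() + pk_suffix.len() + 2);
    k.extend_from_slice(secondary);
    k.extend_from_slice(pk_suffix);
    k.extend_from_slice(&(secondary.len() as u16).to_le_bytes());
    k
}

// Recover the appended PK bookmark from the physical key.
fn decode_pk_bookmark(physical: &[u8]) -> &[u8] {
    let n = physical.len();
    let sec_len = u16::from_le_bytes([physical[n - 2], physical[n - 1]]) as usize;
    &physical[sec_len..n - 2]
}

fn main() {
    let physical = encode_bookmark_key(b"alice", b"\x00\x00\x00\x2a");
    // 1. scan the secondary B-tree by logical key prefix
    assert!(physical.starts_with(b"alice"));
    // 2. decode the PK bookmark, then probe the clustered tree with it
    assert_eq!(decode_pk_bookmark(&physical), &b"\x00\x00\x00\x2a"[..]);
}
```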
Clustered Overflow Pages (Phase 39.10)
Phase 39.10 adds the first large-row storage layer for clustered leaves.
The physical clustered leaf cell is now:
[key_len: u16]
[total_row_len: u32]
[RowHeader: 24B]
[key bytes]
[local row prefix]
[overflow_first_page?: u64]
And the overflow-page chain is:
[next_overflow_page: u64]
[payload bytes]
Important invariant:
- split / merge / rebalance reason about the physical leaf footprint
- lookup / range reconstruct the logical row bytes only when returning rows
- the primary key and RowHeader never leave the clustered leaf page
This is still narrower than full large-value support:
- no generic TOAST/BLOB reference layer yet
- no compression yet
- 39.11 adds internal WAL / rollback for clustered row images
- 39.12 adds clustered crash recovery for those row images
- delete-mark keeps the overflow chain reachable until later purge
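Logical-row reconstruction from the cell layout above can be sketched as: local prefix from the leaf cell, then the payload of each overflow page in chain order. The u64::MAX terminator and the in-memory page map are assumptions for illustration:

```rust
use std::collections::HashMap;

struct OverflowPage {
    next: u64,        // next_overflow_page; u64::MAX = end of chain (assumed)
    payload: Vec<u8>,
}

// Rebuild the logical row bytes only when returning the row to the caller.
fn reconstruct_row(local_prefix: &[u8], first: u64,
                   pages: &HashMap<u64, OverflowPage>) -> Vec<u8> {
    let mut row = local_prefix.to_vec();
    let mut pid = first;
    while pid != u64::MAX {
        let page = &pages[&pid];
        row.extend_from_slice(&page.payload);
        pid = page.next;
    }
    row
}

fn main() {
    let mut pages = HashMap::new();
    pages.insert(7u64, OverflowPage { next: 9, payload: b"ter ".to_vec() });
    pages.insert(9u64, OverflowPage { next: u64::MAX, payload: b"tail".to_vec() });
    let row = reconstruct_row(b"ou", 7, &pages);
    assert_eq!(row, b"outer tail".to_vec());
}
```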
Clustered WAL and Recovery (Phases 39.11 / 39.12)
Phases 39.11 and 39.12 add the first clustered durability contract on top
of the new page formats:
- clustered inserts append EntryType::ClusteredInsert
- clustered delete-marks append EntryType::ClusteredDeleteMark
- clustered updates append EntryType::ClusteredUpdate
- each WAL key is the primary key, not a physical slot identifier
- each payload stores an exact ClusteredRowImage
- TxnManager tracks the latest clustered root per table_id
- rollback and crash recovery restore logical row state by primary key and exact row image
This controller is still intentionally narrower than a full topology-physical
recovery story. 39.12 closes clustered crash recovery by reusing the same
PK + row-image semantics, while exact root persistence beyond WAL
checkpoint/rotation remains future work.
Rollback therefore promises logical row restoration, not exact physical topology restoration. A relocate-update may leave a different split/merge shape after rollback as long as the old primary-key row is back.
SQL-Visible Clustered INSERT (Phase 39.14)
Phase 39.14 is the first point where the SQL executor writes into the
clustered tree instead of only the storage tests doing so.
The executor branch now does this:
- resolve the clustered table plus its logical primary index metadata
- derive PK bytes from that primary-index column order
- check for a visible existing PK through clustered lookup
- insert the new row through clustered_tree::insert(...), or reuse a snapshot-invisible delete-marked physical key through restore_exact_row_image(...)
- maintain non-primary indexes as secondary_key ++ pk_suffix bookmarks
- persist the final clustered table root and any changed secondary roots
Fresh clustered keys emit EntryType::ClusteredInsert. Reused delete-marked
physical keys emit EntryType::ClusteredUpdate so rollback can restore the old
tombstone image instead of merely deleting the newly-inserted row.
This is still narrower than final clustered SQL behavior:
- clustered UPDATE is now SQL-visible in 39.16
- clustered DELETE is now SQL-visible as logical delete-mark in 39.17
- clustered secondary covering reads still normalize back to clustered row fetches until a true clustered index-only optimization exists
- older-snapshot reconstruction after reusing a tombstoned PK still depends on later clustered version-chain work
Leaf Node (LeafNodePage)
Offset Size Field Description
──────── ────── ─────────── ─────────────────────────────────────────────
0 1 is_leaf always 1
1 1 _pad0 alignment
2 2 num_keys number of (key, rid) pairs (u16 LE)
4 4 _pad1 alignment
8 8 next_leaf page_id of the next leaf (u64 LE); u64::MAX = no next
16 217 key_lens actual byte length of each key
233 2,170 rids 217 × [u8;10] — RecordId (page_id:8 + slot_id:2)
2,403 13,888 keys 217 × [u8;64] — keys, zero-padded
──────── ────── ─────────── ──────────────────────────────
Total: 16,291 bytes ≤ PAGE_BODY_SIZE ✓
Copy-on-Write Root Swap
The root page ID is stored in an AtomicU64. Writers and readers interact with
it as follows.
Reader Path
// Acquire load: guaranteed to see all writes that happened before
// the Release store that set this root.
let root_id = self.root.load(Ordering::Acquire);
let root_page = storage.read_page(root_id)?;
// traverse down — no locks acquired
Writer Path
// 1. Load the current root
let old_root_id = self.root.load(Ordering::Acquire);

// 2. Walk from old_root down to the target leaf, collecting the path
let path = find_path(&storage, old_root_id, key)?;

// 3. For each node on the path (leaf first, then up to root):
//    a. alloc_page → new_page_id
//    b. copy content from old page
//    c. apply the mutation (insert key/split/rebalance)
//    d. update the parent's child pointer to new_page_id

// 4. The new root was written as a new page
let new_root_id = path[0].new_page_id;

// 5. Atomic swap — Release store: all prior writes visible to Acquire loads
self.root.store(new_root_id, Ordering::Release);

// 6. Free the old path pages (only safe after all readers have moved on)
for old_id in old_page_ids { storage.free_page(old_id)?; }
A reader that loaded old_root_id before the swap continues accessing old pages
safely — they are freed only after all reads complete (tracked in Phase 7 with
epoch-based reclamation).
Readers load the root with Acquire semantics and traverse the tree without acquiring any lock. A write in progress is invisible to readers until the Release store completes — at which point the entire new subtree is already consistent. This is what allows read throughput to scale linearly with core count.
Why next_leaf Is Not Used in Range Scans
The leaf node format includes a next_leaf pointer for a traditional linked-list
traversal across leaf nodes. However, this pointer is not used by RangeIter.
Reason: Under CoW, when a leaf is split or modified, a new page is created. The
previous leaf in the linked list still points to the old page (L_old), which has
already been freed. Keeping the linked list consistent under CoW would require copying
the previous leaf on every split — but finding the previous leaf during an insert
requires traversing from the root (the tree has no backward pointers).
Adopted solution: RangeIter re-traverses the tree from the root to find the
next leaf when crossing a leaf boundary. The cost is O(log n) per boundary crossing,
not O(1) as with a linked list. For a tree of 1 billion rows with ORDER_LEAF = 217,
the depth is log₂₁₇(10⁹) ≈ 4, so each boundary crossing is 4 page reads.
Measured cost for a range scan of 10,000 rows: 0.61 ms — well within the 45 ms budget.
Insert — CoW Split Protocol
1. Descend from root to the target leaf, recording the path.
2. If the leaf has room (num_keys < fill_threshold):
→ Copy the leaf to a new page.
→ Insert the new (key, rid) in sorted position.
→ Update the parent's child pointer (CoW the parent too).
→ Propagate CoW up to the root.
3. If the leaf is at or above the fill threshold:
→ Allocate two new leaf pages.
→ Distribute: left gets floor((ORDER_LEAF+1)/2) entries,
right gets the remaining entries.
→ The smallest key of the right leaf becomes the separator key
pushed up to the parent.
→ CoW the parent, insert the new separator and child pointer.
→ If the parent is also full, recursively split upward.
→ If the root splits, allocate a new root with two children.
The split point fill_threshold depends on the index fill factor (see below).
Internal pages always split at ORDER_INTERNAL regardless of fill factor.
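Step 3's split distribution can be checked with a one-line calculation, shown here as a sketch (constant and function names are illustrative):

```rust
const ORDER_LEAF: usize = 217;

// After a leaf split, the left page keeps floor((ORDER_LEAF + 1) / 2)
// entries and the right page takes the rest; the right page's smallest
// key becomes the separator pushed into the parent.
fn split_counts(total: usize) -> (usize, usize) {
    let left = (ORDER_LEAF + 1) / 2;
    (left, total - left)
}

fn main() {
    // a full leaf (217 keys) plus the incoming key = 218 entries
    let (left, right) = split_counts(ORDER_LEAF + 1);
    assert_eq!((left, right), (109, 109));
}
```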
Fill Factor — Adaptive Leaf Splits
The fill factor controls how full leaf pages are allowed to get before splitting.
It is set per-index via WITH (fillfactor=N) on CREATE INDEX and stored in
IndexDef.fillfactor: u8.
Formula
fill_threshold(order, ff) = ⌈order × ff / 100⌉ (integer ceiling division)
| fillfactor | fill_threshold (ORDER_LEAF = 217) | Effect |
|---|---|---|
| 100 (compact) | 217 | Splits only when completely full — max density, slowest inserts on busy pages |
| 90 (default) | 196 | Leaves ~10% free — balances density and insert speed |
| 70 (write-heavy) | 152 | Leaves ~30% free — fewer splits for append-heavy workloads |
| 10 (minimum) | 22 | Very sparse pages — extreme fragmentation, rarely useful |
A compile-time assert verifies that fill_threshold(ORDER_LEAF, 100) == ORDER_LEAF,
ensuring fillfactor=100 always preserves the original behavior exactly.
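The formula is plain integer ceiling division, sketched here (the crate's exact code may differ):

```rust
// fill_threshold(order, ff) = ⌈order × ff / 100⌉ via integer arithmetic.
fn fill_threshold(order: u32, fillfactor: u32) -> u32 {
    (order * fillfactor + 99) / 100
}

fn main() {
    assert_eq!(fill_threshold(217, 100), 217); // fillfactor=100 keeps full capacity
    assert_eq!(fill_threshold(217, 70), 152);  // ~30% free space on each leaf
    assert_eq!(fill_threshold(217, 10), 22);   // very sparse pages
}
```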
Internal page splits always occur at ORDER_INTERNAL. Only leaf splits benefit
from the extra free space, because inserts always land on leaf pages. Applying
fill factor to internal pages would reduce tree fan-out without any benefit for
typical insert patterns, matching PostgreSQL's implementation of the same concept.
Catalog field
IndexDef.fillfactor is serialized as a single byte appended after the predicate
section in the catalog heap entry. Pre-6.8 index rows are read with a default of 90
(backward-compatible). Valid range: 10–100; values outside this range are rejected
at CREATE INDEX parse time with a ParseError.
When to use a lower fill factor
- Append-heavy tables — rows inserted in bulk after the index is created. A fill factor of 70–80 prevents cascading splits during the bulk load.
- Write-heavy OLTP — high-frequency single-row inserts that land on the same hot pages. More free space means fewer page splits per second.
- Read-heavy / archival — use fillfactor=100. Maximum density reduces I/O for full scans at the cost of slower writes.
Minimum Occupancy Invariant
All nodes except the root must remain at least half full after any operation:
- Internal nodes: num_keys ≥ ORDER_INTERNAL / 2 = 111
- Leaf nodes: num_keys ≥ ORDER_LEAF / 2 = 108
Violations of this invariant during delete trigger rebalancing (redistribution from a sibling if possible, merge otherwise).
rotate_right key-shift invariant
When rotate_right borrows the last key of the left sibling and inserts it at
position 0 of the underfull child (internal node case), all existing keys in the
child must be shifted right by one position before inserting the new key.
The shift must cover positions 0..cn → 1..cn+1, implemented with a reverse
loop (same pattern as insert_at). Using slice::rotate_right(1) on [..cn]
is incorrect: it moves key[cn-1] to position 0 where it is immediately
overwritten, leaving position cn with stale data from a previous operation.
The stale byte can exceed MAX_KEY_LEN = 64, causing a bounds panic on the next
traversal of that node.
// Correct: explicit reverse loop
for i in (0..cn).rev() {
    child.key_lens[i + 1] = child.key_lens[i];
    child.keys[i + 1] = child.keys[i];
}
child.key_lens[0] = sep_len;
child.keys[0] = sep_key;
Prefix Compression — In-Memory Only
Internal node keys are often highly redundant. For a tree indexing sequential IDs,
consecutive separator keys share long common prefixes. AxiomDB implements
CompressedNode as an in-memory representation:
struct CompressedNode {
    prefix: Box<[u8]>,        // longest common prefix of all keys in this node
    suffixes: Vec<Box<[u8]>>, // remainder of each key after stripping the prefix
}
When an internal node page is read from disk, it is optionally decompressed into a
CompressedNode for faster binary search (searching on suffix bytes only). When the
node is written back, the full keys are reconstructed. This is a read optimization
only — the on-disk format always stores full keys.
The compression ratio depends on key structure. For an 8-byte integer key, there is no
prefix to compress. For a 64-byte composite key (category_id || product_name), the
category_id prefix is shared across many consecutive keys and is compressed away.
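The prefix/suffix split can be illustrated with a small helper. This is an illustrative sketch of the decomposition, not the engine's `CompressedNode` construction code:

```rust
/// Split a key set into its longest common prefix and per-key suffixes.
fn split_prefix(keys: &[&[u8]]) -> (Vec<u8>, Vec<Vec<u8>>) {
    if keys.is_empty() {
        return (Vec::new(), Vec::new());
    }
    let mut prefix = keys[0].to_vec();
    for k in &keys[1..] {
        // shrink the candidate prefix to the bytes shared with this key
        let common = prefix.iter().zip(k.iter()).take_while(|(a, b)| a == b).count();
        prefix.truncate(common);
    }
    let suffixes = keys.iter().map(|k| k[prefix.len()..].to_vec()).collect();
    (prefix, suffixes)
}
```

Binary search then compares only suffix bytes after a single prefix comparison per node visit.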
Tree Height and Fan-Out
| Rows | Tree depth | Notes |
|---|---|---|
| 1–217 | 1 (root = leaf) | Entire tree is one leaf page |
| 218–47,089 | 2 | One root internal + up to 218 leaves |
| 47K–10.2M | 3 | Two levels of internals |
| 10.2M–2.22B | 4 | Covers billion-row tables comfortably |
| >2.22B | 5 | Rare; still fast at O(log n) traversal |
A tree of 1 billion rows has depth 4 — a point lookup requires reading 4 pages (1 per level). At 16 KB pages, a warm cache point lookup is ~4 memory accesses with no disk I/O.
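The break points in the table correspond to an effective fan-out of about 217 keys per level (217² = 47,089; 217⁴ ≈ 2.22 B). A minimal sketch, assuming that uniform fan-out:

```rust
/// Smallest depth whose capacity (fanout^depth) covers `rows`.
fn tree_depth(rows: u64, fanout: u64) -> u32 {
    let mut depth = 1;
    let mut capacity = fanout; // depth 1: root is a single leaf
    while capacity < rows {
        capacity = capacity.saturating_mul(fanout);
        depth += 1;
    }
    depth
}
```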
Static API — Shared-Storage Operations (Phase 6.2)
BTree normally owns its Box<dyn StorageEngine>. This is convenient for tests but
prevents sharing one MmapStorage between the table heap and multiple indexes. Phase
6.2 adds static functions that accept an external &mut dyn StorageEngine:
// Point lookup — read-only, no ownership needed
BTree::lookup_in(storage: &dyn StorageEngine, root_pid: u64, key: &[u8])
    -> Result<Option<RecordId>, DbError>

// Insert — mutates storage, updates root_pid atomically on root split
BTree::insert_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, key: &[u8], rid: RecordId)
    -> Result<(), DbError>

// Delete — mutates storage, updates root_pid atomically on root collapse
BTree::delete_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, key: &[u8])
    -> Result<bool, DbError>

// Batch delete — removes many pre-sorted keys in one left-to-right pass (5.19)
BTree::delete_many_in(storage: &mut dyn StorageEngine, root_pid: &AtomicU64, keys: &[Vec<u8>])
    -> Result<(), DbError>

// Range scan — collects all (RecordId, key_bytes) in [lo, hi] into a Vec
BTree::range_in(storage: &dyn StorageEngine, root_pid: u64, lo: Option<&[u8]>, hi: Option<&[u8]>)
    -> Result<Vec<(RecordId, Vec<u8>)>, DbError>
These delegate to the same private helpers as the owned API. The insert_in and
delete_in variants use AtomicU64::store(Release) instead of compare_exchange
(safe in Phase 6 — single writer).
Batch delete primitive (delete_many_in) — subphase 5.19
delete_many_in accepts a slice of pre-sorted encoded keys and removes all of them
from one index in a single left-to-right tree traversal. The caller is responsible
for sorting keys ascending before the call; the primitive enforces this as a
precondition.
Algorithm:
- batch_delete_subtree(root) dispatches on node type.
- Leaf node: binary-search the sorted keys against the leaf’s key array. Remove all matching slots in one pass, compact in-place, write the page once. If the leaf becomes underfull, signal the parent for merge/redistribute.
- Internal node: binary-partition the key slice by separator keys so each child subtree receives only the keys that fall within its range. Recurse into each child that has at least one key to remove. After all children return, rewrite the internal node once if any child pid or separator changed; skip the rewrite otherwise.
- After the recursive pass, root_pid is updated atomically once via AtomicU64::store(Release).
Invariants preserved:
- Tree height stays balanced (leaf depth is uniform after the pass).
- In-place fast path from 5.17 is reused: leaf and internal rewrites skip page alloc/free when the node fits in the same page.
- Root is persisted exactly once per delete_many_in call regardless of how many keys were removed.
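The binary-partition step at internal nodes maps naturally onto `slice::partition_point`. An illustrative sketch (the real code additionally pairs each partition with its child page id):

```rust
/// Split a sorted key batch at a separator: keys < sep go to the left
/// child, keys >= sep to the right. partition_point is a binary search,
/// so each internal node splits its batch in O(log n) comparisons per
/// separator.
fn partition_batch<'a>(keys: &'a [&'a [u8]], sep: &[u8]) -> (&'a [&'a [u8]], &'a [&'a [u8]]) {
    let idx = keys.partition_point(|k| *k < sep);
    keys.split_at(idx)
}
```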
range_in returns Vec<(RecordId, Vec<u8>)> rather than an iterator to avoid
lifetime conflicts between the borrow of storage needed to drive the iterator and the
caller's existing `&mut storage` borrow. The heap reads happen after the range scan
completes, which requires full ownership of the results.
Order-Preserving Key Encoding (Phase 6.1b)
Secondary index keys are encoded as byte slices in axiomdb-sql/src/key_encoding.rs
such that encode(a) < encode(b) iff a < b under SQL comparison semantics. Each
Value variant is prefixed with a 1-byte type tag:
| Type | Tag | Payload | Order property |
|---|---|---|---|
| NULL | 0x00 | none | Sorts before all non-NULL |
| Bool | 0x01 | 1 byte | false < true |
| Int(i32) | 0x02 | 8 BE bytes after n ^ i64::MIN | Negative < positive |
| BigInt(i64) | 0x03 | 8 BE bytes after n ^ i64::MIN | Negative < positive |
| Real(f64) | 0x04 | 8 bytes (NaN=0, pos=MSB set, neg=all flipped) | IEEE order |
| Decimal(i128, u8) | 0x05 | 1 (scale) + 16 BE bytes after sign-flip | |
| Date(i32) | 0x06 | 8 BE bytes after sign-flip | |
| Timestamp(i64) | 0x07 | 8 BE bytes after sign-flip | Older < newer |
| Text | 0x08 | NUL-terminated UTF-8, 0x00 escaped as [0xFF, 0x00] | Lexicographic |
| Bytes | 0x09 | NUL-terminated, same escape | Lexicographic |
| Uuid | 0x0A | 16 raw bytes | Lexicographic |
For composite keys the encodings are concatenated — the first column has the most significant sort influence.
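The integer and float mappings from the table can be checked in a few lines. This is a sketch of the scheme described above, not the key_encoding.rs source:

```rust
/// Map an i64 to bytes whose lexicographic order matches numeric order:
/// XOR with i64::MIN flips the sign bit, turning the signed range into an
/// ascending unsigned one; big-endian keeps the most significant byte first.
fn encode_i64(n: i64) -> [u8; 8] {
    ((n ^ i64::MIN) as u64).to_be_bytes()
}

/// f64: positive values get the sign bit set; negative values are fully
/// inverted so that more-negative sorts lower.
fn encode_f64(x: f64) -> [u8; 8] {
    let bits = x.to_bits();
    let mapped = if bits >> 63 == 0 { bits | (1 << 63) } else { !bits };
    mapped.to_be_bytes()
}
```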
NULL handling: NULL values are not inserted into secondary index B-Trees. This is
consistent with SQL semantics (NULL ≠ NULL) and avoids DuplicateKey errors when
multiple NULLs appear in a UNIQUE index. WHERE col = NULL always falls through to a
full scan.
Maximum key length: 768 bytes. DML that produces a longer key returns
DbError::IndexKeyTooLong; during CREATE INDEX, over-long keys are silently skipped.
Catalog System
The catalog is AxiomDB’s schema repository. It stores the definition of logical databases, tables, columns, indexes, constraints, foreign keys, and planner statistics, then makes that information available to the SQL analyzer and executor through a consistent, MVCC-aware reader interface.
Design Goals
- Self-describing: The catalog tables are themselves stored as regular heap pages. The engine needs no external schema file.
- Persistent: Catalog data survives crashes. The WAL treats catalog mutations like any other transaction.
- MVCC-visible: A DDL statement that creates a table is visible to subsequent statements in the same transaction but invisible to concurrent transactions until committed.
- Bootstrappable: An empty database file contains no catalog rows. The first
open()runs a special bootstrap path that allocates the catalog roots and inserts the default logical databaseaxiomdb.
System Tables
The catalog consists of eight logical heaps rooted from the meta page. User-facing introspection is documented in Catalog & Schema.
| Table | Meta offset | Contents |
|---|---|---|
| axiom_tables | 32 | One row per user-visible table |
| axiom_columns | 40 | One row per column, in declaration order |
| axiom_indexes | 48 | One row per index (includes partial index predicate since Phase 6.7) |
| axiom_constraints | 72 | Named CHECK constraints (Phase 4.22b) |
| axiom_foreign_keys | 84 | One row per FK constraint (Phase 6.5) |
| axiom_stats | 96 | Per-column NDV and row_count for planner (Phase 6.10) |
| axiom_databases | 104 | One row per logical database |
| axiom_table_databases | 112 | Optional table ownership binding by database |
Each root page is stored at the corresponding u64 body offset in the meta page
(page 0). Older database files may have 0 in the new database offsets; the
open path upgrades them lazily by allocating the roots and inserting
axiomdb.
A tempting shortcut would be to reuse schema_name inside TableDef to fake a
database namespace. Keeping database ownership in axiom_table_databases
preserves on-disk compatibility now and leaves room for a real CREATE SCHEMA
later, instead of collapsing two separate namespaces into one field.
axiom_databases row format (DatabaseDef)
[name_len: 1 byte u8]
[name: name_len UTF-8 bytes]
Fresh databases always contain:
axiomdb
axiom_table_databases row format (TableDatabaseDef)
[table_id: 4 bytes LE u32]
[name_len: 1 byte u8]
[database_name: name_len UTF-8 bytes]
Missing binding row means: this is a legacy table owned by axiomdb.
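A round-trip sketch of the TableDatabaseDef layout (hypothetical helper names; the real codec lives in the catalog module):

```rust
fn encode_table_db(table_id: u32, database_name: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(5 + database_name.len());
    buf.extend_from_slice(&table_id.to_le_bytes()); // [table_id: 4 bytes LE u32]
    buf.push(database_name.len() as u8);            // [name_len: 1 byte u8]
    buf.extend_from_slice(database_name.as_bytes());
    buf
}

fn decode_table_db(buf: &[u8]) -> (u32, String) {
    let table_id = u32::from_le_bytes(buf[0..4].try_into().unwrap());
    let name_len = buf[4] as usize;
    let name = String::from_utf8(buf[5..5 + name_len].to_vec()).unwrap();
    (table_id, name)
}
```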
axiom_stats row format (StatsDef)
[table_id: 4 bytes LE u32]
[col_idx: 2 bytes LE u16]
[row_count: 8 bytes LE u64] — visible rows at last ANALYZE / CREATE INDEX
[ndv: 8 bytes LE i64] — distinct non-NULL values (PostgreSQL stadistinct encoding)
ndv encoding (same as PostgreSQL stadistinct):
- > 0 → absolute count (e.g. 9999 unique emails)
- = 0 → unknown → planner uses DEFAULT_NUM_DISTINCT = 200
Stats root is lazily initialized at first write (ensure_stats_root). Pre-6.10
databases open without migration: list_stats returns empty vec when root = 0,
causing the planner to use the conservative default (always use index).
Stats are bootstrapped at CREATE INDEX time by reusing the table scan already
performed for B-Tree build — no extra I/O. ANALYZE TABLE refreshes them with
an exact full-table NDV count.
axiom_foreign_keys row format (FkDef)
[fk_id: 4 bytes LE u32]
[child_table_id: 4 bytes LE u32] — table with the FK column
[child_col_idx: 2 bytes LE u16] — FK column index in child table
[parent_table_id:4 bytes LE u32] — referenced (parent) table
[parent_col_idx: 2 bytes LE u16] — referenced column in parent table
[on_delete: 1 byte u8 ] — FkAction encoding, see below
[on_update: 1 byte u8 ] — same encoding
[fk_index_id: 4 bytes LE u32] — 0 = user-provided index (not auto-created)
[name_len: 4 bytes LE u32]
[name: name_len bytes UTF-8]
FkAction encoding: 0 = NoAction, 1 = Restrict, 2 = Cascade,
3 = SetNull, 4 = SetDefault.
fk_index_id = 0 means the FK column already had a user-provided index; the FK
did not auto-create one and will not drop one on DROP CONSTRAINT.
axiom_indexes — predicate extension (Phase 6.7)
The IndexDef binary format was extended in Phase 6.7 with a backward-compatible
predicate section appended after the columns:
[...existing fields...][ncols:1][col_idx:2, order:1]×ncols
[pred_len:2 LE][pred_sql: pred_len UTF-8 bytes] ← absent on pre-6.7 rows
pred_len = 0 (or section absent) → full index. Pre-6.7 databases open without
migration because from_bytes checks bytes.len() > consumed before reading
the predicate section.
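The backward-compatible read can be sketched like this (an assumed helper shape; the real from_bytes tracks its cursor differently):

```rust
/// Parse the optional predicate section appended after the column list.
/// Pre-6.7 rows simply end before it, so "no more bytes" means full index.
fn parse_predicate(bytes: &[u8], consumed: usize) -> Option<String> {
    if bytes.len() < consumed + 2 {
        return None; // pre-6.7 row: predicate section absent
    }
    let pred_len = u16::from_le_bytes([bytes[consumed], bytes[consumed + 1]]) as usize;
    if pred_len == 0 {
        return None; // section present but empty: full index
    }
    let sql = &bytes[consumed + 2..consumed + 2 + pred_len];
    String::from_utf8(sql.to_vec()).ok()
}
```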
CatalogBootstrap
CatalogBootstrap is a one-time procedure that runs when open() encounters an
empty database file (or a file with the meta page uninitialized).
Bootstrap Sequence
1. Allocate page 0 (Meta page).
Write format_version, zero for catalog_root_page, freelist_root_page, etc.
2. Allocate the freelist root page.
Initialize the bitmap (all pages allocated so far are marked used).
Write freelist_root_page into the meta page.
3. Allocate heap roots for catalog tables and aux heaps:
`axiom_tables`, `axiom_columns`, `axiom_indexes`, `axiom_constraints`,
`axiom_foreign_keys`, `axiom_stats`, `axiom_databases`, `axiom_table_databases`.
4. Insert the default database row `axiomdb` into `axiom_databases`.
5. Persist every root page id into the meta page.
6. Flush pages and WAL.
Fresh bootstrap uses txn_id = 0 for the default database row because no user
transaction exists yet. If a pre-22b.3a database is reopened, ensure_database_roots
upgrades it in-place and inserts axiomdb exactly once.
CatalogReader
CatalogReader provides read-only access to the catalog from any component that
needs schema information (primarily the SQL analyzer).
pub struct CatalogReader<'a> {
    storage: &'a dyn StorageEngine,
    snapshot: TransactionSnapshot,
}

impl<'a> CatalogReader<'a> {
    /// List all user tables visible to this snapshot.
    pub fn list_tables(&mut self, schema: &str) -> Result<Vec<TableDef>, DbError>;

    /// List all logical databases visible to this snapshot.
    pub fn list_databases(&mut self) -> Result<Vec<DatabaseDef>, DbError>;

    /// Find a specific table by schema + name.
    pub fn get_table(&mut self, schema: &str, name: &str) -> Result<Option<TableDef>, DbError>;

    /// Find a specific table by database + schema + name.
    pub fn get_table_in_database(
        &mut self,
        database: &str,
        schema: &str,
        name: &str,
    ) -> Result<Option<TableDef>, DbError>;

    /// List columns for a table in declaration order.
    pub fn list_columns(&mut self, table_id: u64) -> Result<Vec<ColumnDef>, DbError>;

    /// List indexes for a table.
    pub fn list_indexes(&mut self, table_id: u64) -> Result<Vec<IndexDef>, DbError>;
}
The snapshot parameter ensures catalog reads are MVCC-consistent. A DDL statement
that has not yet committed is invisible to other transactions’ CatalogReader.
Effective database resolution
Catalog lookup is now keyed by a three-part name:
(database, schema, table)
The resolver applies one legacy rule:
if no explicit table->database binding exists:
effective database = "axiomdb"
That rule is what lets old databases keep working without rewriting existing
TableDef rows.
Schema Types
pub struct TableDef {
    pub id: u32,
    pub root_page_id: u64,   // heap root or clustered-tree root
    pub storage_layout: TableStorageLayout,
    pub schema_name: String,
    pub table_name: String,
    pub schema_version: u64, // monotonic counter for plan cache invalidation (Phase 40.2)
}

pub enum TableStorageLayout {
    Heap = 0,
    Clustered = 1,
}

// On-disk format for axiom_tables rows (3 generations, all backward-compatible):
//
// v0 (legacy, no trailing bytes):
//   [table_id:4 LE][root_page_id:8 LE][schema_len:1][schema UTF-8][name_len:1][name UTF-8]
//   → storage_layout = Heap, schema_version = 1
//
// v1 (1 trailing byte):
//   ... [layout:1]
//   → layout from byte, schema_version = 1
//
// v2 (9 trailing bytes, current):
//   ... [layout:1][schema_version:8 LE]
//   → layout and schema_version from bytes
//
// `schema_version` is initialized to 1 at table creation. It is bumped by:
// CREATE INDEX, DROP INDEX, ALTER TABLE (any op), DROP TABLE, TRUNCATE TABLE.
// Plans whose deps include (table_id, old_version) detect staleness on next
// lookup without scanning the entire plan cache (Phase 40.2 OID invalidation).

pub struct ColumnDef {
    pub table_id: u64,
    pub col_index: usize,              // zero-based position within the table
    pub col_name: String,
    pub data_type: DataType,           // from axiomdb-core::types::DataType
    pub not_null: bool,
    pub default_value: Option<String>, // DEFAULT expression as source text
}

pub struct IndexDef {
    pub id: u64,
    pub table_id: u64,
    pub index_name: String,
    pub is_unique: bool,
    pub is_primary: bool,
    pub columns: Vec<String>, // indexed column names in key order
    pub root_page_id: u64,    // B+ Tree root, or clustered table root for PRIMARY KEY metadata
}
DDL Mutations Through the Catalog
When the executor processes CREATE TABLE, it:
- Opens a write transaction (or participates in the current one).
- Allocates a new TableId from the meta page sequence.
- Chooses the table layout:
  - no explicit PRIMARY KEY → Heap
  - explicit PRIMARY KEY → Clustered
- Allocates the primary row-store root page:
  - Heap → PageType::Data
  - Clustered → PageType::ClusteredLeaf
- Inserts a row into axiom_tables with {id, root_page_id, storage_layout, schema_name, table_name}.
- Inserts one row per column into axiom_columns.
- Persists index metadata:
  - clustered tables reuse table.root_page_id for the logical PRIMARY KEY index row
  - UNIQUE secondary indexes still allocate ordinary PageType::Index roots
- Appends all these mutations to the WAL.
- Commits (or defers the commit to the surrounding transaction).
The root_page_id stored in axiom_tables is now the single entry point for the
table’s primary row store. Heap DML still uses it as the heap-chain root today;
clustered INSERT / SELECT now use it as the clustered row-store root, while
heap-only executor paths still reject clustered UPDATE / DELETE instead of
touching the wrong page format.
Because the catalog is stored in heap pages and indexed like any other table, all
crash recovery mechanisms apply automatically: WAL replay will reconstruct the catalog
state after a crash in the middle of CREATE TABLE, just as it would reconstruct
any other table mutation.
Catalog Page Organization
Page 0: Meta page (format_version, catalog_root_page, freelist_root_page, ...)
Page 1: FreeList bitmap root
Pages 2–N: B+ Tree pages for axiom_tables
Pages N+1–M: Heap pages for axiom_tables row data
Pages M+1–P: B+ Tree pages for axiom_columns
...
Pages P+1–Q: User table data begins here
The exact page assignments depend on database growth. Page 0 always remains the meta page. All other page assignments are dynamic — the freelist tracks which pages are in use, and the meta page records the root page IDs for each catalog B+ Tree.
Catalog Invariants
The following invariants must hold at all times. The startup verifier in
axiomdb-sql::index_integrity now re-checks the index-related ones after WAL
recovery and before server or embedded mode starts serving traffic:
- Every table listed in axiom_tables has at least one row in axiom_columns.
- Every column in axiom_columns references a table_id that exists in axiom_tables.
- Every index in axiom_indexes references a table_id that exists in axiom_tables.
- Every non-clustered root_page_id in axiom_indexes points to a page of type Index.
- A clustered table’s PRIMARY KEY metadata row in axiom_indexes reuses the table root_page_id and therefore may point to ClusteredLeaf / ClusteredInternal.
- Every column listed in an index definition exists in the referenced table.
- No two tables in the same schema have the same name.
- No two indexes on the same table have the same name.
Startup index integrity verification
For every catalog-visible heap table:
- enumerate the expected entries from heap-visible rows
- enumerate the actual B+ Tree entries from root_page_id
- compare them exactly
- if the tree is readable but divergent, rebuild a fresh root from heap
- rotate the catalog root in a WAL-protected transaction
- defer free of the old tree pages until commit durability is confirmed
Clustered tables are skipped for now because their logical PRIMARY KEY metadata
no longer points at a classic B+ Tree root. If a heap-side tree cannot be
traversed safely, open fails with IndexIntegrityFailure. The database does
not enter a best-effort serving mode with an untrusted index.
As with REINDEX, AxiomDB rebuilds a readable but divergent index from heap rows
instead of trying to patch arbitrary leaf-level damage in place. This keeps recovery logic small
and makes the catalog root swap the only logical state transition.
Row Codec
The row codec converts between &[Value] (the in-memory representation used by
the executor) and &[u8] (the on-disk binary format stored in heap pages). The codec
is in axiomdb-types::codec.
Binary Format
┌──────────────────────────────────────────────────────────────────┐
│ null_bitmap: ceil(n_cols / 8) bytes │
│ bit i = (bitmap[i/8] >> (i%8)) & 1 == 1 → column i is NULL │
├──────────────────────────────────────────────────────────────────┤
│ For each non-NULL column, in column declaration order: │
│ Bool → 1 byte (0x00 = false, 0x01 = true) │
│ Int, Date → 4 bytes little-endian i32 │
│ BigInt, Real → 8 bytes little-endian i64 / f64 │
│ Timestamp → 8 bytes little-endian i64 (µs UTC) │
│ Decimal → 16 bytes little-endian i128 mantissa │
│ + 1 byte u8 scale │
│ Uuid → 16 bytes as-is (big-endian by convention) │
│ Text, Bytes → 3 bytes u24 LE length prefix │
│ + length bytes raw UTF-8 / raw bytes │
└──────────────────────────────────────────────────────────────────┘
NULL columns are indicated only in the null bitmap. No bytes are written for NULL values in the payload section. This means:
- A row with all columns NULL (just the null bitmap) encodes to ceil(n_cols/8) bytes.
- A row with no NULL columns encodes to ceil(n_cols/8) bytes (all-zero bitmap) plus the sum of each column’s fixed width or variable-length payload.
Column Type Sizes
| Value variant | SQL type | Encoded size |
|---|---|---|
| Bool | BOOL, BOOLEAN | 1 byte |
| Int | INT, INTEGER | 4 bytes |
| BigInt | BIGINT | 8 bytes |
| Real | REAL, DOUBLE | 8 bytes (f64, IEEE 754) |
| Decimal(m,s) | DECIMAL, NUMERIC | 17 bytes (16 i128 + 1 scale) |
| Uuid | UUID | 16 bytes |
| Date | DATE | 4 bytes (i32 days) |
| Timestamp | TIMESTAMP | 8 bytes (i64 µs UTC) |
| Text | TEXT, VARCHAR, CHAR | 3 + len bytes |
| Bytes | BYTEA, BLOB | 3 + len bytes |
Null Bitmap
The null bitmap occupies ceil(n_cols / 8) bytes at the start of every encoded row.
The bits are packed little-endian: bit 0 of byte 0 corresponds to column 0, bit 1 of
byte 0 to column 1, …, bit 0 of byte 1 to column 8, and so on.
n_cols = 5 → 1 byte (bits 5–7 are unused and always 0)
n_cols = 8 → 1 byte (all 8 bits used)
n_cols = 9 → 2 bytes (bit 0 of byte 1 = column 8)
n_cols = 64 → 8 bytes
n_cols = 65 → 9 bytes
Reading column i:
let bit = (bitmap[i / 8] >> (i % 8)) & 1;
let is_null = bit == 1;
Setting column i as NULL:
bitmap[i / 8] |= 1 << (i % 8);
This design saves 7 bytes per nullable column compared to wrapping each value in
Option<T> (which adds a full word of overhead in Rust’s memory layout).
Why u24 for Variable-Length Fields
The length prefix for Text and Bytes is 3 bytes (a u24 in little-endian). This
covers strings up to 16,777,215 bytes (~16 MB). The codec enforces this limit with
DbError::ValueTooLarge.
Why not u32 (4 bytes)?
The codec has two independent size limits:
- Codec limit (u24): Text/Bytes may not exceed 16,777,215 bytes per value.
- Storage limit (~16 KB): An encoded row must fit within
MAX_TUPLE_DATA, which is approximatelyPAGE_BODY_SIZE - RowHeader_size - SlotEntry_size.
In practice, a single row almost never approaches 16 MB (the codec limit). If it did, it would far exceed the storage limit and be rejected by the heap layer anyway. Using u24 saves 1 byte per string column — for a table with 10 text columns, every row is 10 bytes smaller. At 100 million rows, that is 1 GB of disk savings.
The u24 also signals that future TOAST (out-of-line storage for large values) will take over before values approach 16 MB — TOAST is planned for Phase 6.
Why i128 for DECIMAL
DECIMAL values are represented as (mantissa: i128, scale: u8). The actual value is
mantissa × 10^(-scale).
Decimal(123456789, 2) → 1,234,567.89
Decimal(-199, 2) → -1.99
Decimal(0, 0) → 0
i128 provides 38 significant decimal digits, which matches DECIMAL(38, s) — the
maximum precision supported by most SQL databases including PostgreSQL and SQL Server.
The alternative, rust_decimal::Decimal, packs the same i128 internally but adds
struct overhead and a dependency. The AxiomDB codec stores the i128 mantissa and
scale byte directly, with no intermediary struct.
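Rendering a (mantissa, scale) pair back to text is simple string surgery. An illustrative helper, not the engine's formatter:

```rust
fn decimal_to_string(mantissa: i128, scale: u8) -> String {
    let negative = mantissa < 0;
    let mut digits = mantissa.unsigned_abs().to_string();
    let scale = scale as usize;
    // pad so at least one integer digit remains left of the point
    while digits.len() <= scale {
        digits.insert(0, '0');
    }
    let point = digits.len() - scale;
    let body = if scale == 0 {
        digits
    } else {
        format!("{}.{}", &digits[..point], &digits[point..])
    };
    if negative { format!("-{}", body) } else { body }
}
```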
encoded_len — O(n) Without Allocation
encoded_len(values, types) computes the exact byte count that encode_row would
produce, without allocating a buffer.
pub fn encoded_len(values: &[Value], types: &[DataType]) -> usize {
    let bitmap_bytes = values.len().div_ceil(8);
    let payload: usize = values.iter().zip(types.iter())
        .filter(|(v, _)| !v.is_null())
        .map(|(v, dt)| fixed_size(dt) + variable_overhead(v))
        .sum();
    bitmap_bytes + payload
}
This is used by the heap insertion path to check whether the encoded row fits in the remaining free space on the target page — without actually encoding it first.
encode_row — Single Pass, No Intermediate Buffer
pub fn encode_row(values: &[Value], types: &[DataType]) -> Result<Vec<u8>, DbError>;
The encoder makes one pass over the columns:
- Writes the null bitmap (all zero initially).
- For each column: if the value is Value::Null, sets the corresponding bitmap bit; otherwise, type-checks the value against the declared type and appends the encoded bytes.
- Returns the complete Vec<u8>.
The type check step catches programmer errors early (e.g., passing Value::Text for a
column declared DataType::Int). It returns DbError::TypeMismatch rather than
writing corrupted bytes.
decode_row — Position-Tracking Cursor
pub fn decode_row(bytes: &[u8], types: &[DataType]) -> Result<Vec<Value>, DbError>;
The decoder walks bytes with a position cursor:
- Reads the null bitmap from the first ceil(n_cols/8) bytes.
- For each column in order:
  - If the corresponding bitmap bit is 1 → push Value::Null.
  - Otherwise, read the fixed or variable-length bytes for the declared type, construct the Value, advance the cursor.
- Returns Err(DbError::ParseError) if the buffer is shorter than expected (truncated row — indicates storage corruption).
Example — Encoding a Users Row
Schema: users(id BIGINT, name TEXT, age INT, email TEXT, active BOOL)
Values: [BigInt(42), Text("Alice"), Int(30), Null, Bool(true)]
Step 1: null_bitmap = ceil(5/8) = 1 byte
col 3 (email) is NULL → bit 3 of byte 0 → bitmap = 0b00001000 = 0x08
Step 2: encode non-NULL values:
col 0 (BigInt(42)) → 8 bytes: 2A 00 00 00 00 00 00 00
col 1 (Text("Alice")) → 3 bytes length: 05 00 00
+ 5 bytes payload: 41 6C 69 63 65
col 2 (Int(30)) → 4 bytes: 1E 00 00 00
col 3 (NULL) → 0 bytes (indicated by bitmap)
col 4 (Bool(true)) → 1 byte: 01
Final encoding (22 bytes total):
[08] [2A 00 00 00 00 00 00 00] [05 00 00] [41 6C 69 63 65] [1E 00 00 00] [01]
^ bigint 42 ^len=5 "Alice" int 30 true
bitmap: col 3 is NULL
encoded_len for this row would return 22 without allocating any buffer.
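The layout can be replayed byte for byte. A hand-rolled sketch of this one row (the real codec is the generic encode_row):

```rust
// Encode [BigInt(42), Text("Alice"), Int(30), Null, Bool(true)] by hand.
fn encode_users_row() -> Vec<u8> {
    let mut buf = vec![0b0000_1000u8];               // bitmap: col 3 (email) NULL
    buf.extend_from_slice(&42i64.to_le_bytes());     // BigInt(42), 8 bytes LE
    let name = "Alice";
    buf.extend_from_slice(&(name.len() as u32).to_le_bytes()[..3]); // u24 LE length
    buf.extend_from_slice(name.as_bytes());          // "Alice", 5 bytes
    buf.extend_from_slice(&30i32.to_le_bytes());     // Int(30), 4 bytes LE
    buf.push(0x01);                                  // Bool(true), 1 byte
    buf
}
```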
NaN Constraint
Value::Real(f64::NAN) is a valid Rust value but is forbidden by the codec.
encode_row returns DbError::InvalidValue when it encounters NaN.
This is enforced because:
- SQL semantics require
NaN <> NaNto be UNKNOWN, not FALSE. - Storing NaN in the database would make equality comparisons unpredictable.
- IEEE 754 defines NaN as not-a-number — it is a sentinel, not a data value.
Code that constructs Value::Real must ensure the f64 is not NaN before passing
it to the codec. The executor’s arithmetic operations must propagate NaN as NULL.
Type Coercion (axiomdb-types::coerce)
The axiomdb-types::coerce module implements implicit type conversion. It is
separate from the codec: the codec only serializes well-typed Values; coercion
happens before encoding, at expression evaluation and column assignment time.
Two entry points
coerce(value, target: DataType, mode: CoercionMode) -> Result<Value, DbError>
Used by the executor on INSERT and UPDATE to convert a supplied value to the declared column type. Examples:
- coerce(Text("42"), DataType::Int, Strict) → Ok(Int(42))
- coerce(Int(7), DataType::BigInt, Strict) → Ok(BigInt(7))
- coerce(Date(1), DataType::Timestamp, Strict) → Ok(Timestamp(86_400_000_000))
- coerce(Null, DataType::Int, Strict) → Ok(Null) — NULL always passes through
coerce_for_op(l, r) -> Result<(Value, Value), DbError>
Used by the expression evaluator in eval_binary to promote two operands to a
common type before arithmetic or comparison. Does not accept a
CoercionMode — operator widening is always deterministic and does not attempt
Text→numeric parsing.
- coerce_for_op(Int(5), Real(1.5)) → (Real(5.0), Real(1.5))
- coerce_for_op(Int(2), Decimal(314, 2)) → (Decimal(200, 2), Decimal(314, 2)) — the Int is scaled by 10^scale so it has the same unit as the Decimal mantissa
CoercionMode
pub enum CoercionMode {
    Strict,     // AxiomDB default — '42abc' → INT = error
    Permissive, // MySQL compat — '42abc' → INT = 42 (stops at first non-digit)
}
Complete conversion matrix
The full set of implicit conversions supported by coerce():
| From | To | Rule |
|---|---|---|
| Any | same type | Identity — returned unchanged |
NULL | any | Returns NULL |
Int(n) | BigInt | BigInt(n as i64) — lossless |
Int(n) | Real | Real(n as f64) — may lose precision for large values |
Int(n) | Decimal | Decimal(n, 0) — lossless |
BigInt(n) | Int | Range check: error if n ∉ [i32::MIN, i32::MAX] |
BigInt(n) | Real | Real(n as f64) |
BigInt(n) | Decimal | Decimal(n, 0) |
Text(s) | Int | Parse full string as integer (strict) or leading digits (permissive) |
Text(s) | BigInt | Same as Int but target is i64 |
Text(s) | Real | Parse as f64; NaN/Inf are always rejected |
Text(s) | Decimal | Parse as [-][int][.][frac]; scale = fraction digit count |
Date(d) | Timestamp | d * 86_400_000_000 µs — midnight UTC |
Bool(b) | Int/BigInt/Real | Permissive mode only: true→1, false→0 |
| everything else | DbError::InvalidCoercion (SQLSTATE 22018) |
Text → integer parsing rules in detail
Strict mode (AxiomDB default):
- Strip leading/trailing ASCII whitespace.
- Parse the entire remaining string as a decimal integer (optional leading -/+).
- Any non-digit character after the optional sign → InvalidCoercion.
- Overflow (value does not fit in target type) → InvalidCoercion.
Permissive mode (MySQL compat):
- Strip whitespace.
- Read optional sign.
- Consume as many leading ASCII digit characters as possible.
- If zero digits consumed → return 0 (e.g., "abc" → 0).
- Parse accumulated digits; overflow → InvalidCoercion (not silently clamped).
Date → Timestamp conversion
Date stores days since 1970-01-01 as i32. Timestamp stores microseconds
since 1970-01-01 UTC as i64.
Timestamp = Date × 86_400_000_000
= days × 86400 seconds/day × 1_000_000 µs/second
Day 0 = 1970-01-01T00:00:00Z = Timestamp 0. Negative days produce negative
Timestamps (dates before the Unix epoch). The multiplication uses checked_mul
— overflow is impossible for any plausible calendar date but is handled
defensively.
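The conversion is a single checked multiplication, per the formula above:

```rust
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Date (days since 1970-01-01) → Timestamp (µs since the epoch, UTC).
/// checked_mul returns None on overflow, matching the defensive handling
/// described in the text.
fn date_to_timestamp(days: i32) -> Option<i64> {
    (days as i64).checked_mul(MICROS_PER_DAY)
}
```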
Int → Decimal scale adoption in coerce_for_op
When coerce_for_op promotes an Int or BigInt to match a Decimal, it uses
the Decimal operand’s existing scale so that the result is expressed in the
same unit:
coerce_for_op(Int(5), Decimal(314, 2)):
factor = 10^2 = 100
Int(5) → Decimal(5 × 100, 2) = Decimal(500, 2)
→ (Decimal(500, 2), Decimal(314, 2))
eval_arithmetic(Add, Decimal(500, 2), Decimal(314, 2)):
→ Decimal(814, 2) = 8.14 ✓
Without scale adoption, 5 + 3.14 would compute Decimal(5 + 314, 2) = Decimal(319, 2) = 3.19 — wrong.
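The scaling step reduces to one multiplication, shown here as a standalone sketch (hypothetical helper name):

```rust
/// Promote an integer to a Decimal mantissa at the given scale so both
/// operands share the same unit before arithmetic.
fn int_to_decimal_mantissa(n: i64, scale: u8) -> i128 {
    (n as i128) * 10i128.pow(scale as u32)
}
```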
SQL Parser
The SQL parser lives in axiomdb-sql and is split into three stages:
lexer (string → tokens), parser (tokens → AST), and semantic analyzer
(AST → validated AST with resolved column indices). This page covers the lexer and
parser. The semantic analyzer is documented in Semantic Analyzer.
Why logos, Not nom
AxiomDB uses the logos crate to generate the lexer, rather than nom combinators
or hand-written code.
| Criterion | logos | nom |
|---|---|---|
| Compilation model | Compiles patterns to DFA at build time | Constructs parsers at runtime |
| Token scan cost | O(n), 1–3 instructions/byte | O(n), higher constant factor |
| Heap allocations | Zero (identifiers are &'src str) | Possible in combinators |
| Case-insensitive keys | ignore(ascii_case) attribute | Manual lowercasing pass needed |
| Error messages | Byte offsets built-in | Requires manual tracking |
Benchmark result: AxiomDB’s lexer achieves 9–17× higher throughput than
sqlparser-rs (which uses nom internally) for the same SQL inputs. The advantage
holds across simple SELECT, complex multi-join SELECT, and DDL statements.
sqlparser-rs is the SQL parser used by Apache Arrow DataFusion, Delta Lake, and InfluxDB. The advantage is structural: logos compiles all keyword patterns into a single Deterministic Finite Automaton at build time, so processing each character is one lookup in a pre-computed transition matrix — constant time per character with a very small constant. nom combinators perform dynamic dispatch and allocate intermediate results at each combinator step.
Lexer Design
Zero-Copy Tokens
Identifiers and quoted identifiers are represented as &'src str — slices into the
original SQL string. No heap allocation occurs during lexing for identifiers.
Only StringLit allocates a String, because escape sequence processing (\', \\,
\n) transforms the content in place and cannot be zero-copy.
#![allow(unused)]
fn main() {
pub struct SpannedToken<'src> {
pub token: Token<'src>,
pub span: Span, // byte offsets (start, end) in the original string
}
}
The lifetime 'src ensures that token slices cannot outlive the input string.
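The guarantee can be demonstrated in a minimal standalone sketch. The simplified Token, Span, and lex_ident below are illustrative stand-ins, not AxiomDB's actual definitions — the point is that the borrow checker ties every token to the SQL string it was lexed from:

```rust
// Hypothetical simplified types: tokens borrow from the SQL string,
// so the string must outlive them.
struct Span { start: usize, end: usize }

enum Token<'src> {
    Ident(&'src str), // zero-copy slice into the input
    Eof,
}

struct SpannedToken<'src> { token: Token<'src>, span: Span }

// Lex a single leading identifier without allocating.
fn lex_ident(sql: &str) -> SpannedToken<'_> {
    let end = sql.find(' ').unwrap_or(sql.len());
    SpannedToken { token: Token::Ident(&sql[..end]), span: Span { start: 0, end } }
}

fn main() {
    let sql = String::from("users WHERE id = 1");
    let tok = lex_ident(&sql);
    if let Token::Ident(name) = tok.token {
        assert_eq!(name, "users"); // points into `sql`, no heap copy
    }
    assert_eq!((tok.span.start, tok.span.end), (0, 5));
    // drop(sql); // would not compile while `tok` is still alive
}
```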
Token Enum
The Token<'src> enum has approximately 85 variants:
#![allow(unused)]
fn main() {
pub enum Token<'src> {
// DML keywords (case-insensitive)
Select, From, Where, Insert, Into, Values, Update, Set, Delete,
// DDL keywords
Create, Database, Databases, Table, Index, Drop, Alter, Add, Column, Constraint,
// Transaction keywords
Begin, Commit, Rollback, Savepoint, Release,
// Session / introspection
Use,
// Data types
Bool, Boolean, TinyInt, SmallInt, Int, Integer, BigInt, HugeInt,
Real, Float, Double, Decimal, Numeric, Char, VarChar, Text, Bytea, Blob,
Date, Time, Timestamp, Uuid, Json, Jsonb, Vector,
// Clause keywords
Join, Inner, Left, Right, Cross, On, Using,
Group, By, Having, Order, Asc, Desc, Nulls, First, Last,
Limit, Offset, Distinct, All,
// Constraint keywords
Primary, Key, Unique, Not, Null, Default, References, Check,
Auto, Increment, Serial, Bigserial, Foreign, Cascade, Restrict, NoAction,
// Logical operators
And, Or,
// Functions
Is, In, Between, Like, Ilike, Exists, Case, When, Then, Else, End,
Coalesce, NullIf,
// Identifier variants
Ident(&'src str), // unquoted identifier
QuotedIdent(&'src str), // backtick-quoted `identifier`
DqIdent(&'src str), // double-quote "identifier"
// Literals
IntLit(i64), FloatLit(f64), StringLit(String), HexLit(Vec<u8>),
TrueLit, FalseLit, NullLit,
// Punctuation
LParen, RParen, Comma, Semicolon, Dot, Star, Eq, Ne, Lt, Le, Gt, Ge,
Plus, Minus, Slash, Percent, Bang, BangEq, Arrow, FatArrow,
// Sentinel
Eof,
}
}
Keyword Priority Over Identifiers
logos resolves ambiguities by matching keywords before identifiers. The rule is:
longer matches take priority; if lengths are equal, keywords take priority over
Ident. This is expressed in logos as:
#![allow(unused)]
fn main() {
#[token("SELECT", ignore(ascii_case))]
Select,
#[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
Ident(&'src str),
}
SELECT, select, and Select all produce Token::Select, not Token::Ident.
A hypothetical column named select must be escaped: `select` or "select".
Comment Stripping
All three MySQL-compatible comment styles are skipped automatically:
-- single-line comment (SQL standard)
# single-line comment (MySQL extension)
/* block comment */
Fail-Fast Limits
tokenize(sql, max_bytes) checks the SQL length before scanning. If sql.len() > max_bytes,
it returns DbError::ParseError immediately without touching the DFA. This protects
against memory exhaustion from maliciously large queries.
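A minimal sketch of that guard, under the assumption of a simplified signature (the real tokenize returns logos tokens, not whitespace-split strings):

```rust
// Sketch of the pre-scan length guard. The token type is a stand-in.
#[derive(Debug)]
enum DbError {
    ParseError { message: String, position: Option<usize> },
}

fn tokenize(sql: &str, max_bytes: usize) -> Result<Vec<String>, DbError> {
    // Reject oversized input before the DFA ever runs.
    if sql.len() > max_bytes {
        return Err(DbError::ParseError {
            message: format!("query exceeds {max_bytes} bytes"),
            position: None,
        });
    }
    // Stand-in for the real logos scan.
    Ok(sql.split_whitespace().map(str::to_string).collect())
}

fn main() {
    assert!(tokenize("SELECT 1", 1024).is_ok());
    assert!(tokenize(&"x".repeat(2000), 1024).is_err());
}
```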
Parser Design
The parser is a hand-written recursive descent parser. It does not use any parser combinator library — the grammar is simple enough that combinators would add overhead without benefit.
Parser State
#![allow(unused)]
fn main() {
struct Parser<'src> {
tokens: Vec<SpannedToken<'src>>,
pos: usize,
}
impl<'src> Parser<'src> {
fn peek(&self) -> &Token<'src>; // current token, no advance
fn advance(&mut self) -> &Token<'src>; // consume and return current token
fn expect(&mut self, t: &Token) -> Result<(), DbError>; // consume or error
fn eat(&mut self, t: &Token) -> bool; // consume if matching, else false
}
}
Grammar — LL(1) for DDL, LL(2) for DML
Most DDL productions are LL(1): the first token uniquely determines the production. Some DML productions require one lookahead token:
- SELECT * FROM t vs SELECT a, b FROM t — the parser sees SELECT, then peeks at the next token to decide whether to parse * or a projection list.
- INSERT INTO t VALUES (...) vs INSERT INTO t SELECT ... — after consuming INTO t, a peek determines whether to parse a VALUES list or a sub-SELECT.
Expression Precedence
The expression sub-parser implements the standard precedence chain using separate functions for each precedence level. This is equivalent to a Pratt parser without the extra machinery:
parse_expr() (entry point — calls parse_or)
parse_or() OR
parse_and() AND
parse_not() unary NOT
parse_is_null() IS NULL / IS NOT NULL
parse_predicate() =, <>, !=, <, <=, >, >=, BETWEEN, LIKE, IN
parse_addition() + and -
parse_multiplication() *, /, %
parse_unary() unary minus -x
parse_atom() literal, column ref, function call, subexpr
Each level calls the next level to parse its right-hand side, naturally implementing left-to-right associativity and the correct precedence hierarchy.
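The pattern can be illustrated on a toy two-level grammar. This is a hypothetical simplified parser over single-digit arithmetic, not AxiomDB's actual code — it shows how each level loops on its own operators and delegates to the tighter-binding level:

```rust
// Toy two-level precedence parser: + and - over * and /.
struct P<'a> { toks: &'a [char], pos: usize }

impl<'a> P<'a> {
    fn peek(&self) -> Option<char> { self.toks.get(self.pos).copied() }

    // Lowest level shown: loops on + and -, delegates to multiplication.
    fn parse_addition(&mut self) -> i64 {
        let mut lhs = self.parse_multiplication();
        while matches!(self.peek(), Some('+') | Some('-')) {
            let op = self.toks[self.pos];
            self.pos += 1;
            let rhs = self.parse_multiplication();
            lhs = if op == '+' { lhs + rhs } else { lhs - rhs };
        }
        lhs
    }

    // Tighter level: loops on * and /, delegates to atoms.
    fn parse_multiplication(&mut self) -> i64 {
        let mut lhs = self.parse_atom();
        while matches!(self.peek(), Some('*') | Some('/')) {
            let op = self.toks[self.pos];
            self.pos += 1;
            let rhs = self.parse_atom();
            lhs = if op == '*' { lhs * rhs } else { lhs / rhs };
        }
        lhs
    }

    fn parse_atom(&mut self) -> i64 {
        let c = self.toks[self.pos];
        self.pos += 1;
        c.to_digit(10).unwrap() as i64
    }
}

fn main() {
    let toks: Vec<char> = "2+3*4".chars().collect();
    let mut p = P { toks: &toks, pos: 0 };
    assert_eq!(p.parse_addition(), 14); // * binds tighter than +

    let toks: Vec<char> = "8-3-2".chars().collect();
    let mut p = P { toks: &toks, pos: 0 };
    assert_eq!(p.parse_addition(), 3); // left-associative: (8-3)-2
}
```

The loop inside each level is what makes the grammar left-associative: the left operand accumulates while the right operand always comes from the next-tighter level.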
DDL Grammar Sketch
stmt → select_stmt | insert_stmt | update_stmt | delete_stmt
| create_database_stmt | drop_database_stmt | use_stmt
| create_table_stmt | create_index_stmt
| drop_table_stmt | drop_index_stmt
| alter_table_stmt | truncate_stmt
| show_tables_stmt | show_databases_stmt | show_columns_stmt
| begin_stmt | commit_stmt | rollback_stmt | savepoint_stmt
create_database_stmt →
CREATE DATABASE ident
drop_database_stmt →
DROP DATABASE [IF EXISTS] ident
use_stmt →
USE ident
create_table_stmt →
CREATE TABLE [IF NOT EXISTS] ident
LPAREN column_def_list [COMMA table_constraint_list] RPAREN
column_def →
ident type_name [column_constraint...]
column_constraint →
NOT NULL
| DEFAULT expr
| PRIMARY KEY
| UNIQUE
| AUTO_INCREMENT | SERIAL | BIGSERIAL
| REFERENCES ident LPAREN ident RPAREN [on_action] [on_action]
| CHECK LPAREN expr RPAREN
table_constraint →
PRIMARY KEY LPAREN ident_list RPAREN
| UNIQUE LPAREN ident_list RPAREN
| FOREIGN KEY LPAREN ident_list RPAREN REFERENCES ident LPAREN ident_list RPAREN
| CHECK LPAREN expr RPAREN
| CONSTRAINT ident (primary_key | unique | foreign_key | check)
truncate_stmt →
TRUNCATE TABLE ident
show_tables_stmt →
SHOW TABLES [FROM ident]
show_databases_stmt →
SHOW DATABASES
show_columns_stmt →
SHOW COLUMNS FROM ident
| DESCRIBE ident
| DESC ident
The grammar now includes CREATE/DROP DATABASE, USE, and
SHOW DATABASES, but it still rejects database.schema.table.
MySQL allows a database qualifier directly in table references; AxiomDB intentionally
deferred that grammar until the analyzer and executor can honor it end-to-end instead
of shipping a misleading parser-only approximation.
SHOW / DESCRIBE Parsing
SHOW is a dedicated keyword (Token::Show). After consuming it, the parser
peeks at the next token to dispatch:
parse_show():
consume Show
if peek = Databases:
advance
return Stmt::ShowDatabases(ShowDatabasesStmt)
  if peek = Ident("TABLES") | Ident("tables"):   // TABLES is not a reserved keyword
advance
schema = if eat(From): parse_ident() else "public"
return Stmt::ShowTables(ShowTablesStmt { schema })
if peek = Ident("COLUMNS") | Ident("columns"):
advance; expect(From); table = parse_ident()
return Stmt::ShowColumns(ShowColumnsStmt { table_name: table })
else:
return Err(ParseError { "expected TABLES, DATABASES, or COLUMNS after SHOW" })
DESCRIBE and DESC are both tokenized as Token::Describe (the lexer
aliases both spellings to the same token). The parser dispatches them directly
to the ShowColumns AST node:
parse_stmt():
...
Token::Describe => {
advance; table = parse_ident()
return Stmt::ShowColumns(ShowColumnsStmt { table_name: table })
}
...
COLUMNS is not a reserved keyword in AxiomDB — a column or table named
columns does not need quoting. The parser matches it by comparing the
identifier string after lowercasing, not by token variant.
TRUNCATE Parsing
TRUNCATE is tokenized as Token::Truncate. After consuming it, the parser
expects the literal keyword TABLE (also a reserved token) and then the table
name:
parse_truncate():
consume Truncate
expect(Table)
table_name = parse_ident()
return Stmt::Truncate(TruncateTableStmt { table_name })
SELECT Grammar Sketch
select_stmt →
SELECT [DISTINCT] select_list
FROM table_ref [join_clause...]
[WHERE expr]
[GROUP BY expr_list]
[HAVING expr]
[ORDER BY order_item_list]
[LIMIT int_lit [OFFSET int_lit]]
select_list → STAR | select_item (COMMA select_item)*
select_item → expr [AS ident]
table_ref → ident [AS ident]
join_clause →
[INNER | LEFT [OUTER] | RIGHT [OUTER] | CROSS]
JOIN table_ref join_condition
join_condition → ON expr | USING LPAREN ident_list RPAREN
order_item → expr [ASC | DESC] [NULLS (FIRST | LAST)]
Subquery Parsing
Subqueries are parsed at three different points in the expression grammar, each corresponding to a different syntactic form.
Scalar Subqueries — parse_atom
parse_atom is the lowest-precedence entry point for all atoms: literals, column
references, function calls, and parenthesised expressions. When parse_atom
encounters an LParen, it peeks at the next token. If it is Select, it parses
a full select_stmt recursively and wraps it in Expr::Subquery(Box<SelectStmt>).
Otherwise, it parses the contents as a grouped expression (expr).
parse_atom():
if peek = LParen:
if peek+1 = Select:
advance; stmt = parse_select_stmt(); expect(RParen)
return Expr::Subquery(stmt)
else:
advance; e = parse_expr(); expect(RParen)
return e
...
This means (SELECT MAX(id) FROM t) is valid anywhere an expression is valid:
SELECT list, WHERE, HAVING, ORDER BY, even nested inside function calls.
IN Subquery — parse_predicate
parse_predicate handles comparison operators and the IN / NOT IN forms.
After detecting the In or Not In tokens, the parser checks whether the next
token is LParen followed by Select. If so, it parses a subquery and produces
Expr::InSubquery { expr, subquery, negated }. If not, it falls through to the
normal IN (val1, val2, ...) list form.
parse_predicate():
lhs = parse_addition()
if peek = Not:
advance; expect(In); negated = true
else if peek = In:
advance; negated = false
else: return lhs // comparison ops handled here too
expect(LParen)
if peek = Select:
stmt = parse_select_stmt(); expect(RParen)
return Expr::InSubquery { expr: lhs, subquery: stmt, negated }
else:
values = parse_expr_list(); expect(RParen)
return Expr::InList { expr: lhs, values, negated }
EXISTS / NOT EXISTS — parse_not
parse_not handles unary NOT. When the parser sees Exists (or Not Exists),
it consumes the token, expects LParen, recursively parses a select_stmt, and
returns Expr::Exists { subquery, negated }. The result is always boolean — the
SELECT list contents are irrelevant at the execution level.
parse_not():
if peek = Not:
advance
if peek = Exists:
advance; expect(LParen); stmt = parse_select_stmt(); expect(RParen)
return Expr::Exists { subquery: stmt, negated: true }
else:
return Expr::Not(parse_is_null())
if peek = Exists:
advance; expect(LParen); stmt = parse_select_stmt(); expect(RParen)
return Expr::Exists { subquery: stmt, negated: false }
return parse_is_null()
Derived Tables — parse_table_ref
parse_table_ref parses the FROM clause. When it encounters LParen (without
a prior identifier), it recursively parses a select_stmt, expects RParen, and
then requires an AS alias clause (the alias is mandatory for derived tables):
parse_table_ref():
if peek = LParen:
advance; stmt = parse_select_stmt(); expect(RParen)
expect(As); alias = parse_ident()
return TableRef::Derived { subquery: stmt, alias }
else:
name = parse_ident(); alias = optional AS ident
return TableRef::Named { name, alias }
AST Nodes for Subqueries
#![allow(unused)]
fn main() {
pub enum Expr {
// A scalar subquery — returns one value (or NULL if no rows)
Subquery(Box<SelectStmt>),
// IN (SELECT ...) or NOT IN (SELECT ...)
InSubquery {
expr: Box<Expr>,
subquery: Box<SelectStmt>,
negated: bool,
},
// EXISTS (SELECT ...) or NOT EXISTS (SELECT ...)
Exists {
subquery: Box<SelectStmt>,
negated: bool,
},
// Outer column reference (used inside correlated subqueries)
OuterColumn {
col_idx: usize,
depth: u32, // 1 = immediate outer query
},
// ... other variants unchanged
}
pub enum TableRef {
Named { name: String, alias: Option<String> },
Derived { subquery: Box<SelectStmt>, alias: String },
}
}
Correlated Column Resolution — Semantic Analyzer
Correlated subqueries introduce Expr::OuterColumn during semantic analysis
(analyze()), not during parsing. The semantic analyzer maintains a stack of
BindContext frames, one per query level. When a column reference inside a
subquery cannot be resolved against the inner context, the analyzer walks up the
stack and resolves it against the outer context, replacing the Expr::Column
with Expr::OuterColumn { col_idx, depth: 1 }.
This means the parser always produces Expr::Column for every column reference,
regardless of nesting depth; OuterColumn only appears in the analyzed AST, never
in the raw parse output. This keeps the parser stateless and context-free, while
the semantic analyzer's BindContext stack resolves references with full schema
knowledge. It is the same split used by PostgreSQL's parser/analyzer boundary:
the parser builds a syntactic tree; the analyzer attaches semantic meaning
(column indices, correlated references, type information).
Output — The AST
The parser returns a Stmt enum. After parsing, all Expr::Column nodes have
col_idx = 0 as a placeholder. The semantic analyzer fills in the correct indices.
#![allow(unused)]
fn main() {
pub enum Stmt {
Select(SelectStmt),
Insert(InsertStmt),
Update(UpdateStmt),
Delete(DeleteStmt),
CreateTable(CreateTableStmt),
CreateIndex(CreateIndexStmt),
DropTable(DropTableStmt),
DropIndex(DropIndexStmt),
AlterTable(AlterTableStmt),
Truncate(TruncateTableStmt),
Begin, Commit, Rollback,
Savepoint(String),
ReleaseSavepoint(String),
RollbackToSavepoint(String),
ShowTables(ShowTablesStmt),
ShowColumns(ShowColumnsStmt),
}
}
Scalar Function Evaluator (eval/)
The expression evaluator now lives under crates/axiomdb-sql/src/eval/, rooted
at eval/mod.rs. The facade keeps the same exported surface (eval,
eval_with, eval_in_session, eval_with_in_session, is_truthy,
like_match, CollationGuard, SubqueryRunner), but the implementation is
split by responsibility:
- context.rs — thread-local session collation, CollationGuard, and SubqueryRunner
- core.rs — recursive Expr evaluation, CASE dispatch, and subquery-aware paths
- ops.rs — boolean logic, comparisons, IN, LIKE, and truthiness helpers
- functions/ — built-ins grouped by family (system, nulls, numeric, string, datetime, binary, uuid)
Built-in function dispatch still happens by lowercased name inside
functions/mod.rs. The registry remains a single match arm: no hash map and
no dynamic dispatch.
Date / Time Functions (4.19d)
Four internal helpers drive the MySQL-compatible date functions:
#![allow(unused)]
fn main() {
// Converts Value::Timestamp(micros_since_epoch) to NaiveDateTime.
// Uses Euclidean division for correct sub-second handling of pre-epoch timestamps.
fn micros_to_ndt(micros: i64) -> NaiveDateTime
// Converts Value::Date(days_since_epoch) to NaiveDate.
fn days_to_ndate(days: i32) -> NaiveDate
// Formats NaiveDateTime using MySQL-style format specifiers.
// Maps specifiers manually — NOT via chrono's format strings — to guarantee
// exact MySQL semantics (e.g. chrono's %m has different behavior).
fn date_format_str(ndt: NaiveDateTime, fmt: &str) -> String
// Parses a string into NaiveDateTime + a has_time flag.
// Returns None on any failure (caller maps to Value::Null).
fn str_to_date_inner(s: &str, fmt: &str) -> Option<(NaiveDateTime, bool)>
}
DATE_FORMAT arm — evaluates both args, dispatches ts on type:
ts: Timestamp(micros) → micros_to_ndt → NaiveDateTime
ts: Date(days) → days_to_ndate → NaiveDate.and_time(MIN) → NaiveDateTime
ts: Text(s) → try "%Y-%m-%d %H:%i:%s" then "%Y-%m-%d" via str_to_date_inner
ts: NULL → return NULL immediately
STR_TO_DATE arm — calls str_to_date_inner and converts back to a Value:
has_time = true → Value::Timestamp((ndt - epoch).num_microseconds())
has_time = false → Value::Date((ndt.date() - epoch).num_days() as i32)
failure → Value::Null
The epoch used for both conversions is always NaiveDate(1970-01-01) 00:00:00
constructed with from_ymd_opt(1970,1,1).unwrap().and_hms_opt(0,0,0).unwrap().
This avoids any DateTime<Utc> and is stable across all chrono 0.4.x versions.
str_to_date_inner processes the format string character by character:
- Literal characters: must match verbatim in the input (returns None on mismatch).
- %Y: consume exactly 4 digits.
- %y: consume 1–2 digits; apply the MySQL 2-digit rule (< 70 → +2000, else +1900).
- %m, %c, %d, %e, %H, %h, %i, %s/%S: consume 1–2 digits.
- Unknown specifier: skip one character in the input string.
- After parsing: validate with NaiveDate::from_ymd_opt + NaiveTime::from_hms_opt (catches invalid dates such as Feb 30).
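The two-digit year rule can be isolated as a standalone helper (hypothetical function name, implementing exactly the rule stated above):

```rust
// MySQL %y rule: years < 70 map to 20xx, years >= 70 map to 19xx.
fn expand_two_digit_year(yy: u32) -> u32 {
    if yy < 70 { 2000 + yy } else { 1900 + yy }
}

fn main() {
    assert_eq!(expand_two_digit_year(24), 2024);
    assert_eq!(expand_two_digit_year(69), 2069); // last year on the 2000 side
    assert_eq!(expand_two_digit_year(70), 1970); // first year on the 1900 side
    assert_eq!(expand_two_digit_year(99), 1999);
}
```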
take_digits(s, max) — helper used by the parser:
#![allow(unused)]
fn main() {
fn take_digits(s: &str, max: usize) -> Option<(u32, &str)> {
let n = s.bytes().take(max).take_while(|b| b.is_ascii_digit()).count();
if n == 0 { return None; }
let val: u32 = s[..n].parse().ok()?;
Some((val, &s[n..]))
}
}
Uses byte positions (safe for all ASCII date strings) and avoids allocations.
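The helper can be exercised standalone to confirm the max-digit clamp and the remainder slice it hands back to the next specifier:

```rust
// take_digits as shown above, with a small driver for the edge cases.
fn take_digits(s: &str, max: usize) -> Option<(u32, &str)> {
    let n = s.bytes().take(max).take_while(|b| b.is_ascii_digit()).count();
    if n == 0 { return None; }
    let val: u32 = s[..n].parse().ok()?;
    Some((val, &s[n..]))
}

fn main() {
    // %Y: up to 4 digits, remainder returned for the next specifier.
    assert_eq!(take_digits("2024-01-15", 4), Some((2024, "-01-15")));
    // %m with a single digit: stops at the first non-digit.
    assert_eq!(take_digits("7-04", 2), Some((7, "-04")));
    // No leading digit at all → None (caller maps this to parse failure).
    assert_eq!(take_digits("-04", 2), None);
}
```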
GROUP_CONCAT Parsing
GROUP_CONCAT cannot be represented as a plain Expr::Function { args: Vec<Expr> } because
its interior grammar — [DISTINCT] expr [ORDER BY ...] [SEPARATOR 'str'] — is not a
standard argument list. It gets its own AST variant and a dedicated parser branch.
The Expr::GroupConcat Variant
#![allow(unused)]
fn main() {
pub enum Expr {
// ...
GroupConcat {
expr: Box<Expr>,
distinct: bool,
order_by: Vec<(Expr, SortOrder)>,
separator: String, // defaults to ","
},
}
}
The variant stores the sub-expression to concatenate, the deduplication flag, an ordered
list of (sort_key_expr, direction) pairs, and the separator string.
Token::Separator — Disambiguating the Keyword
SEPARATOR is not a reserved word in standard SQL, so the lexer could produce either
Token::Ident("SEPARATOR") or a dedicated Token::Separator. AxiomDB uses the
dedicated token so that the ORDER BY loop inside parse_group_concat can stop cleanly:
#![allow(unused)]
fn main() {
// In the ORDER BY loop — stop if we see SEPARATOR or closing paren
if matches!(p.peek(), Token::Separator | Token::RParen) {
break;
}
}
Without the dedicated token, the parser would need to look ahead through a comma and an identifier to decide whether the comma ends the ORDER BY clause or separates two sort keys.
parse_group_concat — The Parser Branch
Invoked when parse_ident_or_call encounters group_concat (case-insensitive):
parse_group_concat:
consume '('
if DISTINCT: set distinct=true, advance
parse_expr() → sub-expression
if ORDER BY:
loop:
parse_expr() → sort key
optional ASC|DESC → direction
if peek == SEPARATOR or RParen: break
else: consume ','
if SEPARATOR:
consume SEPARATOR
consume StringLit(s) → separator string
consume ')'
return Expr::GroupConcat { expr, distinct, order_by, separator }
string_agg — PostgreSQL Alias
string_agg(expr, separator_literal) is parsed in the same branch with simplified
logic: two arguments separated by a comma, the second being a string literal that
becomes the separator field. distinct is false and order_by is empty.
-- These are equivalent:
SELECT GROUP_CONCAT(name SEPARATOR ', ') FROM t;
SELECT string_agg(name, ', ') FROM t;
Aggregate Execution in the Executor
At execution time, Expr::GroupConcat is handled by an AggAccumulator::GroupConcat
variant. Each row accumulates (value_string, sort_key_values). At finalize:
- Sort by the order_by key vector using compare_values_null_last — a type-aware comparator that sorts integers numerically and text lexicographically.
- If DISTINCT: deduplicate by value string.
- Join with the separator; truncate at 1 MB.
- Return Value::Null if no non-NULL values were accumulated.
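The finalize sequence can be sketched in simplified form. This uses a single i64 sort key instead of a key vector and a naive linear-scan dedupe — the names and shapes are assumptions, not the real accumulator:

```rust
// Simplified finalize: sort by key, optional dedupe, join, cap, NULL on empty.
fn finalize_group_concat(
    mut rows: Vec<(String, i64)>, // (value_string, sort_key)
    distinct: bool,
    separator: &str,
) -> Option<String> {
    if rows.is_empty() {
        return None; // maps to Value::Null
    }
    rows.sort_by_key(|&(_, k)| k); // step 1: sort by the ORDER BY key
    let mut out: Vec<String> = Vec::new();
    for (v, _) in rows {
        if !distinct || !out.contains(&v) { // step 2: optional dedupe
            out.push(v);
        }
    }
    let mut s = out.join(separator); // step 3: join with the separator
    if s.len() > 1 << 20 {
        s.truncate(1 << 20); // byte cap; real code must respect char boundaries
    }
    Some(s)
}

fn main() {
    let rows = vec![("b".to_string(), 2), ("a".to_string(), 1), ("b".to_string(), 3)];
    assert_eq!(finalize_group_concat(rows.clone(), false, ","), Some("a,b,b".into()));
    assert_eq!(finalize_group_concat(rows, true, ","), Some("a,b".into()));
    assert_eq!(finalize_group_concat(vec![], false, ","), None);
}
```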
GROUP_CONCAT syntax is structurally different from a regular function call:
it embeds its own ORDER BY and uses a keyword (SEPARATOR) as a
positional argument delimiter. Forcing it into Expr::Function { args } would
require post-parse AST surgery to extract the separator and ORDER BY. A dedicated variant
keeps parsing and execution logic clean and makes semantic analysis and partial-index rejection straightforward.
Error Reporting
ParseError — structured position field
Parse errors carry a dedicated position field (0-based byte offset of the unexpected token):
#![allow(unused)]
fn main() {
DbError::ParseError {
message: "SQL syntax error: unexpected token 'FORM'".to_string(),
position: Some(9), // byte 9 in "SELECT * FORM t"
}
}
The position field is populated from SpannedToken::span.start at every error site in the parser.
Non-parser code that constructs ParseError (e.g. codec validation, catalog checks) sets position: None.
Visual snippet in MySQL ERR packets
When the MySQL handler sends an ERR packet for a parse error, it builds a 2-line visual snippet:
You have an error in your SQL syntax: unexpected token 'FORM'
SELECT * FORM t
^
The snippet is generated by build_error_snippet(sql, pos) in mysql/error.rs:
- Find the line containing pos (line_start = last \n before pos, line_end = next \n).
- Clamp the line to 120 characters to avoid overwhelming terminal output.
- Compute col = pos - line_start and emit " ".repeat(col) + "^" on the second line.
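Those three steps can be sketched as follows. The function name comes from the text, but the body is an assumption (and this version clamps by characters rather than bytes):

```rust
// Sketch of the snippet builder following the three steps above.
fn build_error_snippet(sql: &str, pos: usize) -> String {
    // Step 1: find the line containing `pos`.
    let line_start = sql[..pos].rfind('\n').map_or(0, |i| i + 1);
    let line_end = sql[pos..].find('\n').map_or(sql.len(), |i| pos + i);
    // Step 2: clamp the line to 120 characters.
    let line: String = sql[line_start..line_end].chars().take(120).collect();
    // Step 3: caret under the offending column.
    let col = pos - line_start;
    format!("{}\n{}^", line, " ".repeat(col))
}

fn main() {
    // Byte 9 of "SELECT * FORM t" is the 'F' of the typo.
    let snippet = build_error_snippet("SELECT * FORM t", 9);
    assert_eq!(snippet, format!("SELECT * FORM t\n{}^", " ".repeat(9)));
}
```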
The snippet is appended only when sql is available (COM_QUERY path). Prepared statement
execution errors (COM_STMT_EXECUTE) receive only the plain message.
JSON error format
When error_format = 'json' is active on the connection, the MySQL ERR packet message is
replaced with a JSON string carrying the full ErrorResponse:
{"code":1064,"sqlstate":"42601","severity":"ERROR","message":"SQL syntax error: unexpected token 'FORM'","position":9}
The JSON is built by build_json_error(e, sql) in mysql/json_error.rs. It uses the
ErrorResponse::from_error(e) struct for clean, snippet-free fields (the visual snippet is
text-protocol-only). The JsonErrorPayload struct lives in axiomdb-network to avoid
adding serde as a dependency to axiomdb-core.
axiomdb-core defines DbError and ErrorResponse with no
serde dependency. The JSON payload is assembled in axiomdb-network using
a private #[derive(Serialize)] JsonErrorPayload struct. This keeps the core crate
free of serialization complexity and means error types never accidentally get serialized
somewhere they shouldn't.
Lexer errors (invalid characters, unterminated string literals) include the byte span
of the problematic token via the same position field.
Performance Numbers
Measured on Apple M2 Pro, single-threaded, 1 million iterations each:
| Query | Throughput (logos lexer + parser) |
|---|---|
| SELECT * FROM t | 492 ns / query → 2.0M queries/s |
| SELECT a, b, c FROM t WHERE id = 1 | 890 ns / query → 1.1M queries/s |
| Complex SELECT (3 JOINs, subquery) | 2.7 µs / query → 370K queries/s |
| CREATE TABLE (10 columns) | 1.1 µs / query → 910K queries/s |
| INSERT ... VALUES (...) (5 values) | 680 ns / query → 1.5M queries/s |
These numbers represent parse throughput only — before semantic analysis or execution. At 2 million simple queries per second, parsing is never the bottleneck for OLTP workloads at realistic connection concurrency.
Semantic Analyzer
The semantic analyzer is the stage between parsing and execution. The parser produces
an AST where every column reference has col_idx = 0 as a placeholder. The analyzer:
- Validates all table and column names against the catalog.
- Resolves each col_idx to the correct position in the combined row produced by the FROM and JOIN clauses.
- Reports structured errors for unknown tables, unknown columns, and ambiguous unqualified column names.
- Applies the current database + schema defaults before unqualified table resolution.
The public compatibility entry point is:
#![allow(unused)]
fn main() {
analyze(stmt, storage, snapshot) -> Result<Stmt, DbError>
}
Internally, the multi-database-aware entry point is:
#![allow(unused)]
fn main() {
analyze_with_defaults(stmt, storage, snapshot, default_database, default_schema)
}
The compatibility wrapper currently uses ("axiomdb", "public").
BindContext — Resolution State
BindContext is built from the FROM and JOIN clauses of a SELECT before any column
reference is resolved.
#![allow(unused)]
fn main() {
struct BindContext {
tables: Vec<BoundTable>,
}
struct BoundTable {
alias: Option<String>, // FROM users AS u → alias = Some("u")
name: String, // real table name in the catalog
columns: Vec<ColumnDef>, // columns in declaration order (from CatalogReader)
col_offset: usize, // start position in the combined row
}
}
Building the BindContext
Each table in the FROM clause is added in left-to-right order. The col_offset
of each table is the sum of the column counts of all tables added before it.
FROM users u JOIN orders o ON u.id = o.user_id
Table 1: users (4 columns: id, name, age, email) → col_offset = 0
Table 2: orders (4 columns: id, user_id, total, status) → col_offset = 4
Combined row layout:
col 0 u.id
col 1 u.name
col 2 u.age
col 3 u.email
col 4 o.id
col 5 o.user_id
col 6 o.total
col 7 o.status
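The left-to-right offset computation can be sketched as follows (a simplified BoundTable without alias or catalog wiring; an illustration, not the real builder):

```rust
// Each table's col_offset is the sum of the column counts before it.
struct BoundTable {
    name: String,
    columns: Vec<String>,
    col_offset: usize, // start position in the combined row
}

fn build_context(tables: &[(&str, &[&str])]) -> Vec<BoundTable> {
    let mut out = Vec::new();
    let mut offset = 0;
    for (name, cols) in tables {
        out.push(BoundTable {
            name: name.to_string(),
            columns: cols.iter().map(|c| c.to_string()).collect(),
            col_offset: offset,
        });
        offset += cols.len(); // next table starts after this one's columns
    }
    out
}

fn main() {
    let users: &[&str] = &["id", "name", "age", "email"];
    let orders: &[&str] = &["id", "user_id", "total", "status"];
    let ctx = build_context(&[("users", users), ("orders", orders)]);
    assert_eq!(ctx[0].col_offset, 0);
    assert_eq!(ctx[1].col_offset, 4); // so o.total resolves to 4 + 2 = 6
    assert_eq!(ctx[1].name, "orders");
    assert_eq!(ctx[1].columns.len(), 4);
}
```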
Database-Scoped Resolution
Table lookup is now keyed by:
(database, schema, table)
The analyzer threads default_database into every catalog lookup and recursive
subquery analysis. For session-driven execution, that default comes from
SessionContext::effective_database().
Legacy compatibility rule:
if a table has no explicit database binding:
it belongs to axiomdb
So identical SQL text can resolve differently depending on the selected database:
USE analytics;
SELECT * FROM users;
USE axiomdb;
SELECT * FROM users;
MySQL allows DATABASE() to be NULL before any explicit
selection, but AxiomDB still had to keep legacy unqualified table names working.
The analyzer therefore resolves against an effective database with fallback
axiomdb, while the session separately tracks whether the user explicitly
selected a database.
Column Resolution Algorithm
Given a column reference (qualifier, name) from the AST:
Qualified Reference (u.email)
- Find the BoundTable whose alias or name matches qualifier. If no table matches: DbError::TableNotFound { name: qualifier }.
- Within that table’s columns, find the column whose name matches name. If not found: DbError::ColumnNotFound { table: qualifier, column: name }.
- Return col_offset + column_position_within_table.
u.email → users.col_offset (0) + position of "email" in users (3) = 3
o.total → orders.col_offset (4) + position of "total" in orders (2) = 6
Unqualified Reference (name only)
- Search all tables in BindContext for a column named name.
- Collect all matches across all tables.
- If 0 matches: DbError::ColumnNotFound.
- If 1 match: return the resolved col_idx.
- If 2+ matches: DbError::AmbiguousColumn { column: name, candidates: [...] }.
-- Unambiguous: only users has 'name'
SELECT name FROM users JOIN orders ON ...
-- Ambiguous: both users and orders have 'id'
SELECT id FROM users JOIN orders ON ...
-- ERROR 42702: column reference "id" is ambiguous
-- (appears in: users.id, orders.id)
-- Fix: qualify the reference
SELECT users.id FROM users JOIN orders ON ...
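The unqualified-resolution algorithm above can be sketched in Rust (simplified table tuples stand in for the real BindContext; names are assumptions):

```rust
// Resolve an unqualified column: 0 hits → not found, 1 hit → col_idx,
// 2+ hits → ambiguity error with qualified candidates.
enum Resolve {
    NotFound,
    Ok(usize),              // resolved col_idx
    Ambiguous(Vec<String>), // qualified candidates for the error message
}

fn resolve_unqualified(tables: &[(&str, &[&str], usize)], name: &str) -> Resolve {
    let mut hits = Vec::new();
    for (tname, cols, offset) in tables {
        if let Some(i) = cols.iter().position(|c| *c == name) {
            hits.push((format!("{tname}.{name}"), offset + i));
        }
    }
    match hits.len() {
        0 => Resolve::NotFound,
        1 => Resolve::Ok(hits[0].1),
        _ => Resolve::Ambiguous(hits.into_iter().map(|(q, _)| q).collect()),
    }
}

fn main() {
    let users: &[&str] = &["id", "name"];
    let orders: &[&str] = &["id", "total"];
    let ctx = [("users", users, 0), ("orders", orders, 2)];
    assert!(matches!(resolve_unqualified(&ctx, "total"), Resolve::Ok(3)));
    assert!(matches!(resolve_unqualified(&ctx, "id"), Resolve::Ambiguous(_)));
    assert!(matches!(resolve_unqualified(&ctx, "email"), Resolve::NotFound));
}
```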
Subqueries in FROM
Subqueries in the FROM clause (derived tables) are analyzed recursively:
SELECT outer.total
FROM (
SELECT user_id, SUM(total) AS total
FROM orders
WHERE status = 'paid'
GROUP BY user_id
) AS outer
WHERE outer.total > 1000
The inner SELECT is analyzed first, producing a virtual BoundTable whose columns
are the output columns of the subquery (user_id, total). The outer BindContext
then treats this virtual table exactly like a real catalog table.
What the Analyzer Validates per Statement Type
SELECT
- FROM clause: every table reference exists in the catalog (or is a valid subquery).
- JOIN conditions: every column in ON expr resolves correctly against the BindContext.
- WHERE clause: every column reference resolves.
- GROUP BY: every expression resolves.
- HAVING: every column reference resolves (must be either in GROUP BY or aggregate).
- ORDER BY: every expression resolves.
INSERT
- Target table exists in the catalog.
- Each named column in the column list exists in the table.
- If INSERT ... SELECT, the inner SELECT is analyzed.
- Column count in VALUES must match the column list (or all non-DEFAULT columns if no column list is given).
UPDATE
- Target table exists in the catalog.
- Every column in SET assignments exists in the table.
- WHERE clause column references resolve against the target table.
DELETE
- Target table exists in the catalog.
- WHERE clause column references resolve against the target table.
CREATE TABLE
- No table with the same name exists (unless IF NOT EXISTS).
- Each REFERENCES table(col) in a foreign key references a table that exists and a column that exists in that table and is a primary key or unique column.
- CHECK expressions are parsed and type-checked (must evaluate to boolean).
DROP TABLE
- Target table exists (unless IF EXISTS).
- No other table has a foreign key pointing to the target (unless CASCADE).
CREATE INDEX
- Target table exists in the catalog.
- Every indexed column exists in the table.
- No index with the same name already exists (unless IF NOT EXISTS).
CREATE DATABASE / DROP DATABASE / USE / SHOW DATABASES
These statements are mostly pass-through at the analyzer layer:
- CREATE DATABASE and DROP DATABASE carry names but no column bindings
- USE is validated against the database catalog at execution/wire time
- SHOW DATABASES produces a computed rowset and needs no name resolution
Error Types
| Error | SQLSTATE | When it occurs |
|---|---|---|
TableNotFound | 42P01 | FROM, JOIN, or REFERENCES points to unknown table |
ColumnNotFound | 42703 | Column name not in any in-scope table |
AmbiguousColumn | 42702 | Unqualified column matches in multiple tables |
DuplicateTable | 42P07 | CREATE TABLE for an existing table |
TypeMismatch | 42804 | Expression type incompatible with column type |
Snapshot Isolation in the Analyzer
The analyzer calls CatalogReader::list_tables and CatalogReader::list_columns
with the caller’s TransactionSnapshot. This means the analyzer sees the schema as
it appeared at the start of the current transaction, not the latest committed schema.
This ensures that:
- A concurrent DDL (CREATE TABLE) that commits after the current transaction began is invisible to the current transaction’s analyzer.
- Schema changes within the same transaction are visible to subsequent statements in that same transaction.
Post-Analysis AST
After analysis, every Expr::Column in the AST has its col_idx set to the correct
position in the combined row. The executor uses col_idx to index directly into the
row array — no name lookup occurs at execution time.
#![allow(unused)]
fn main() {
// Before analysis (from parser):
Expr::Column { name: "total".to_string(), table: Some("o".to_string()), col_idx: 0 }
// After analysis (from analyzer):
Expr::Column { name: "total".to_string(), table: Some("o".to_string()), col_idx: 6 }
// col_idx = orders.col_offset (4) + position of "total" in orders (2)
}
This separation of concerns means the executor is a pure interpreter over the analyzed AST — it never touches the catalog and never performs name resolution. All validation errors are caught before any I/O begins.
SQL Executor
The executor is the component that interprets an analyzed Stmt (all column
references resolved to col_idx by the semantic analyzer) and drives it to
completion, returning a QueryResult. It is the highest-level component in the
query pipeline.
Since subphase 5.19a, the executor no longer lives in a single source file.
It is organized under crates/axiomdb-sql/src/executor/ with mod.rs as the
stable facade and responsibility-based source files behind it.
Source Layout
| File | Responsibility |
|---|---|
| `executor/mod.rs` | public facade, statement dispatch, thread-local last-insert-id |
| `executor/shared.rs` | helpers shared across multiple statement families |
| `executor/select.rs` | SELECT entrypoints, projection, ORDER BY/LIMIT wiring |
| `executor/joins.rs` | nested-loop join execution and join-specific metadata |
| `executor/aggregate.rs` | GROUP BY, aggregates, DISTINCT/group-key helpers |
| `executor/insert.rs` | INSERT and INSERT … SELECT paths |
| `executor/update.rs` | UPDATE execution |
| `executor/delete.rs` | DELETE execution and candidate collection |
| `executor/bulk_empty.rs` | shared bulk-empty helpers for DELETE/TRUNCATE |
| `executor/ddl.rs` | DDL, SHOW, ANALYZE, TRUNCATE |
| `executor/staging.rs` | transactional INSERT staging flushes and barrier handling |
Integration Test Layout
The executor integration coverage no longer sits in one giant test binary. The
current axiomdb-sql/tests/ layout is responsibility-based, mirroring the
module split in src/executor/.
| Binary | Main responsibility |
|---|---|
| `integration_executor` | CRUD base and simple transaction behavior |
| `integration_executor_joins` | JOINs and aggregate execution |
| `integration_executor_query` | ORDER BY, LIMIT, DISTINCT, CASE, INSERT ... SELECT, AUTO_INCREMENT |
| `integration_executor_ddl` | SHOW, DESCRIBE, TRUNCATE, ALTER TABLE |
| `integration_executor_ctx` | base SessionContext execution and strict_mode |
| `integration_executor_ctx_group` | ctx-path sorted group-by |
| `integration_executor_ctx_limit` | ctx-path LIMIT / OFFSET coercion |
| `integration_executor_ctx_on_error` | ctx-path on_error behavior |
| `integration_executor_sql` | broader SQL semantics outside the ctx path |
| `integration_delete_apply` | bulk and indexed DELETE apply paths |
| `integration_insert_staging` | transactional INSERT staging |
| `integration_namespacing` | database catalog behavior: CREATE/DROP DATABASE, USE, SHOW DATABASES |
| `integration_namespacing_cross_db` | explicit database.schema.table resolution and cross-db DML/DDL |
| `integration_namespacing_schema` | schema namespacing, search_path, and schema-aware SHOW TABLES |
Shared helpers live in crates/axiomdb-sql/tests/common/mod.rs.
The day-to-day workflow is intentionally narrow:
- start with the smallest binary that matches the code path you changed
- add directly related binaries only when the change touches shared helpers or a nearby execution path
- use `cargo test -p axiomdb-sql --tests` as the crate-level confidence gate, not as the default inner-loop command
- if a new behavior belongs to an existing themed binary, add the test there instead of creating a new binary immediately
cargo test -p axiomdb-sql --test integration_executor_query
cargo test -p axiomdb-sql --test integration_executor_query test_insert_select_aggregation -- --exact
UPDATE Apply Fast Path (6.20)
6.17 fixed indexed UPDATE discovery, but the default update_range benchmark
was still paying most of its cost after rows had already been found. 6.20
removes that apply-side overhead in four steps:
- `IndexLookup`/`IndexRange` candidates are decoded through `TableEngine::read_rows_batch(...)`, which groups `RecordId`s by `page_id` and restores the original RID order after each page is read once.
- UPDATE evaluates the new row image before touching the heap and drops rows whose `new_values == old_values`.
- Stable-RID rewrites accumulate `(key, old_tuple_image, new_tuple_image, page_id, slot_id)` and emit their normal `UpdateInPlace` WAL records through one `record_update_in_place_batch(...)` call.
- If any index really is affected, UPDATE now does one grouped delete pass, one grouped insert pass, and one final root persistence write per index.
The coarse executor bailout is statement-level:
if all physically changed rows keep the same RID
and no SET column overlaps any index key column
and no SET column overlaps any partial-index predicate dependency:
skip index maintenance for the statement
This is the common PK-only `UPDATE score WHERE id BETWEEN ...` case in
`local_bench.py`.
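The bailout condition above can be written as a small pure predicate. This is an illustrative sketch: the name and signature are invented for clarity, not the executor's actual API.

```rust
/// Illustrative sketch of the statement-level index-maintenance bailout.
/// `set_cols`: column indexes assigned by SET clauses.
/// `index_key_cols` / `partial_pred_cols`: per-index key columns and
/// partial-index predicate dependencies, as resolved col_idx lists.
fn can_skip_index_maintenance(
    all_changed_rids_stable: bool,
    set_cols: &[usize],
    index_key_cols: &[&[usize]],
    partial_pred_cols: &[&[usize]],
) -> bool {
    let overlaps = |cols: &&[usize]| cols.iter().any(|c| set_cols.contains(c));
    all_changed_rids_stable
        && !index_key_cols.iter().any(overlaps)
        && !partial_pred_cols.iter().any(overlaps)
}

fn main() {
    // PK-only UPDATE of a non-indexed column: maintenance skipped.
    assert!(can_skip_index_maintenance(true, &[2], &[&[0]], &[]));
    // SET touches an index key column: maintenance required.
    assert!(!can_skip_index_maintenance(true, &[0], &[&[0]], &[]));
}
```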
PostgreSQL's HOT (heap-only tuple) update check, `indexColumnIsBeingUpdated()`-style predicates in other engines, and MariaDB's clustered-vs-secondary UPDATE split all ask the same question: did any index-relevant attribute actually change? `6.20` adapts that rule directly, without adding HOT chains or change buffering.
Entry Point
#![allow(unused)]
fn main() {
pub fn execute(
stmt: Stmt,
storage: &mut dyn StorageEngine,
txn: &mut TxnManager,
) -> Result<QueryResult, DbError>
}
When no transaction is active, execute wraps the statement in an implicit
BEGIN / COMMIT (autocommit mode). Transaction control statements (BEGIN,
COMMIT, ROLLBACK) bypass autocommit and operate on TxnManager directly.
All reads use txn.active_snapshot()? — a snapshot fixed at BEGIN — so that
writes made earlier in the same transaction are visible (read-your-own-writes).
Transactional INSERT staging (Phase 5.21)
5.21 adds a statement-boundary staging path for consecutive
INSERT ... VALUES statements inside one explicit transaction.
Data structure
SessionContext now owns:
#![allow(unused)]
fn main() {
PendingInsertBatch {
table_id: u32,
table_def: TableDef,
columns: Vec<ColumnDef>,
indexes: Vec<IndexDef>,
compiled_preds: Vec<Option<Expr>>,
rows: Vec<Vec<Value>>,
unique_seen: HashMap<u32, HashSet<Vec<u8>>>,
}
}
The buffer exists only while the connection is inside an explicit transaction. Autocommit-wrapped single statements do not use it.
Enqueue path
For every eligible INSERT row, executor/insert.rs does all logical work up
front:
- evaluate expressions
- expand omitted columns
- assign AUTO_INCREMENT if needed
- run CHECK constraints
- run FK child validation
- reject duplicate UNIQUE / PK keys against:
  - committed index state
  - `unique_seen` inside the current batch
- append the fully materialized row to `PendingInsertBatch.rows`
No heap write or WAL append happens yet.
Flush barriers
The batch is flushed before:
- `SELECT`
- `UPDATE`
- `DELETE`
- DDL
- `COMMIT`
- table switch to another `INSERT` target
- any ineligible INSERT shape
ROLLBACK discards the batch without heap or WAL writes.
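A minimal sketch of the barrier decision, assuming a simplified statement enum (the real executor inspects the analyzed `Stmt`; names here are invented):

```rust
/// Simplified model of the statement shapes the staging barrier cares about.
#[allow(dead_code)]
enum NextStmt<'a> {
    ValuesInsert { table: &'a str, eligible: bool },
    Select,
    Update,
    Delete,
    Ddl,
    Commit,
}

/// Flush the staged batch unless the next statement continues it:
/// an eligible VALUES INSERT into the same table.
fn must_flush_before(next: &NextStmt, staged_table: &str) -> bool {
    match next {
        NextStmt::ValuesInsert { table, eligible: true } if *table == staged_table => false,
        _ => true,
    }
}

fn main() {
    // Same table, eligible shape: the batch keeps accumulating.
    assert!(!must_flush_before(&NextStmt::ValuesInsert { table: "t", eligible: true }, "t"));
    // Table switch forces a flush.
    assert!(must_flush_before(&NextStmt::ValuesInsert { table: "u", eligible: true }, "t"));
    // Reads are barriers too: staged rows must be visible to the SELECT.
    assert!(must_flush_before(&NextStmt::Select, "t"));
}
```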
Savepoint ordering invariant
When a transaction uses statement-level savepoints (rollback_statement,
savepoint, ignore), the executor must flush staged rows before taking
the next statement savepoint if the current statement cannot continue the batch.
Without that ordering, a failing statement after a table switch could roll back rows that logically belonged to earlier successful INSERT statements.
Flush algorithm
executor/staging.rs performs:
- `TableEngine::insert_rows_batch_with_ctx(...)`
- `batch_insert_into_indexes(...)`
- one `CatalogWriter::update_index_root(...)` per changed index
- stats update
The current design still inserts index entries row-by-row inside the flush. That cost is explicit and remains the next insert-side optimization candidate if future profiling shows it dominates after staging.
ClusteredInsertBatch (Phase 40.1)
Phase 40.1 extends the PendingInsertBatch pattern to clustered (primary-key
ordered) tables, eliminating the per-row CoW B-tree overhead that made clustered
inserts 2× slower than heap inserts inside explicit transactions.
Root cause of the pre-40.1 gap
Before 40.1, every clustered INSERT inside an explicit transaction called
apply_clustered_insert_rows immediately, which performs:
- `storage.read_page(root)` — 16 KB page read
- `storage.write_page(new_root, page)` — 16 KB CoW page write
- WAL append
- Secondary index write
For N = 50 000 rows that is 100 000 storage operations just for the base tree.
Data structures
#![allow(unused)]
fn main() {
// session.rs
pub struct StagedClusteredRow {
pub values: Vec<Value>,
pub encoded_row: Vec<u8>,
pub primary_key_values: Vec<Value>,
pub primary_key_bytes: Vec<u8>,
}
pub struct ClusteredInsertBatch {
pub table_id: u32,
pub table_def: TableDef,
pub primary_idx: IndexDef,
pub secondary_indexes: Vec<IndexDef>,
pub secondary_layouts: Vec<ClusteredSecondaryLayout>,
pub compiled_preds: Vec<Option<Expr>>,
pub rows: Vec<StagedClusteredRow>,
pub staged_pks: HashSet<Vec<u8>>, // O(1) intra-batch PK dedup
}
}
StagedClusteredRow is structurally identical to PreparedClusteredInsertRow
(defined in clustered_table.rs) but lives in session.rs to avoid a circular
dependency: clustered_table.rs imports SessionContext.
Enqueue path (enqueue_clustered_insert_ctx)
For each row in the VALUES list:
- Evaluate expressions, expand columns, assign AUTO_INCREMENT.
- Validate CHECK constraints and FK child references.
- Encode via
prepare_row_with_ctx(coerce + PK extract + row codec). - Check
staged_pks— returnUniqueViolationand discard batch on intra-batch PK duplicate. - Push
StagedClusteredRowand insert PK bytes intostaged_pks.
Committed-data PK duplicates are caught at flush time by lookup_physical inside
apply_clustered_insert_rows (same as the pre-40.1 single-statement path).
Flush path (flush_clustered_insert_batch)
1. Sort staged rows ascending by pk_bytes
→ enables append-biased detection in apply_clustered_insert_rows
2. Convert StagedClusteredRow → PreparedClusteredInsertRow (field move)
3. Call apply_clustered_insert_rows (existing function):
a. detect append-biased pattern (all PKs increasing)
b. loop: try_insert_rightmost_leaf_batch → fast O(leaf_capacity) write
fallback: single-row clustered_tree::insert
c. WAL record_clustered_insert per row
d. maintain_clustered_secondary_inserts per row
e. persist changed roots
4. ctx.stats.on_rows_changed + ctx.invalidate_all
Sorting the staged rows by PK bytes in step 1 is what allows
try_insert_rightmost_leaf_batch to fill each leaf page once.
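The append-biased detection in step 3a can be sketched as follows. This is a hypothetical helper: whether the batch must also sort after the tree's current rightmost key, modeled here as `tree_max_pk`, is an assumption of this sketch.

```rust
/// Sketch: a staged batch is "append-biased" when its PK byte strings are
/// strictly increasing and all sort after the tree's current rightmost key.
/// (`tree_max_pk` handling is an assumption of this sketch.)
fn is_append_biased(sorted_pks: &[&[u8]], tree_max_pk: Option<&[u8]>) -> bool {
    let strictly_increasing = sorted_pks.windows(2).all(|w| w[0] < w[1]);
    let after_existing = match (sorted_pks.first(), tree_max_pk) {
        (Some(first), Some(max)) => *first > max,
        _ => true, // empty batch or empty tree
    };
    strictly_increasing && after_existing
}

fn main() {
    let batch: &[&[u8]] = &[b"0001", b"0002"];
    assert!(is_append_biased(batch, Some(b"0000")));
    // Keys that interleave with existing data are not rightmost appends.
    assert!(!is_append_biased(batch, Some(b"0005")));
}
```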
Barrier detection
should_flush_clustered_batch_before_stmt returns false only when the next
statement is a VALUES INSERT into the same clustered table (batch continues).
For all other statements, the batch is flushed before dispatch. This mirrors the
existing should_flush_pending_inserts_before_stmt logic for heap tables.
ROLLBACK discards the batch via discard_clustered_insert_batch() — no storage
writes, no WAL entries, no undo needed.
CREATE INDEX on clustered tables (Phase 40.1b)
execute_create_index (ddl.rs) now handles both heap and clustered tables with a
single function. The dispatch happens after the B-Tree root page is allocated:
if table_def.is_clustered() {
primary_idx ← CatalogReader::list_indexes → find(is_primary)
preview_def ← IndexDef { columns, is_unique, fillfactor, root_page_id, … }
layout ← ClusteredSecondaryLayout::derive(&preview_def, &primary_idx)
rows ← scan_clustered_table(storage, &table_def, &col_defs, snap)
for row in rows:
if partial predicate → skip non-matching rows
entry = layout.entry_from_row(row) → physical_key for bloom
layout.insert_row(storage, &root_pid, row) → uniqueness + B-Tree insert
} else {
rows ← scan_table(…)
for row in rows: encode_index_key + BTree::insert_in
}
// step 8: stats bootstrap uses same `rows` Vec — no extra I/O
The ClusteredSecondaryLayout encodes the physical key as
secondary_cols ++ suffix_primary_cols — exactly the format used by runtime
INSERT/UPDATE/DELETE, so a clustered secondary index built by CREATE INDEX
is byte-for-byte compatible with those written by the DML executors.
entry_from_row is called once per row to collect the physical key for the bloom
filter, and insert_row calls it again internally for the B-Tree write. This is
acceptable overhead during a DDL operation (O(n) with constant factor ≈2).
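The concatenated physical-key layout can be sketched with the column codec abstracted away as pre-encoded byte strings (the real code goes through `encode_index_key` and `ClusteredSecondaryLayout`):

```rust
/// Sketch of the clustered-secondary physical key:
/// encoded secondary columns followed by the encoded primary-key suffix.
fn physical_secondary_key(encoded_secondary: &[u8], encoded_pk_suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(encoded_secondary.len() + encoded_pk_suffix.len());
    key.extend_from_slice(encoded_secondary); // secondary_cols
    key.extend_from_slice(encoded_pk_suffix); // suffix_primary_cols (tiebreaker)
    key
}

fn main() {
    // Two rows with the same secondary value stay globally unique via the PK suffix.
    let a = physical_secondary_key(b"alice", b"\x00\x01");
    let b = physical_secondary_key(b"alice", b"\x00\x02");
    assert_ne!(a, b);
    // PK order breaks ties within equal secondary values.
    assert!(a < b);
}
```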
NULL handling
ClusteredSecondaryLayout::entry_from_row returns None when any secondary column
is NULL. Both the bloom key collection and the B-Tree insert are skipped in that case,
consistent with the runtime INSERT path and SQL standard NULL semantics for indexes.
Uniqueness enforcement
insert_row delegates uniqueness to ensure_unique_logical_key_absent, the same
function used at runtime. If an existing row already carries that logical key, the
build fails with DbError::UniqueViolation before the catalog entry is written.
Query Pipeline
SQL string
→ tokenize() logos DFA, ~85 tokens, zero-copy &str
→ parse() recursive descent, produces Stmt with col_idx = 0
→ analyze() BindContext resolves every col_idx
→ execute() dispatches to per-statement handler
├── scan_table HeapChain::scan_visible + decode_row
├── filter eval(WHERE, &row) + is_truthy
├── join nested-loop, apply_join
├── aggregate hash-based GroupState
├── sort apply_order_by, compare_sort_values
├── deduplicate apply_distinct, value_to_key_bytes
├── project project_row / project_grouped_row
└── paginate apply_limit_offset
→ QueryResult::Rows / Affected / Empty
JOIN — Nested Loop
Phase 4 implements nested-loop joins. All tables are pre-scanned once before any loop begins — scanning inside the inner loop would re-read the same data O(n) times and could see partially-inserted rows.
Algorithm
scanned[0] = scan(FROM table)
scanned[1] = scan(JOIN[0] table)
...
combined_rows = scanned[0]
for each JoinClause in stmt.joins:
combined_rows = apply_join(combined_rows, scanned[i+1], join_type, ON/USING)
apply_join per type
| Join type | Behavior |
|---|---|
| INNER / CROSS | Emit combined row for each pair where ON is truthy |
| LEFT | Emit all left rows; unmatched left → right side padded with NULL |
| RIGHT | Emit all right rows; unmatched right → left side padded with NULL; uses a `matched_right: Vec<bool>` bitset |
| FULL | NotImplemented — Phase 4.8+ |
USING condition
USING(col_name) is resolved at execution time using left_schema: Vec<(name, col_idx)>,
accumulated across all join stages. The condition combined[left_idx] == combined[right_idx]
uses SQL equality — NULL = NULL returns UNKNOWN (false), so NULLs never match in USING.
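The UNKNOWN-never-matches rule can be modeled with `Option<bool>` as three-valued logic. This is illustrative only; the engine evaluates the condition through its `Value` evaluator.

```rust
/// Three-valued SQL equality: None models UNKNOWN.
fn sql_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None, // any NULL operand yields UNKNOWN
    }
}

/// USING keeps a pair only when equality is definitely TRUE;
/// UNKNOWN counts as a non-match.
fn using_matches(a: Option<i64>, b: Option<i64>) -> bool {
    sql_eq(a, b) == Some(true)
}

fn main() {
    assert!(using_matches(Some(1), Some(1)));
    assert!(!using_matches(Some(1), Some(2)));
    // NULL = NULL is UNKNOWN, so NULLs never join in USING.
    assert!(!using_matches(None, None));
}
```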
GROUP BY — Strategy Selection (Phase 4.9b)
The executor selects between two GROUP BY execution strategies at runtime:
| Strategy | When selected | Behavior |
|---|---|---|
| Hash | Default; JOINs; derived tables; plain scans | HashMap per group key; O(k) memory |
| Sorted { presorted: true } | Single-table ctx path + compatible B-Tree index | Stream adjacent equal groups; O(1) memory |
#![allow(unused)]
fn main() {
enum GroupByStrategy {
Hash,
Sorted { presorted: bool },
}
}
Strategy selection (choose_group_by_strategy_ctx) is only active on the
single-table ctx path (execute_with_ctx). All JOIN, derived-table, and
non-ctx paths use Hash.
Prefix Match Rule
The sorted strategy is selected when all four conditions hold:
- Access method is `IndexLookup`, `IndexRange`, or `IndexOnlyScan`.
- Every `GROUP BY` expression is a plain `Expr::Column` (no function calls, no aliases).
- The column references match the leading key prefix of the chosen index in the same order.
- The prefix length ≤ number of index columns.
Examples (index (region, dept)):
| GROUP BY | Result |
|---|---|
| region, dept | ✅ Sorted |
| region | ✅ Sorted (prefix) |
| dept, region | ❌ Hash (wrong order) |
| LOWER(region) | ❌ Hash (computed expression) |
This is correct because BTree::range_in guarantees rows arrive in key order,
and equal leading prefixes are contiguous even with extra suffix columns or RID
suffixes on non-unique indexes.
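Conditions 3 and 4 reduce to a prefix check over resolved `col_idx` lists. A sketch (name invented; it does not model the plain-`Expr::Column` condition):

```rust
/// Sketch of the prefix-match check: GROUP BY columns must equal the
/// leading key columns of the chosen index, in order.
fn sorted_strategy_eligible(group_by_cols: &[usize], index_cols: &[usize]) -> bool {
    group_by_cols.len() <= index_cols.len()
        && group_by_cols.iter().zip(index_cols.iter()).all(|(g, i)| g == i)
}

fn main() {
    let index = [0, 1]; // index (region, dept) as resolved col_idx values
    assert!(sorted_strategy_eligible(&[0, 1], &index)); // region, dept
    assert!(sorted_strategy_eligible(&[0], &index));    // region (prefix)
    assert!(!sorted_strategy_eligible(&[1, 0], &index)); // dept, region: wrong order
    assert!(!sorted_strategy_eligible(&[0, 1, 2], &index)); // longer than the index
}
```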
PostgreSQL implements GroupAggregate (sorted) and HashAggregate as separate strategies selected at planning time (pathnodes.h). DuckDB selects the aggregation strategy at physical plan time based on input guarantees. AxiomDB borrows the two-strategy concept but selects at execution time using the already-chosen access method — no separate planner pass needed.
GROUP BY — Hash Aggregation
GROUP BY uses a single-pass hash aggregation strategy: one scan through the filtered rows, accumulating aggregate state per group key.
Specialized Hash Tables (subphase 39.21)
Two hash table types avoid generic dispatch overhead:
- `GroupTablePrimitive` — single-column GROUP BY on integer-like values (`INT`, `BIGINT`, `DOUBLE`, `Bool`). Maps `i64` → `GroupEntry` via `hashbrown::HashMap<i64, usize>`. No key serialization needed; comparison is a single integer equality check.
- `GroupTableGeneric` — multi-column GROUP BY, TEXT columns, mixed types, and the global no-GROUP-BY case. Serializes group keys into a `Vec<u8>` reused across rows (zero allocation when capacity fits), maps `&[u8]` → `GroupEntry` via `hashbrown::HashMap<Box<[u8]>, usize>`.
Both tables store entries in a Vec<GroupEntry> and use the hash maps as index
structures. This keeps entries contiguous in memory and avoids pointer chasing
during the accumulation loop.
hashbrown (the same table backing Rust's std::HashMap) uses SIMD-accelerated quadratic probing (SSE2/NEON). For a 62-group workload over 50K rows, this cuts probe overhead by ~30% vs a naïve open-addressing table. The specialized GroupTablePrimitive path avoids serialization entirely, reducing per-row work to one integer hash + one equality check.
Group Key Serialization
Value contains f64 which does not implement Hash in Rust. AxiomDB uses a
custom self-describing byte serialization instead of the row codec:
value_to_key_bytes(Value::Null) → [0x00]
value_to_key_bytes(Value::Int(n)) → [0x02, n as 4 LE bytes]
value_to_key_bytes(Value::Text(s)) → [0x06, len as 4 LE bytes, UTF-8 bytes]
...
Two NULL values produce identical bytes [0x00] → they form one group.
This matches SQL GROUP BY semantics: NULLs are considered equal for grouping
(unlike NULL = NULL in comparisons, which is UNKNOWN).
The group key for a multi-column GROUP BY is the concatenation of all column
serializations. The key_buf: Vec<u8> is allocated once before the scan loop
and reused (with clear() + extend_from_slice) for every row, so multi-column
GROUP BY does not allocate per row for the probe step.
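A minimal sketch of the serialization, using the tag bytes shown above. The real `Value` enum has more variants; this models only the three tags the example lists.

```rust
/// Reduced Value enum for the sketch: 0x00 NULL, 0x02 INT, 0x06 TEXT.
enum Value {
    Null,
    Int(i32),
    Text(String),
}

/// Append a self-describing key encoding of `v` to `out`.
fn value_to_key_bytes(v: &Value, out: &mut Vec<u8>) {
    match v {
        Value::Null => out.push(0x00),
        Value::Int(n) => {
            out.push(0x02);
            out.extend_from_slice(&n.to_le_bytes()); // 4 LE bytes
        }
        Value::Text(s) => {
            out.push(0x06);
            out.extend_from_slice(&(s.len() as u32).to_le_bytes()); // len as 4 LE bytes
            out.extend_from_slice(s.as_bytes());
        }
    }
}

fn main() {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    value_to_key_bytes(&Value::Null, &mut a);
    value_to_key_bytes(&Value::Null, &mut b);
    assert_eq!(a, b); // two NULLs serialize identically, so they form one group
    assert_eq!(a, vec![0x00]);

    let mut i = Vec::new();
    value_to_key_bytes(&Value::Int(7), &mut i);
    assert_eq!(i, vec![0x02, 7, 0, 0, 0]);

    let mut t = Vec::new();
    value_to_key_bytes(&Value::Text("hi".into()), &mut t);
    assert_eq!(t, vec![0x06, 2, 0, 0, 0, b'h', b'i']);
}
```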
GroupEntry
Each unique group key maps to a GroupEntry:
#![allow(unused)]
fn main() {
struct GroupEntry {
key_values: Vec<Value>, // GROUP BY expression results (for output)
non_agg_col_values: Vec<Value>, // non-aggregate SELECT cols (for HAVING/output)
accumulators: Vec<AggAccumulator>,
}
}
non_agg_col_values is a sparse slice: only columns referenced by non-aggregate
SELECT items or HAVING expressions are stored. Their indices are pre-computed once
(compute_non_agg_col_indices) before the scan loop and reused for every group.
An earlier design stored representative_row: Row — the full first source row per group — to resolve HAVING column references. This costs one full Vec<Value> clone per group, regardless of how many columns HAVING actually needs. non_agg_col_values instead stores only the columns referenced by non-aggregate SELECT items and HAVING, computed once before the scan loop. For a 6-column table where HAVING references 1 column, this reduces per-group memory by ~83%.
Aggregate Accumulators
| Aggregate | Accumulator | NULL behavior |
|---|---|---|
| COUNT(*) | u64 counter | Increments for every row |
| COUNT(col) | u64 counter | Skips rows where col is NULL |
| SUM(col) | Option<Value> | Skips NULL; None if all rows are NULL |
| MIN(col) | Option<Value> | Skips NULL; tracks running minimum |
| MAX(col) | Option<Value> | Skips NULL; tracks running maximum |
| AVG(col) | (sum: Value, count: u64) | Skips NULL; final = sum / count as Real |
AVG always returns Real (SQL standard), even for integer columns. This
avoids integer truncation (MySQL-style AVG(INT) returns DECIMAL but truncates
in many contexts). AVG of all-NULL rows returns NULL.
Fast-path arithmetic (value_agg_add): For SUM, MIN, MAX, and COUNT,
the accumulator is updated via direct arithmetic on Value variants, bypassing
eval(). This eliminates the expression evaluator overhead for the innermost
loop of the aggregate scan.
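The AVG row of the table can be sketched as a standalone accumulator, simplified to `f64` (the engine accumulates `Value`s and divides through its numeric promotion rules):

```rust
/// Sketch of the AVG accumulator contract: skip NULLs, return Real,
/// NULL for all-NULL (or empty) input.
#[derive(Default)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    fn update(&mut self, v: Option<f64>) {
        if let Some(x) = v {
            self.sum += x;
            self.count += 1;
        }
        // NULL inputs are skipped entirely.
    }

    fn finalize(&self) -> Option<f64> {
        if self.count == 0 {
            None // AVG over all-NULL rows is NULL
        } else {
            Some(self.sum / self.count as f64) // always Real, no integer truncation
        }
    }
}

fn main() {
    let mut acc = AvgAccumulator::default();
    for v in [Some(1.0), None, Some(2.0)] {
        acc.update(v);
    }
    assert_eq!(acc.finalize(), Some(1.5)); // NULL skipped: (1 + 2) / 2
    assert_eq!(AvgAccumulator::default().finalize(), None);
}
```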
Ungrouped Aggregates
SELECT COUNT(*) FROM t (no GROUP BY) is handled as a single-group query with
an empty key. Even on an empty table, the executor emits exactly one output
row — (0) for COUNT(*), NULL for SUM/MIN/MAX/AVG. This matches the
SQL standard and every major database.
Column Decode Mask
Before scanning, collect_expr_columns walks all expressions in SELECT items,
WHERE, GROUP BY, HAVING, and ORDER BY to build a Vec<bool> mask indexed by
column position. Only columns with mask[i] == true are decoded from the row
bytes. For a SELECT age, AVG(score) FROM users GROUP BY age query on a
6-column table, this skips decoding name and email (TEXT fields) entirely.
The mask is forwarded to scan_clustered_table_masked as Option<&[bool]> and
passed into decode_row_masked at the codec level, which skips variable-length
fields that are not needed.
GROUP BY — Sorted Streaming Executor (Phase 4.9b)
The sorted executor replaces the hash table with a single linear pass over pre-ordered rows, accumulating state for the current group and emitting it when the key changes.
Algorithm
rows_with_keys = [(row, eval(group_by exprs, row)) for row in combined_rows]
if !presorted:
stable_sort rows_with_keys by compare_group_key_lists
current_key = rows_with_keys[0].key_values
current_accumulators = AggAccumulator::new() for each aggregate
update accumulators with rows_with_keys[0].row
for next in rows_with_keys[1..]:
if group_keys_equal(current_key, next.key_values):
update accumulators with next.row
else:
finalize → apply HAVING → emit output row
reset: current_key = next.key_values, new accumulators, update
finalize last group
Key Comparison
#![allow(unused)]
fn main() {
fn compare_group_key_lists(a: &[Value], b: &[Value]) -> Ordering
fn group_keys_equal(a: &[Value], b: &[Value]) -> bool
}
Uses compare_values_null_last so NULL == NULL for grouping (consistent with
the hash path’s serialization). Comparison is left-to-right: returns the first
non-Equal ordering.
Shared Aggregate Machinery
Both hash and sorted executors reuse the same:
- `AggAccumulator` (state, update, finalize)
- `eval_with_aggs` (HAVING evaluation)
- `project_grouped_row` (output projection)
- `build_grouped_column_meta` (column metadata)
- GROUP_CONCAT handling
- Post-group DISTINCT / ORDER BY / LIMIT
The payoff: O(1) accumulator memory (one group at a time) instead of O(k) where k = distinct groups. For a high-cardinality grouping column with many distinct values, this eliminates the entire hash table allocation.
ORDER BY — Multi-Column Sort
ORDER BY is applied after scan + filter + aggregation but before projection
for non-GROUP BY queries. For GROUP BY queries, it is applied to the projected
output rows after remap_order_by_for_grouped rewrites column references.
ORDER BY in GROUP BY Context — Expression Remapping
Grouped output rows are indexed by SELECT output position: position 0 = first
SELECT item, position 1 = second, etc. ORDER BY expressions, however, are
analyzed against the source schema where Expr::Column { col_idx } refers to
the original table column.
remap_order_by_for_grouped fixes this mismatch before calling apply_order_by:
remap_order_by_for_grouped(order_by, select_items):
for each ORDER BY item:
rewrite expr via remap_expr_for_grouped(expr, select_items)
remap_expr_for_grouped(expr, select_items):
if expr == select_items[pos].expr (structural PartialEq):
return Column { col_idx: pos } // output position
match expr:
BinaryOp → recurse into left, right
UnaryOp → recurse into operand
IsNull → recurse into inner
Between → recurse into expr, low, high
Function → recurse into args
other → return unchanged
This means ORDER BY dept (where dept is Expr::Column{col_idx:1} in the
source) becomes Expr::Column{col_idx:0} when the SELECT is SELECT dept, COUNT(*),
correctly indexing into the projected output row.
Aggregate expressions like ORDER BY COUNT(*) are matched structurally:
if Expr::Function{name:"count", args:[]} appears in the SELECT at position 1,
it is rewritten to Expr::Column{col_idx:1}.
AxiomDB relies on the derived PartialEq on Expr to identify ORDER BY expressions that match SELECT items. This is simpler than PostgreSQL's SortClause/TargetEntry reference system and correct for the common cases (column references, aggregates, compound expressions).
NULL Ordering Defaults (PostgreSQL-compatible)
| Direction | Default | Override |
|---|---|---|
| ASC | NULLs LAST | NULLS FIRST |
| DESC | NULLs FIRST | NULLS LAST |
compare_sort_values(a, b, direction, nulls_override):
nulls_first = explicit_nulls_order OR (DESC && no explicit)
if a = NULL and b = NULL → Equal
if a = NULL → Less if nulls_first, else Greater
if b = NULL → Greater if nulls_first, else Less
otherwise → compare a and b, reverse if DESC
Non-NULL comparison delegates to eval(BinaryOp{Lt}, Literal(a), Literal(b))
via the expression evaluator, reusing all type coercion and promotion logic.
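The comparison pseudocode translates almost directly to Rust. This sketch simplifies values to `Option<i64>` and skips the evaluator delegation for the non-NULL case:

```rust
use std::cmp::Ordering;

/// Sketch of the NULL-ordering rules: ASC defaults to NULLS LAST,
/// DESC to NULLS FIRST, with an explicit override.
fn compare_sort_values(
    a: Option<i64>,
    b: Option<i64>,
    desc: bool,
    nulls_first_override: Option<bool>,
) -> Ordering {
    let nulls_first = nulls_first_override.unwrap_or(desc);
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => if nulls_first { Ordering::Less } else { Ordering::Greater },
        (Some(_), None) => if nulls_first { Ordering::Greater } else { Ordering::Less },
        (Some(x), Some(y)) => {
            let ord = x.cmp(&y);
            if desc { ord.reverse() } else { ord }
        }
    }
}

fn main() {
    // ASC: NULLs sort last by default.
    assert_eq!(compare_sort_values(None, Some(1), false, None), Ordering::Greater);
    // DESC: NULLs sort first by default.
    assert_eq!(compare_sort_values(None, Some(1), true, None), Ordering::Less);
    // Explicit NULLS FIRST overrides the ASC default.
    assert_eq!(compare_sort_values(None, Some(1), false, Some(true)), Ordering::Less);
}
```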
Error Propagation from sort_by
Rust’s sort_by closure cannot return Result. AxiomDB uses the sort_err
pattern: errors are captured in Option<DbError> during the sort and returned
after it completes.
#![allow(unused)]
fn main() {
let mut sort_err: Option<DbError> = None;
rows.sort_by(|a, b| {
match compare_rows_for_sort(a, b, order_items) {
Ok(ord) => ord,
Err(e) => { sort_err = Some(e); Equal }
}
});
if let Some(e) = sort_err { return Err(e); }
}
DISTINCT — Deduplication
SELECT DISTINCT is applied after projection and before LIMIT/OFFSET, using
a HashSet<Vec<u8>> keyed by value_to_key_bytes.
fn apply_distinct(rows: Vec<Row>) -> Vec<Row>:
seen = HashSet::new()
for row in rows:
key = concat(value_to_key_bytes(v) for v in row)
if seen.insert(key): // first occurrence
keep row
Two rows are identical if every column value serializes to the same bytes.
Critically, NULL → [0x00] means two NULLs are considered equal for
deduplication — only one row with a NULL in that position is kept. This is the
SQL standard behavior for DISTINCT, and is different from equality comparison
where NULL = NULL returns UNKNOWN.
LIMIT / OFFSET — Row-Count Coercion (Phase 4.10d)
apply_limit_offset runs after ORDER BY and DISTINCT. It calls
eval_row_count_as_usize for each row-count expression.
Row-count coercion contract
| Evaluated value | Result |
|---|---|
| `Int(n)` where n ≥ 0 | n as usize |
| `BigInt(n)` where n ≥ 0 | `usize::try_from(n)` — errors on overflow |
| `Text(s)` where `s.trim()` parses as an exact base-10 integer ≥ 0 | parsed value as usize |
| negative `Int` or `BigInt` | `DbError::TypeMismatch` |
| non-integral `Text` ("10.1", "1e3", "abc") | `DbError::TypeMismatch` |
| NULL, Bool, Real, Decimal, Date, Timestamp | `DbError::TypeMismatch` |
Text coercion is intentionally narrow: only exact base-10 integers are accepted. Scientific notation, decimal fractions, and time-like strings are all rejected.
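A simplified sketch of the coercion contract. It drops the `BigInt`/`Decimal` variants, uses string error payloads instead of `DbError`, and leans on `u64` parsing to reject fractions, scientific notation, and negative strings in one step:

```rust
/// Reduced Value enum for the sketch.
#[allow(dead_code)]
enum Value {
    Null,
    Int(i64),
    Text(String),
    Real(f64),
}

/// Sketch of the 4.10d row-count coercion contract.
fn eval_row_count_as_usize(v: &Value) -> Result<usize, String> {
    match v {
        Value::Int(n) if *n >= 0 => Ok(*n as usize),
        // Exact base-10 integers only: "10.1", "1e3", "abc", "-1" all fail u64 parsing.
        Value::Text(s) => s
            .trim()
            .parse::<u64>()
            .map(|n| n as usize)
            .map_err(|_| format!("TypeMismatch: {:?} is not a row count", s)),
        _ => Err("TypeMismatch: row count must be a non-negative integer".into()),
    }
}

fn main() {
    assert_eq!(eval_row_count_as_usize(&Value::Int(10)), Ok(10));
    assert_eq!(eval_row_count_as_usize(&Value::Text(" 2 ".into())), Ok(2));
    assert!(eval_row_count_as_usize(&Value::Text("10.1".into())).is_err());
    assert!(eval_row_count_as_usize(&Value::Int(-1)).is_err());
    assert!(eval_row_count_as_usize(&Value::Null).is_err());
}
```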
Why Text is accepted
The prepared-statement SQL-string substitution path serializes a Value::Text("2")
parameter as LIMIT '2' in the generated SQL. Without Text coercion, the fallback
path would always fail for string-bound LIMIT parameters — which is the binding
type used by some MariaDB clients. Accepting exact integer Text keeps the
cached-AST prepared path and the SQL-string fallback path on identical semantics.
Why not reuse the general coerce() function here?
coerce() uses assignment-coercion semantics and would change the
error class to InvalidCoercion, masking the semantic error.
eval_row_count_as_usize implements the narrower 4.10d contract
directly in the executor, keeping the error class and message family consistent
for both prepared paths.
INSERT … SELECT — MVCC Isolation
INSERT INTO target SELECT ... FROM source executes the SELECT phase under
the same snapshot as any other read in the transaction — fixed at BEGIN.
This prevents the “Halloween problem”: rows inserted by this INSERT have
txn_id_created = current_txn_id. The snapshot was taken before any insert
occurred, so snapshot_id ≤ current_txn_id. The MVCC visibility rule
(txn_id_created < snapshot_id) causes newly inserted rows to be invisible to
the SELECT scan. The result:
- If `source = target` (inserting from a table into itself): the SELECT sees exactly the rows that existed at `BEGIN`. The inserted copies are not re-scanned. No infinite loop.
- If another transaction inserts rows into `source` after this transaction’s `BEGIN`: those rows are also invisible (consistent snapshot).
Before BEGIN: source = {row1, row2}
After BEGIN: snapshot_id = 3 (max_committed = 2)
INSERT INTO source SELECT * FROM source:
SELECT sees: {row1 (xmin=1), row2 (xmin=2)} — both have xmin < snapshot_id ✅
Inserts: row3 (xmin=3), row4 (xmin=3) — xmin = current_txn_id = 3
SELECT does NOT see row3 or row4 (xmin ≮ snapshot_id) ✅
After COMMIT: source = {row1, row2, row3, row4} ← exactly 2 new rows, not infinite
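The walkthrough's visibility decision reduces to one comparison. This sketch ignores aborted transactions, deletes (xmax), and same-transaction visibility rules, modeling only the rule the example uses:

```rust
/// Visibility rule from the walkthrough: a row version is visible when its
/// creating transaction falls before the snapshot.
fn row_visible(xmin: u64, snapshot_id: u64) -> bool {
    xmin < snapshot_id
}

fn main() {
    let snapshot_id = 3; // fixed at BEGIN
    let current_txn = 3;
    // Pre-existing rows are visible to the INSERT ... SELECT scan.
    assert!(row_visible(1, snapshot_id));
    assert!(row_visible(2, snapshot_id));
    // Rows inserted by the statement itself (xmin = 3) are not re-scanned.
    assert!(!row_visible(current_txn, snapshot_id));
}
```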
Subquery Execution
Subquery execution is integrated into the expression evaluator via the
SubqueryRunner trait. This design lets the compiler eliminate all subquery
dispatch overhead from the non-subquery path.
SubqueryRunner Trait
#![allow(unused)]
fn main() {
pub trait SubqueryRunner {
fn eval_scalar(&mut self, subquery: &SelectStmt) -> Result<Value, DbError>;
fn eval_in(&mut self, subquery: &SelectStmt, needle: &Value) -> Result<Value, DbError>;
fn eval_exists(&mut self, subquery: &SelectStmt) -> Result<bool, DbError>;
}
}
All expression evaluation is dispatched through eval_with<R: SubqueryRunner>:
#![allow(unused)]
fn main() {
pub fn eval_with<R: SubqueryRunner>(
expr: &Expr,
row: &Row,
runner: &mut R,
) -> Result<Value, DbError>
}
Two concrete implementations exist:
| Implementation | Purpose |
|---|---|
| `NoSubquery` | Used for simple expressions with no subqueries. All three SubqueryRunner methods are `unreachable!()`. Monomorphization guarantees they are dead code. |
| `ExecSubqueryRunner<'a>` | Used when the query contains at least one subquery. Holds mutable references to storage, the transaction manager, and the outer row for correlated access. |
Taking SubqueryRunner as a generic trait parameter — rather than a runtime Option<&mut dyn FnMut> or a boolean flag — allows the compiler to generate two separate code paths: eval_with::<NoSubquery> and eval_with::<ExecSubqueryRunner>. In the NoSubquery path, every subquery branch is dead code and is eliminated by LLVM. A runtime option would add a pointer-width check plus a potential indirect call on every expression node evaluation, even for the 99% of expressions that have no subqueries.
Scalar Subquery Evaluation
ExecSubqueryRunner::eval_scalar executes the inner SelectStmt fully using
the existing execute_select path, then inspects the result:
eval_scalar(subquery):
result = execute_select(subquery, storage, txn)
match result.rows.len():
0 → Value::Null
1 → result.rows[0][0] // single column, single row
n > 1 → Err(CardinalityViolation { returned: n })
The inner SELECT is always run with a fresh output context. It inherits the outer transaction snapshot so it sees the same consistent view as the outer query.
IN Subquery Evaluation
eval_in materializes the subquery result into a HashSet<Value>, then applies
three-valued logic:
eval_in(subquery, needle):
rows = execute_select(subquery)
values: HashSet<Value> = rows.map(|r| r[0]).collect()
if values.contains(needle):
return Value::Bool(true)
if values.contains(Value::Null):
return Value::Null // unknown — could match
return Value::Bool(false)
For NOT IN, the calling code wraps the result: TRUE → FALSE, FALSE → TRUE,
NULL → NULL (NULL propagates unchanged).
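The three-valued IN logic, modeled with `Option<i64>` set members (None = NULL) and a non-NULL needle (NULL-needle handling is not covered by the pseudocode above):

```rust
/// Three-valued IN evaluation, following the pseudocode above.
fn eval_in(set: &[Option<i64>], needle: i64) -> Option<bool> {
    if set.contains(&Some(needle)) {
        return Some(true);
    }
    if set.contains(&None) {
        return None; // UNKNOWN: a NULL in the set could have matched
    }
    Some(false)
}

/// NOT IN wraps the result; NULL (None) propagates unchanged.
fn eval_not_in(set: &[Option<i64>], needle: i64) -> Option<bool> {
    eval_in(set, needle).map(|b| !b)
}

fn main() {
    assert_eq!(eval_in(&[Some(1), Some(2)], 2), Some(true));
    assert_eq!(eval_in(&[Some(1), None], 2), None); // unknown, not false
    assert_eq!(eval_in(&[Some(1)], 2), Some(false));
    assert_eq!(eval_not_in(&[Some(1), None], 2), None); // NULL propagates
}
```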
EXISTS Evaluation
eval_exists executes the subquery and checks whether the result set is non-empty.
No rows are materialized beyond the first:
eval_exists(subquery):
rows = execute_select(subquery)
return !rows.is_empty() // always bool, never null
Correlated Subqueries — substitute_outer
Before executing a correlated subquery, ExecSubqueryRunner walks the subquery
AST and replaces every Expr::OuterColumn { col_idx, depth: 1 } with a concrete
Expr::Literal(value) from the current outer row. This operation is called
substitute_outer:
substitute_outer(expr_tree, outer_row):
for each node in expr_tree:
if node = OuterColumn { col_idx, depth: 1 }:
replace with Literal(outer_row[col_idx])
if node = OuterColumn { col_idx, depth: d > 1 }:
decrement depth by 1 // pass through for deeper nesting
After substitution, the subquery is a fully self-contained statement with no
outer references, and it is executed by the standard execute_select path.
Re-execution happens once per outer row: for a correlated EXISTS in a query
that produces 10,000 outer rows, the inner query is executed 10,000 times.
For large datasets, rewriting as a JOIN is recommended.
Derived Table Execution
A derived table (FROM (SELECT ...) AS alias) is materialized once at the
start of query execution, before any scan or filter of the outer query begins:
execute_select(stmt):
for each TableRef::Derived { subquery, alias } in stmt.from:
materialized[alias] = execute_select(subquery) // fully materialized in memory
// outer query scans materialized[alias] as if it were a base table
The materialized result is an in-memory Vec<Row> wrapped in a
MaterializedTable. The outer query uses the derived table’s output schema
(column names from the inner SELECT list) for column resolution.
Derived tables are not correlated — they cannot reference columns from the outer
query. Lateral joins (which allow correlation in FROM) are not yet supported.
Foreign Key Enforcement
FK constraints are validated during DML operations by crates/axiomdb-sql/src/fk_enforcement.rs.
Catalog Storage
Each FK is stored as a FkDef row in the axiom_foreign_keys heap (5th system table,
root page at meta offset 84). Fields:
fk_id, child_table_id, child_col_idx, parent_table_id, parent_col_idx,
on_delete: FkAction, on_update: FkAction, fk_index_id: u32, name: String
FkAction encoding: 0=NoAction, 1=Restrict, 2=Cascade, 3=SetNull, 4=SetDefault.
fk_index_id != 0 → FK auto-index exists (composite key, Phase 6.9).
fk_index_id = 0 → no auto-index; enforcement falls back to full table scan.
FK auto-index — composite key (fk_val | RecordId) (Phase 6.9)
Each FK constraint auto-creates a B-Tree index on the child FK column using a composite key format that makes every entry globally unique:
key = encode_index_key(&[fk_val]) ++ encode_rid(rid) (10-byte RecordId suffix)
This follows InnoDB’s approach of appending the primary key as a tiebreaker
(row0row.cc). Every entry is unique even when many rows share the same FK value.
Range scan for all children with a given parent key:
lo = encode_index_key(&[parent_key]) ++ [0x00; 10]  // smallest RecordId
hi = encode_index_key(&[parent_key]) ++ [0xFF; 10]  // largest RecordId
children = BTree::range_in(fk_index_root, lo, hi)   // O(log n + k)
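Building those bounds is a plain byte-level operation, sketched here with an assumed helper name (`child_range_bounds`): the already-encoded parent key bytes get a 10-byte suffix padded to the extremes of the RecordId space.

```rust
/// Given the index-key encoding of a parent key, produce the [lo, hi]
/// composite-key bounds that cover every possible 10-byte RecordId suffix,
/// so one range scan returns all child rows referencing that parent.
fn child_range_bounds(parent_key_bytes: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut lo = parent_key_bytes.to_vec();
    lo.extend_from_slice(&[0x00; 10]); // smallest possible RecordId
    let mut hi = parent_key_bytes.to_vec();
    hi.extend_from_slice(&[0xFF; 10]); // largest possible RecordId
    (lo, hi)
}
```

Because the suffix length is fixed, `lo` and `hi` sort strictly around every real composite key for that parent value and never collide with a neighboring parent key's range.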
INSERT / UPDATE child — check_fk_child_insert
For each FK on the child table:
1. FK column is NULL → skip (MATCH SIMPLE)
2. Encode FK value as B-Tree key
3. Find parent's PK or UNIQUE index covering parent_col_idx
4. Bloom shortcut: if filter says absent → ForeignKeyViolation immediately
5. BTree::lookup_in(parent_index_root, key) — O(log n)
6. No match → ForeignKeyViolation (SQLSTATE 23503)
PK indexes are populated on every INSERT since Phase 6.9 (removed !is_primary
filter in insert_into_indexes). All index types now use B-Tree + Bloom lookup.
DELETE parent — enforce_fk_on_parent_delete
Called before the parent rows are deleted. For each FK referencing this table:
| Action | Behavior |
|---|---|
| RESTRICT / NO ACTION | BTree::range_in(fk_index) — O(log n); error if any child found |
| CASCADE | Range scan finds all children; recursive delete (depth limit = 10) |
| SET NULL | Range scan finds all children; updates FK column to NULL |
Cascade recursion uses depth parameter — exceeding 10 levels returns
ForeignKeyCascadeDepth (SQLSTATE 23503).
Query Planner Cost Gate (Phase 6.10)
Before returning IndexLookup or IndexRange, plan_select applies a cost gate
using per-column statistics to decide if the index scan is worth the overhead.
Algorithm
ndv = stats.ndv > 0 ? stats.ndv : DEFAULT_NUM_DISTINCT (= 200)
selectivity = 1.0 / ndv // equality predicate: 1/ndv rows match
if selectivity > 0.20:
return Scan // too many rows — full scan is cheaper
if stats.row_count < 1000:
return Scan // tiny table — index overhead not worth it
return IndexLookup / IndexRange // selective enough — use index
Constants derived from PostgreSQL:
- INDEX_SELECTIVITY_THRESHOLD = 0.20 (PG default: seq/random_page_cost = 0.25; AxiomDB is slightly more conservative for embedded storage)
- DEFAULT_NUM_DISTINCT = 200 (PG DEFAULT_NUM_DISTINCT in selfuncs.c)
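The cost gate is small enough to express directly. This sketch uses the constants from the text; the struct and enum shapes are assumptions, not the planner's real types:

```rust
const INDEX_SELECTIVITY_THRESHOLD: f64 = 0.20;
const DEFAULT_NUM_DISTINCT: u64 = 200;
const MIN_ROWS_FOR_INDEX: u64 = 1_000; // assumed name for the 1000-row floor

struct ColumnStats {
    ndv: u64,       // number of distinct values (0 = unknown)
    row_count: u64, // table cardinality at last ANALYZE
}

#[derive(PartialEq, Debug)]
enum AccessMethod {
    Scan,
    IndexLookup,
}

fn cost_gate(stats: &ColumnStats) -> AccessMethod {
    let ndv = if stats.ndv > 0 { stats.ndv } else { DEFAULT_NUM_DISTINCT };
    let selectivity = 1.0 / ndv as f64; // equality predicate: ~1/ndv rows match
    if selectivity > INDEX_SELECTIVITY_THRESHOLD {
        return AccessMethod::Scan; // too many rows — full scan is cheaper
    }
    if stats.row_count < MIN_ROWS_FOR_INDEX {
        return AccessMethod::Scan; // tiny table — index overhead not worth it
    }
    AccessMethod::IndexLookup
}
```

The two early returns encode the two failure modes separately: a low-NDV column fails on selectivity regardless of table size, and a tiny table fails on size regardless of selectivity.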
Stats are loaded once per SELECT
In execute_select_ctx, before calling plan_select:
let table_stats = CatalogReader::new(storage, snap)?.list_stats(table_id)?;
let access_method = plan_select(where_clause, indexes, columns, table_id,
                                &table_stats, &mut ctx.stats);
If table_stats is empty (pre-6.10 database or ANALYZE never run),
plan_select conservatively uses the index — never wrong, just possibly suboptimal.
Staleness (StaleStatsTracker)
StaleStatsTracker lives in SessionContext and tracks row changes per table:
INSERT / DELETE row → on_row_changed(table_id)
changes > 20% of baseline → mark stale
planner loads stats → set_baseline(table_id, row_count)
ANALYZE TABLE → mark_fresh(table_id)
When stale, the planner uses ndv = DEFAULT_NUM_DISTINCT = 200 regardless of
catalog stats, preventing stale low-NDV estimates from causing full scans on
high-selectivity columns.
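The tracker's bookkeeping can be sketched as two per-table counters. The internals here are assumed (the text specifies only the events and the 20% threshold):

```rust
use std::collections::HashMap;

/// Sketch of StaleStatsTracker: a table's stats are stale once row changes
/// since the last baseline exceed 20% of that baseline.
#[derive(Default)]
struct StaleStatsTracker {
    baseline: HashMap<u32, u64>, // table_id → row_count when stats were loaded
    changes: HashMap<u32, u64>,  // table_id → rows inserted/deleted since then
}

impl StaleStatsTracker {
    /// Called when the planner loads fresh stats for a table.
    fn set_baseline(&mut self, table_id: u32, row_count: u64) {
        self.baseline.insert(table_id, row_count);
        self.changes.insert(table_id, 0);
    }
    /// Called on every INSERT / DELETE of a row.
    fn on_row_changed(&mut self, table_id: u32) {
        *self.changes.entry(table_id).or_insert(0) += 1;
    }
    /// Called by ANALYZE TABLE.
    fn mark_fresh(&mut self, table_id: u32) {
        self.changes.insert(table_id, 0);
    }
    fn is_stale(&self, table_id: u32) -> bool {
        let base = *self.baseline.get(&table_id).unwrap_or(&0);
        let changed = *self.changes.get(&table_id).unwrap_or(&0);
        changed * 5 > base // changes > 20% of baseline, in integer arithmetic
    }
}
```

Using `changed * 5 > base` keeps the 20% comparison in integer arithmetic and treats an unknown baseline (0) as immediately stale after the first change.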
Bloom Filter — Index Lookup Shortcut
The executor holds a BloomRegistry (one per database connection) that maps
index_id → Bloom<Vec<u8>>. Before performing any B-Tree lookup for an index
equality predicate, the executor consults the filter:
// In execute_select_ctx — IndexLookup path
if !bloom.might_exist(index_def.index_id, &encoded_key) {
    // Definite absence: skip B-Tree entirely.
    return Ok(vec![]);
}
// False positive or true positive: proceed with B-Tree.
BTree::lookup_in(storage, index_def.root_page_id, &encoded_key)?
BloomRegistry API
pub struct BloomRegistry { /* per-index filters */ }
impl BloomRegistry {
    pub fn create(&mut self, index_id: u32, expected_items: usize);
    pub fn add(&mut self, index_id: u32, key: &[u8]);
    pub fn might_exist(&self, index_id: u32, key: &[u8]) -> bool;
    pub fn mark_dirty(&mut self, index_id: u32);
    pub fn remove(&mut self, index_id: u32);
}
might_exist returns true (conservative) for unknown index_ids — correct
behavior for indexes that existed before the current server session (no filter
was populated for them at startup).
DML Integration
Every DML handler in the execute_with_ctx path updates the registry:
| Handler | Bloom action |
|---|---|
| execute_insert_ctx | bloom.add(index_id, &key) after each B-Tree insert |
| execute_update_ctx | mark_dirty() for delete side (batch); add() for insert side |
| execute_delete_ctx | mark_dirty(index_id) once per index batch (5.19) |
| execute_create_index | create(index_id, n) then add() for every existing key |
| execute_drop_index | remove(index_id) |
Memory Budget
Each filter is sized at max(2 × expected_items, 1000) with a 1% FPR target
(~9.6 bits/key, 7 hash functions). For a 1M-row table with one secondary index:
2M × 9.6 bits ≈ 2.4 MB.
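The sizing arithmetic above can be checked with a small helper (an illustration of the stated rule, not an engine function):

```rust
/// Bytes needed for one filter under the stated sizing rule:
/// max(2 × expected_items, 1000) keys at ~9.6 bits per key (1% FPR target).
fn bloom_size_bytes(expected_items: usize) -> usize {
    let sized_items = (2 * expected_items).max(1000);
    let bits = (sized_items as f64 * 9.6).ceil() as usize;
    (bits + 7) / 8 // round bits up to whole bytes
}
```

For the 1M-row example: 2M keys × 9.6 bits = 19.2M bits ≈ 2.4 MB, matching the figure in the text.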
Dirty filters are rebuilt by ANALYZE TABLE (Phase 6.12),
mirroring PostgreSQL's lazy statistics-rebuild model.
IndexOnlyScan — Heap-Free Execution
When plan_select returns AccessMethod::IndexOnlyScan, the executor reads
all result values directly from the B-Tree key bytes, with only a lightweight
MVCC visibility check against the heap slot header.
This section applies to the heap executor path. Since 39.15, clustered tables
do not execute this path directly even if the planner initially detects a
covering opportunity. Clustered covering plans are normalized back to
clustered-aware lookup/range access, because clustered visibility lives in the
inline row header and clustered secondary indexes carry PK bookmarks instead of
stable heap RecordIds.
Clustered UPDATE (39.16)
Clustered tables no longer fall back to heap-era UPDATE logic. The executor now
routes explicit-PRIMARY KEY tables through clustered candidate discovery and
clustered rewrite primitives:
- discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
- capture the exact old clustered row image (RowHeader + full logical row bytes) before any mutation
- choose one of three clustered write paths:
  - same-key in-place rewrite via clustered_tree::update_in_place(...)
  - same-key relocation via clustered_tree::update_with_relocation(...)
  - PK change via delete_mark(old_pk) + insert(new_pk, ...)
- rewrite clustered secondary bookmark entries and register both index-insert and index-delete undo records so rollback can restore the old bookmark state
Clustered DELETE (39.17)
Clustered tables no longer fall back to heap-era DELETE logic either. The
executor now routes explicit-PRIMARY KEY tables through clustered candidate
discovery and clustered delete-mark primitives:
- discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
- decode the exact current clustered row image before any mutation
- enforce parent-side foreign-key restrictions before the first delete-mark
- call clustered_tree::delete_mark(...) for each matched primary key
- record EntryType::ClusteredDeleteMark with the exact old and new row images so rollback/savepoints restore the original header and payload bytes
- leave clustered secondary bookmark entries in place for deferred cleanup during later clustered VACUUM work
Clustered VACUUM (39.18)
Clustered tables now have their own executor-visible maintenance path too.
VACUUM table_name dispatches by table storage layout:
- compute oldest_safe_txn = max_committed + 1
- descend once to the leftmost clustered leaf
- walk the next_leaf chain and remove cells whose txn_id_deleted is safe
- free any overflow chain owned by each purged cell
- defragment the leaf when freeblock waste exceeds the page-local threshold
- scan each clustered secondary index, decode the PK bookmark from the physical secondary key, and keep only entries whose clustered row still exists physically after the leaf purge
- persist any secondary root rotation caused by bulk delete back into the catalog
Execution Path
IndexOnlyScan { index_def, lo, hi, n_key_cols, needed_key_positions }:
for (rid, key_bytes) in BTree::range_in(storage, index_def.root_page_id, lo, hi):
page_id = rid.page_id
slot_id = rid.slot_id
// MVCC: read only the 24-byte RowHeader — no full row decode.
visible = HeapChain::is_slot_visible(storage, page_id, slot_id, snap)
if !visible:
continue
// Extract column values from B-Tree key bytes (no heap page needed).
(decoded_cols, _) = decode_index_key(&key_bytes, n_key_cols)
// Project only the columns the query requested.
row = needed_key_positions.iter().map(|&p| decoded_cols[p].clone()).collect()
emit row
The 24-byte RowHeader contains txn_id_created, txn_id_deleted, and a
sequence number — enough for full MVCC visibility evaluation without loading
the row payload.
decode_index_key — Self-Delimiting Key Decoder
decode_index_key lives in key_encoding.rs and is the exact inverse of
encode_index_key. It uses type tags embedded in the key bytes to self-delimit
each value without needing an external schema:
| Tag byte | Type | Encoding |
|---|---|---|
| 0x00 | NULL | tag only, 0 payload bytes |
| 0x01 | Bool | tag + 1 byte (0 = false, 1 = true) |
| 0x02 | Int (positive, 1 B) | tag + 1 LE byte |
| 0x03 | Int (positive, 2 B) | tag + 2 LE bytes |
| 0x04 | Int (positive, 4 B) | tag + 4 LE bytes |
| 0x05 | Int (negative, 4 B) | tag + 4 LE bytes (i32) |
| 0x06 | BigInt (positive, 1 B) | tag + 1 byte |
| 0x07 | BigInt (positive, 4 B) | tag + 4 LE bytes |
| 0x08 | BigInt (positive, 8 B) | tag + 8 LE bytes |
| 0x09 | BigInt (negative, 8 B) | tag + 8 LE bytes (i64) |
| 0x0A | Real | tag + 8 LE bytes (f64 bits) |
| 0x0B | Text | tag + NUL-terminated UTF-8 (NUL = end marker) |
| 0x0C | Bytes | tag + NUL-escaped bytes (0x00 → [0x00, 0xFF], NUL terminator = [0x00, 0x00]) |
// Signature
pub fn decode_index_key(key: &[u8], n_cols: usize) -> (Vec<Value>, usize)
// Returns: (decoded column values, total bytes consumed)
The self-delimiting format means decode_index_key requires no column type
metadata — the tag bytes carry all necessary type information. This is the
same approach used by SQLite’s record format and RocksDB’s comparator-encoded
keys.
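A sketch of the tag-driven decode loop, covering only a subset of the tags from the table above (NULL, Bool, 4-byte Int, Text) with a simplified `Value` enum — enough to show why no external schema is needed:

```rust
/// Simplified value type for the sketch.
#[derive(PartialEq, Debug)]
enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Text(String),
}

/// Each tag byte tells the decoder exactly how many payload bytes follow,
/// so columns can be peeled off the key left to right with no metadata.
fn decode_index_key(key: &[u8], n_cols: usize) -> (Vec<Value>, usize) {
    let mut out = Vec::with_capacity(n_cols);
    let mut pos = 0;
    while out.len() < n_cols {
        let tag = key[pos];
        pos += 1;
        match tag {
            0x00 => out.push(Value::Null), // tag only, no payload
            0x01 => {
                out.push(Value::Bool(key[pos] == 1));
                pos += 1;
            }
            0x04 => {
                // positive Int, 4 LE bytes
                let mut buf = [0u8; 4];
                buf.copy_from_slice(&key[pos..pos + 4]);
                out.push(Value::Int(u32::from_le_bytes(buf) as i64));
                pos += 4;
            }
            0x0B => {
                // Text: NUL-terminated UTF-8
                let end = pos + key[pos..].iter().position(|&b| b == 0).unwrap();
                out.push(Value::Text(String::from_utf8(key[pos..end].to_vec()).unwrap()));
                pos = end + 1; // skip the NUL terminator
            }
            _ => unimplemented!("tag not covered in this sketch"),
        }
    }
    (out, pos)
}
```

The returned byte count is what lets callers strip the key columns and treat any remaining bytes (e.g. a RecordId suffix) separately.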
Full-Width Row Layout in IndexOnlyScan Output
IndexOnlyScan emits full-width rows — the same width as a heap row — with
index key column values placed at their table col_idx positions and NULL
everywhere else. This is required because downstream operators (WHERE
re-evaluation, projection, expression evaluator) all address columns by their
original table column index, not by SELECT output position.
table: (id INT [0], name TEXT [1], age INT [2], dept TEXT [3])
index: ON (age, dept) ← covers col_idx 2 and 3
IndexOnlyScan emits: [NULL, NULL, <age_val>, <dept_val>]
col0 col1 col2 col3
If the executor placed decoded values at positions 0, 1, ... instead, a
WHERE age > 25 re-evaluation would read col_idx=2 from a 2-element row and
panic with ColumnIndexOutOfBounds. The full-width layout eliminates this class
of error entirely.
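The scatter into a full-width row is a one-pass placement, sketched here with an assumed helper name (`scatter_full_width`) and a simplified value type:

```rust
#[derive(Clone, PartialEq, Debug)]
enum Value {
    Null,
    Int(i64),
}

/// Place each decoded index-key value at its table col_idx position;
/// every column the index does not cover stays NULL.
fn scatter_full_width(
    decoded: &[Value],
    key_col_idxs: &[usize], // table col_idx of each index key column, in key order
    table_width: usize,     // total column count of the base table
) -> Vec<Value> {
    let mut row = vec![Value::Null; table_width];
    for (v, &col_idx) in decoded.iter().zip(key_col_idxs) {
        row[col_idx] = v.clone();
    }
    row
}
```

With the document's example — an index on (age, dept) covering col_idx 2 and 3 of a 4-column table — the output row is `[NULL, NULL, age, dept]`, so a downstream `WHERE age > 25` re-evaluation reads col_idx 2 safely.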
execute_with_ctx — Required for IndexOnlyScan Selection
The planner selects IndexOnlyScan only when select_col_idxs (the set of
columns touched by the query) is a subset of the index’s key columns. The
select_col_idxs argument is supplied by execute_with_ctx; the simpler
execute entry-point passes an empty slice, so IndexOnlyScan is never selected
through it.
Test coverage for this path lives in
crates/axiomdb-sql/tests/integration_index_only.rs — functions prefixed
test_ctx_ use execute_with_ctx with real select_col_idxs and are the
only tests that exercise the IndexOnlyScan access method end-to-end.
Non-Unique Secondary Index Key Format
Non-unique secondary indexes append a 10-byte RecordId suffix to every
B-Tree key to guarantee uniqueness across all entries:
key = encode_index_key(col_vals) || encode_rid(rid)
^^^^^^^^^^^^^^
page_id (4 B) + slot_id (2 B) + seq (4 B) = 10 bytes
This prevents DuplicateKey errors when two rows share the same indexed value,
because the RecordId suffix always makes the full key distinct.
Lookup Bounds for Non-Unique Indexes
To find all rows matching a specific indexed value, the executor performs a
range scan using synthetic [lo, hi] bounds that span all possible RecordId
suffixes:
lo = encode_index_key(&[val]) ++ [0x00; 10]  // smallest RecordId
hi = encode_index_key(&[val]) ++ [0xFF; 10]  // largest RecordId
BTree::range_in(root, lo, hi)                // returns all entries for val
This mirrors InnoDB's non-unique secondary index format, which appends the
primary key as a tiebreaker (row0row.cc). AxiomDB uses the
RecordId (page_id + slot_id + sequence) instead of a separate
primary key column, keeping the suffix at a fixed 10 bytes regardless of the
table's key type — simpler to encode and guaranteed to be globally unique within
the storage engine's address space.
Performance Characteristics
| Operation | Time complexity | Notes |
|---|---|---|
| Table scan | O(n) | HeapChain linear traversal |
| Nested loop JOIN | O(n × m) | Both sides materialized before loop |
| Hash GROUP BY | O(n) | One pass; O(k) memory where k = distinct groups |
| Sorted GROUP BY | O(n) | One pass; O(1) accumulator memory per group |
| Sort ORDER BY | O(n log n) | sort_by (stable, in-memory) |
| DISTINCT | O(n) | One HashSet pass |
| LIMIT/OFFSET | O(1) after sort | skip(offset).take(limit) |
All operations are in-memory for Phase 4. External sort and hash spill for large datasets are planned for Phase 14 (vectorized execution).
AUTO_INCREMENT Execution
Per-Table Sequence State
Each table that has an AUTO_INCREMENT column maintains a sequence counter.
The counter is stored as a thread-local HashMap<String, i64> keyed by table
name, lazily initialized on the first INSERT:
auto_increment_next(table_name):
if table_name not in thread_local_map:
max_existing = MAX(id) from HeapChain scan, or 0 if table is empty
thread_local_map[table_name] = max_existing + 1
value = thread_local_map[table_name]
thread_local_map[table_name] += 1
return value
The MAX+1 lazy-init strategy means the sequence is always consistent with
existing data, even after rows are inserted by a previous session or after
a crash recovery.
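The lazy-init counter can be sketched directly with a thread-local map. The `max_existing` closure stands in for the HeapChain MAX(id) scan described above; the function names are the pseudocode's, not necessarily the engine's:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // table name → next AUTO_INCREMENT value, lazily initialized
    static AUTO_INC: RefCell<HashMap<String, i64>> = RefCell::new(HashMap::new());
}

/// First call for a table seeds the counter from MAX(id) + 1;
/// subsequent calls just advance it.
fn auto_increment_next(table: &str, max_existing: impl FnOnce() -> i64) -> i64 {
    AUTO_INC.with(|m| {
        let mut m = m.borrow_mut();
        let next = m
            .entry(table.to_string())
            .or_insert_with(|| max_existing() + 1);
        let value = *next;
        *next += 1;
        value
    })
}

/// TRUNCATE-style reset: dropping the entry forces MAX+1 re-init on next INSERT.
fn reset_sequence(table: &str) {
    AUTO_INC.with(|m| {
        m.borrow_mut().remove(table);
    });
}
```

Note that after the first call the `max_existing` closure is never invoked again — the cached counter is authoritative until the entry is removed.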
MySQL itself has used two counter strategies — InnoDB before 8.0 recomputed the
counter from MAX(id) at startup, while 8.0 persists it across restarts — and lazy MAX+1
initialization is compatible with either approach.
Explicit Value Bypass
When the INSERT column list includes the AUTO_INCREMENT column with a non-NULL value, the explicit value is used directly and the sequence counter is not advanced:
for each row to insert:
if auto_increment_col in provided_columns:
value = provided value // bypass — no counter update
else:
value = auto_increment_next(table_name)
session.last_insert_id = value // update only for generated IDs
LAST_INSERT_ID() is updated only when a value is auto-generated. Inserting
an explicit ID does not change the session’s last_insert_id.
Multi-Row INSERT
For INSERT INTO t VALUES (...), (...), ..., the executor calls
auto_increment_next once per row. last_insert_id is set to the value
generated for the first row before iterating through the rest:
ids = [auto_increment_next(t) for _ in rows]
session.last_insert_id = ids[0] // MySQL semantics
insert all rows with their respective ids
TRUNCATE — Sequence Reset
TRUNCATE TABLE t deletes all rows by scanning the HeapChain and marking
every visible row as deleted (same algorithm as DELETE FROM t without a
WHERE clause). After clearing the rows, it resets the sequence:
execute_truncate(table_name):
for row in HeapChain::scan_visible(table_name, snapshot):
storage.delete_row(row.record_id, txn_id)
thread_local_map.remove(table_name) // next insert re-initializes from MAX+1 = 1
return QueryResult::Affected { count: 0 }
Removing the entry from the map forces a MAX+1 re-initialization on the next
INSERT. Because the table is now empty, MAX = 0, so next = 1.
SHOW TABLES / SHOW COLUMNS
SHOW TABLES
SHOW TABLES [FROM schema] reads the catalog’s table registry and returns one
row per table. The output column is named Tables_in_<schema>:
execute_show_tables(schema):
tables = catalog.list_tables(schema)
column_name = "Tables_in_" + schema
return QueryResult::Rows { columns: [column_name], rows: [[t] for t in tables] }
SHOW COLUMNS / DESCRIBE
SHOW COLUMNS FROM t, DESCRIBE t, and DESC t are all dispatched to the
same handler. The executor reads the column definitions from the catalog and
constructs a fixed six-column result set:
execute_show_columns(table_name):
cols = catalog.get_table(table_name).columns
for col in cols:
Field = col.name
Type = col.data_type.to_sql_string()
Null = if col.nullable { "YES" } else { "NO" }
Key = if col.is_primary_key { "PRI" } else { "" }
Default = "NULL" // stub
Extra = if col.auto_increment { "auto_increment" } else { "" }
return six-column result set
The Key and Default fields are stubs: Key only reflects primary key
membership; composite keys, unique constraints, and foreign keys are not yet
surfaced. Default always shows "NULL" regardless of the declared default
expression. Full metadata exposure is planned for a later catalog enhancement.
ALTER TABLE Execution
ALTER TABLE dispatches to one of five handlers depending on the operation. Three of them (ADD COLUMN, DROP COLUMN, and MODIFY COLUMN) require rewriting every row in the table. The other two (RENAME COLUMN and RENAME TO) touch only the catalog.
Why Row Rewriting Is Needed
AxiomDB rows are stored as positional binary blobs. The null bitmap at the
start of each row has exactly ceil(column_count / 8) bytes — one bit per
column, in column-index order. Packed values follow immediately, with offsets
derived from the column types declared at write time.
Row layout (schema: id BIGINT, name TEXT, age INT):
null_bitmap (1 byte) [b0=id_null, b1=name_null, b2=age_null, ...]
id (8 bytes, LE i64) [only present if b0=0]
name (4-byte len + UTF-8 bytes) [only present if b1=0]
age (4 bytes, LE i32) [only present if b2=0]
When the column count changes, the null bitmap size changes and all subsequent offsets shift. A row written under the old schema cannot be decoded against the new schema — the null bitmap has the wrong number of bits, and value positions no longer align. Every row must therefore be rewritten to match the new layout.
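The layout for the example schema can be sketched as a concrete encoder (an illustration of the described format, with an assumed helper name, not the engine's `encode_row`):

```rust
/// Positional encoding for (id BIGINT, name TEXT, age INT):
/// a 1-byte null bitmap (bit i = column i is NULL), then each
/// non-NULL value packed in column order.
fn encode_row(id: Option<i64>, name: Option<&str>, age: Option<i32>) -> Vec<u8> {
    let mut bitmap = 0u8;
    if id.is_none() { bitmap |= 1 << 0; }
    if name.is_none() { bitmap |= 1 << 1; }
    if age.is_none() { bitmap |= 1 << 2; }
    let mut out = vec![bitmap];
    if let Some(v) = id {
        out.extend_from_slice(&v.to_le_bytes()); // 8 bytes, LE i64
    }
    if let Some(s) = name {
        out.extend_from_slice(&(s.len() as u32).to_le_bytes()); // 4-byte length prefix
        out.extend_from_slice(s.as_bytes());
    }
    if let Some(v) = age {
        out.extend_from_slice(&v.to_le_bytes()); // 4 bytes, LE i32
    }
    out
}
```

The encoder makes the ALTER TABLE problem concrete: a decoder for this 3-column layout hard-codes the bitmap width and the value order, so adding or dropping a column changes both and invalidates every previously written row.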
RENAME COLUMN does not change column positions or types — only the name entry
in the catalog changes. RENAME TO changes only the table name in the catalog.
Neither operation touches row data.
rewrite_rows Helper
ADD COLUMN, DROP COLUMN, and MODIFY COLUMN all use a shared rewrite_rows
dispatch. The implementation branches on storage format:
Heap tables:
rewrite_rows(table_name, old_schema, new_schema, transform_fn):
snapshot = txn.active_snapshot()
old_rows = HeapChain::scan_visible(table_name, snapshot)
for (record_id, old_row) in old_rows:
new_row = transform_fn(old_row)? // apply per-operation transformation
storage.delete_row(record_id, txn_id)
storage.insert_row(table_name, encode_row(new_row, new_schema), txn_id)
Clustered tables (rewrite_rows_clustered):
Clustered tables cannot use heap delete+reinsert because clustered_tree::insert
rejects duplicate primary keys even when the previous row is delete-marked. Instead,
each row is rewritten in place using update_with_relocation:
rewrite_rows_clustered(table_id, old_schema, new_schema, transform_fn):
snapshot = txn.active_snapshot()
rows = clustered_tree::range(table_id, Unbounded, Unbounded, snapshot)
for ClusteredRow { key, row_header, row_data } in rows:
old_row = decode_row(row_data, old_schema)
new_row = transform_fn(old_row)?
new_data = encode_row(new_row, new_schema)
txn.record_clustered_update(table_id, key, row_header+row_data, new_data)
new_root = clustered_tree::update_with_relocation(key, new_data)
if let Some(new_root_pid) = new_root {
catalog.set_root_page(table_id, new_root_pid)
}
update_with_relocation tries an in-place rewrite of the leaf slot. If the new
row is larger and the leaf page is full, it falls back to physical delete + reinsert
at the correct leaf position (no duplicate-key issue because the old entry is
physically removed before the new one is inserted).
The transform_fn is operation-specific and returns Result<Row, DbError> so
coercion failures abort the entire statement:
| Operation | transform_fn |
|---|---|
| ADD COLUMN | Append DEFAULT value (or NULL if no default) to the end of the row |
| DROP COLUMN | Remove the value at col_idx from the row vector |
| MODIFY COLUMN | Replace value at col_idx with coerce(value, new_type, Strict)? |
Ordering Constraint — Catalog Before vs. After Rewrite
The ordering of the catalog update relative to the row rewrite is not arbitrary. It is chosen so that a failure mid-rewrite leaves the database in a recoverable state:
ADD COLUMN — catalog update FIRST, then rewrite rows:
1. catalog.add_column(table_name, new_column_def)
2. rewrite_rows(old_schema → new_schema, append DEFAULT)
If the process crashes after step 1 but before step 2 completes, the catalog already reflects the new schema. The partially-rewritten rows are discarded by crash recovery (their transactions are uncommitted). On restart, the table is consistent: the new column exists in the catalog, and all rows either have been fully rewritten (if the transaction committed) or none have been (if it was rolled back).
DROP COLUMN — rewrite rows FIRST, then update catalog:
1. rewrite_rows(old_schema → new_schema, remove col at col_idx)
2. catalog.remove_column(table_name, col_idx)
If the process crashes after step 1 but before step 2, the rows have already been written in the new (narrower) layout but the catalog still shows the old schema. Recovery rolls back the uncommitted row rewrites and the catalog is never touched — the table is fully consistent under the old schema.
MODIFY COLUMN — rewrite rows FIRST (with strict coercion), then update catalog:
1. Guard: column not in secondary index (type change would break key encoding)
2. Guard: PK column cannot become nullable on clustered table
3. rewrite_rows(old_schema → new_schema, coerce(val, new_type, Strict)?)
4. catalog.delete_column(table_id, col_idx)
5. catalog.create_column(new_ColumnDef) // same col_idx, new type/nullable
If coercion fails for any row (e.g. TEXT → INT on a non-numeric value), the
error is returned immediately and no rows are changed. The statement is atomic:
either all rows are coerced successfully or none are.
The invariant is: the catalog always describes rows that can be decoded. Swapping the order for either operation would create a window where the catalog describes a schema that does not match the on-disk rows.
Session Cache Invalidation
The session holds a SchemaCache that maps table names to their column
definitions at the time the last query was prepared. After any ALTER TABLE
operation completes, the cache entry for the affected table is invalidated:
execute_alter_table(stmt):
// ... perform operation (catalog update + optional row rewrite) ...
session.schema_cache.invalidate(table_name)
This ensures that the next query against the altered table re-reads the catalog and sees the updated column list, rather than operating on a stale schema that may reference columns that no longer exist or omit newly added ones.
Index root invalidation on B+tree split
The SchemaCache also stores IndexDef.root_page_id for each index. When an
INSERT causes the B+tree root to split, insert_in allocates a new root page
and frees the old one. After this, the cached root_page_id points to a freed
page. If the cache is not invalidated, the next execute_insert_ctx call reads
IndexDef.root_page_id from the cache and passes it to BTree::lookup_in
(uniqueness check), causing a stale-pointer read on a freed page.
The fix: call ctx.invalidate_all() whenever any index root changes during
INSERT or DELETE index maintenance. This forces re-resolution from the catalog
(which always has the current root_page_id) on the next DML statement.
Since 5.19, DELETE and the old-key half of UPDATE no longer mutate indexes in
a per-row loop. They collect exact encoded keys per index, sort them, and call
delete_many_in(...) once per affected tree. The cache-invalidation rule still
matters, but the synchronization point moved:
- batch-delete old keys per index
- persist the final root once for that index
- update the in-memory current_indexes slice
- invalidate the session cache once after the statement
For UPDATE there is a second root-sync point: after the batch delete phase, the reinsertion half must start from the post-delete root, not from the stale root captured before the batch. Otherwise reinserting new keys after a root collapse would descend from a freed page.
// DELETE / UPDATE old-key batch
let updated = delete_many_from_indexes(...)?;
for (index_id, new_root) in updated {
    catalog.update_index_root(index_id, new_root)?;
    current_indexes[i].root_page_id = new_root;
}
// UPDATE new-key insert phase
let ins_updated = insert_into_indexes(&current_indexes, ...)?;
Stable-RID UPDATE Fast Path (5.20)
5.19 removed the old-key delete bottleneck, but UPDATE still paid the full
heap delete+insert path even when the new row could fit in the existing slot.
5.20 adds a second branch:
for each matched row:
old_row = ...
new_row = apply_set_assignments(old_row)
if encoded(new_row) fits in old slot:
rewrite tuple in place
rid stays identical
only maintain indexes whose logical key/predicate membership changed
else:
fallback to delete + insert
rid changes
treat affected indexes as before
The heap rewrite path is page-grouped. Rows that share a heap page are batched so
the executor reads the page once, rewrites all eligible slots, then writes the page
once. WAL records this branch as EntryType::UpdateInPlace, storing the old and new
tuple images for the same (page_id, slot_id).
This does not implement PostgreSQL HOT chains or forwarding pointers. The Phase 5 rule is narrower and cheaper to reason about: same-slot rewrite only, otherwise fall back to the existing delete+insert path.
Clustered UPDATE In-Place Zero-Alloc Fast Path (Phase 39.22)
fused_clustered_scan_patch in executor/update.rs implements a zero-allocation
UPDATE fast path for clustered tables when all SET columns are fixed-size.
Allocation audit
| Allocation | Before 39.22 | After 39.22 |
|---|---|---|
| cell.row_data.to_vec() (phase-1 offset scan) | 1× per matched row | ❌ eliminated |
| patched_data = ...clone() (phase-2 mutation) | 1× per matched row | ❌ eliminated |
| encode_cell_image() in overflow path | 1× per matched row | ✅ overflow-only |
| FieldDelta.old_bytes: Vec<u8> | 1× per changed field | ❌ → [u8;8] inline |
| FieldDelta.new_bytes: Vec<u8> | 1× per changed field | ❌ → [u8;8] inline |
For 25K rows with 1 changed column each: ~125K heap allocations → 0.
Two-phase borrow pattern
The Rust borrow checker requires releasing the immutable page borrow before taking a mutable one. The fast path uses a split-phase approach:
Read phase (immutable borrow on page):
1. cell_row_data_abs_off(&page, idx) → (row_data_abs_off, key_len)
2. compute_field_location_runtime(row_slice, bitmap) → FieldLocation
3. MAYBE_NOP: if page_bytes[field_abs..][..size] == new_encoded[..size] { skip }
4. Capture old_buf: [u8;8] and new_encoded: [u8;8] on the stack
Write phase (mutable borrow after immutable dropped):
5. patch_field_in_place(&mut page, field_abs, new_buf[..size])
6. update_row_header_in_place(&mut page, idx, &new_header)
MAYBE_NOP (byte-identity check)
If the new encoded bytes are byte-identical to the existing page bytes
(e.g., SET score = score * 1 after integer multiplication), the field is
skipped entirely — no WAL delta, no header bump, no page write for that field.
This is an O(size) byte comparison (~4–8 bytes) before any mutation.
Overflow fallback
Cells with overflow_first_page.is_some() are rare (<1% of typical workloads)
and fall back to the existing rewrite_cell_same_key_with_overflow path
(full cell re-encode). The fast path only applies to inline cells.
InnoDB's equivalent in-place update path (btr_cur_upd_rec_in_place) still allocates an undo record per row for ROLLBACK support. AxiomDB's UndoClusteredFieldPatch stores undo data as inline [u8;8] arrays in the undo log entry — zero heap allocation per row even with ROLLBACK support. For a 25K-row UPDATE t SET score = score + 1, this reduces allocator pressure from ~125K allocs to zero.
Strict Mode and Warning 1265
SessionContext.strict_mode is a bool flag (default true) that controls
how INSERT and UPDATE column coercion failures are handled.
Coercion paths
INSERT / UPDATE column value assignment:
if ctx.strict_mode:
coerce(value, target_type, CoercionMode::Strict)
→ Ok(v) : use v
→ Err(e) : return Err immediately (SQLSTATE 22018)
else:
coerce(value, target_type, CoercionMode::Strict)
→ Ok(v) : use v (no warning — strict succeeded)
→ Err(_) : try CoercionMode::Permissive
→ Ok(v) : use v, emit ctx.warn(1265, "Data truncated for column '<col>' at row <n>")
→ Err(e): return Err (both paths failed)
CoercionMode::Permissive performs best-effort conversion: '42abc' → 42,
'abc' → 0, overflowing integers clamped to the type bounds.
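The permissive conversions listed above can be sketched for a single target type. This is an illustration of the described behavior (assumed helper name, i32 target), not the engine's coercion module:

```rust
/// Best-effort string → i32: keep the leading signed-digit prefix
/// ('42abc' → 42), fall back to 0 when there are no usable digits
/// ('abc' → 0), and clamp overflow to the target type's bounds.
fn permissive_to_i32(s: &str) -> i32 {
    let trimmed = s.trim();
    // Byte length of the leading [sign]digits prefix, if any.
    let digit_end = trimmed
        .char_indices()
        .take_while(|&(i, c)| c.is_ascii_digit() || (i == 0 && (c == '-' || c == '+')))
        .map(|(i, c)| i + c.len_utf8())
        .last()
        .unwrap_or(0);
    match trimmed[..digit_end].parse::<i64>() {
        Ok(v) => v.clamp(i32::MIN as i64, i32::MAX as i64) as i32,
        Err(_) => 0, // no usable leading digits (or bare sign) → 0
    }
}
```

In the non-strict flow above, a value that reaches this path has already failed strict coercion, so any result it produces is accompanied by warning 1265.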
Row numbering
insert_row_with_ctx and insert_rows_batch_with_ctx accept an explicit
row_num: usize (1-based). The VALUES loop in execute_insert_ctx passes
row_idx + 1 from enumerate():
for (row_idx, value_exprs) in rows.into_iter().enumerate() {
    let values = eval_value_exprs(value_exprs, ...)?;
    engine.insert_row_with_ctx(&mut ctx, values, row_idx + 1)?;
}
This makes warning 1265 messages meaningful for multi-row inserts:
"Data truncated for column 'stock' at row 2".
SET strict_mode / SET sql_mode
The executor intercepts SET strict_mode and SET sql_mode in execute_set_ctx
(called from dispatch_ctx). It delegates to helpers from session.rs:
"strict_mode" => {
    let b = parse_boolish_setting(&raw)?;
    ctx.strict_mode = b;
}
"sql_mode" => {
    let normalized = normalize_sql_mode(&raw);
    ctx.strict_mode = sql_mode_is_strict(&normalized);
}
The wire layer (handler.rs) syncs the wire-visible @@sql_mode and
@@strict_mode variables with the session bool after every intercepted SET.
Both variables are surfaced in SHOW VARIABLES.
Lossless coercions (e.g. '42' → 42) never generate a warning in either mode —
matching MySQL 8's behavior, where warning 1265 is reserved for actual data
loss, not clean widening.
Roadmap and Phases
AxiomDB is developed in phases, each of which adds a coherent vertical slice of functionality. The design is organized in three blocks:
- Block 1 (Phases 1–7): Core engine — storage, indexing, WAL, transactions, SQL parsing, and concurrent MVCC.
- Block 2 (Phases 8–14): SQL completeness — full query planner, optimizer, advanced SQL features, and MySQL wire protocol.
- Block 3 (Phases 15–34): Production hardening — replication, backups, distributed execution, column store, and AI/ML integration.
Current Status
Last completed subphase: 40.1b CREATE INDEX on clustered tables — removed ensure_heap_runtime guard; CREATE INDEX / CREATE UNIQUE INDEX now work on clustered (PRIMARY KEY) tables using ClusteredSecondaryLayout-based index build with partial index, NULL-skipping, and uniqueness enforcement at build time.
Active development: Phase 40 — Clustered engine performance optimizations (40.1 ClusteredInsertBatch done; 40.1b CREATE INDEX on clustered tables done; statement plan cache, transaction write set, vectorized scan next)
Next milestone: 40.2 — Statement plan cache (per-session CachedPlanSource with OID-based invalidation)
Concurrency note: the current server already supports concurrent read-only
queries, but mutating statements are still serialized through a database-wide
Arc<RwLock<Database>> write guard. The next concurrency milestone is
Phase 13.7 row-level locking, followed by deadlock detection and explicit
locking clauses.
Phase Progress
Block 1 — Core Engine
| Phase | Name | Status | Key deliverables |
|---|---|---|---|
| 1.1 | Workspace setup | ✅ | Cargo workspace, crate structure |
| 1.2 | Page format | ✅ | 16 KB pages, header, CRC32c checksum |
| 1.3 | MmapStorage | ✅ | mmap-backed storage engine |
| 1.4 | MemoryStorage | ✅ | In-memory storage for tests |
| 1.5 | FreeList | ✅ | Bitmap page allocator |
| 1.6 | StorageEngine trait | ✅ | Unified interface + heap pages |
| 2.1 | B+ Tree insert/split | ✅ | CoW insert with recursive splits |
| 2.2 | B+ Tree delete | ✅ | Rebalance, redistribute, merge |
| 2.3 | B+ Tree range scan | ✅ | RangeIter with tree traversal |
| 2.4 | Prefix compression | ✅ | CompressedNode for internal keys |
| 3.1 | WAL entry format | ✅ | Binary format, CRC32c, backward scan |
| 3.2 | WAL writer | ✅ | WalWriter with file header |
| 3.3 | WAL reader | ✅ | Forward and backward iterators |
| 3.4 | TxnManager | ✅ | BEGIN/COMMIT/ROLLBACK, snapshot |
| 3.5 | Checkpoint | ✅ | 5-step checkpoint protocol |
| 3.6 | Crash recovery | ✅ | CRASHED→RECOVERING→REPLAYING→VERIFYING→READY |
| 3.7 | Durability tests | ✅ | 9 crash scenarios |
| 3.8 | Post-recovery checker | ✅ | Heap structural + MVCC invariants |
| 3.9 | Catalog bootstrap | ✅ | axiom_tables, axiom_columns, axiom_indexes |
| 3.10 | Catalog reader | ✅ | MVCC-aware schema lookup |
| 3.17 | WAL batch append | ✅ | record_insert_batch(): O(1) write_all for N entries via reserve_lsns+write_batch |
| 3.18 | WAL PageWrite | ✅ | EntryType::PageWrite=9: 1 WAL entry/page vs N/row; 238× fewer for 10K-row insert |
| 3.19 | WAL Group Commit | ✅ | CommitCoordinator: batches fsyncs across connections; up to 16× concurrent throughput |
| 4.1 | SQL AST | ✅ | All statement types |
| 4.2 | SQL lexer | ✅ | logos DFA, ~85 tokens, zero-copy |
| 4.3 | DDL parser | ✅ | CREATE/DROP/ALTER TABLE, CREATE/DROP INDEX |
| 4.4 | DML parser | ✅ | SELECT (all clauses), INSERT, UPDATE, DELETE |
| 4.17 | Expression evaluator | ✅ | Three-valued NULL logic, all operators |
| 4.18 | Semantic analyzer | ✅ | BindContext, col_idx resolution |
| 4.18b | Type coercion matrix | ✅ | coerce(), coerce_for_op(), CoercionMode strict/permissive |
| 4.23 | QueryResult type | ✅ | Row, ColumnMeta, QueryResult (Rows/Affected/Empty) |
| 4.5b | Table engine | ✅ | TableEngine scan/insert/delete/update over heap; later generalized by Phase 39 table-root metadata |
| 4.5 + 4.5a | Basic executor | ✅ | SELECT/INSERT/UPDATE/DELETE, DDL, txn control, SELECT without FROM |
| 4.25 + 4.7 | Error handling framework | ✅ | Complete SQLSTATE mapping; ErrorResponse{sqlstate,message,detail,hint} |
| 4.8 | JOIN (nested loop) | ✅ | INNER/LEFT/RIGHT/CROSS; USING; multi-table; FULL→NotImplemented |
| 4.9a+4.9c+4.9d | GROUP BY + Aggregates + HAVING | ✅ | COUNT/SUM/MIN/MAX/AVG; hash-based; HAVING; NULL grouping |
| 4.10+4.10b+4.10c | ORDER BY + LIMIT/OFFSET | ✅ | Multi-column; NULLS FIRST/LAST; LIMIT/OFFSET pagination |
| 4.12 | DISTINCT | ✅ | HashSet dedup on output rows; NULL=NULL; pre-LIMIT |
| 4.24 | CASE WHEN | ✅ | Searched + simple form; NULL semantics; all contexts |
| 4.6 | INSERT … SELECT | ✅ | Reuses execute_select; MVCC prevents self-reads |
| 6.1–6.3 | Secondary indexes + planner | ✅ | CREATE INDEX, index maintenance, B-Tree point/range lookup |
| 6.4 | Bloom filter per index | ✅ | BloomRegistry; zero B-Tree reads for definite-absent keys (1% FPR) |
| 6.5/6.6 | Foreign key constraints | ✅ | REFERENCES, ALTER TABLE FK; INSERT/DELETE/CASCADE/SET NULL enforcement |
| 6.7 | Partial UNIQUE index | ✅ | CREATE INDEX … WHERE predicate; soft-delete uniqueness pattern |
| 6.8 | Fill factor | ✅ | WITH (fillfactor=N) on CREATE INDEX; B-Tree leaf split at ⌈FF×ORDER_LEAF/100⌉ |
| 6.9 | FK + Index improvements | ✅ | PK B-Tree population; FK composite key index; composite index planner |
| 6.10–6.12 | Index statistics + ANALYZE | ✅ | Per-column NDV/row_count; planner cost gate (sel > 20% → Scan); ANALYZE command; staleness tracking |
| 6.16 | PK SELECT planner parity | ✅ | PRIMARY KEY equality/range now participate in single-table SELECT planning; PK equality bypasses the scan-biased cost gate |
| 6.17 | Indexed UPDATE candidate path | ✅ | UPDATE now discovers PK / indexed candidates through B-Tree access before entering the 5.20 write path |
| 6.18 | Indexed multi-row INSERT batch path | ✅ | Immediate multi-row VALUES statements now reuse grouped heap/index apply on indexed tables while preserving strict same-statement UNIQUE semantics |
| 6.19 | WAL fsync pipeline | 🔄 | Server commits now use an always-on leader-based fsync pipeline and the old timer-based CommitCoordinator path is gone, but the single-connection insert_autocommit benchmark still misses target throughput |
| 6.20 | UPDATE apply fast path | ✅ | PK-range UPDATE now batches candidate heap reads, skips no-op rows, batches UpdateInPlace WAL writes, and groups per-index delete+insert/root persistence |
| 5 | Executor (advanced) | ⚠️ Planned | JOIN, GROUP BY, ORDER BY, index lookup, aggregate |
| 6.8+ | Index statistics, FK improvements | ⚠️ Planned | Fill factor, composite FKs, ON UPDATE CASCADE, ANALYZE, index-only scans |
| 7 | Full MVCC | ⚠️ Planned | SSI, write-write conflicts, epoch reclamation |
Block 2 — SQL Completeness
| Phase | Name | Status | Key deliverables |
|---|---|---|---|
| 8 | Advanced SQL | ⚠️ Planned | Window functions, CTEs, recursive queries |
| 9 | VACUUM / GC | ⚠️ Planned | Dead row cleanup, freelist compaction |
| 10 | MySQL wire protocol | ⚠️ Planned | COM_QUERY, result set packets, handshake |
| 11 | TOAST | ⚠️ Planned | Out-of-line storage for large values |
| 12 | Full-text search | ⚠️ Planned | Inverted index, BM25 ranking |
| 13 | Foreign key checks | ⚠️ Planned | Constraint validation on insert/delete |
| 14 | Vectorized execution | ⚠️ Planned | SIMD scans, morsel-driven pipeline |
Block 3 — Production Hardening
| Phase | Name | Status |
|---|---|---|
| 15 | Connection pooling | ⚠️ Planned |
| 16 | Replication (primary-replica) | ⚠️ Planned |
| 17 | Point-in-time recovery (PITR) | ⚠️ Planned |
| 18 | Online DDL | ⚠️ Planned |
| 19 | Partitioning | ⚠️ Planned |
| 20 | Column store (HTAP) | ⚠️ Planned |
| 21 | VECTOR index (ANN) | ⚠️ Planned |
| 22–34 | Distributed, cloud-native, AI/ML | ⚠️ Future |
Block 4 — Platform Surfaces and Storage Evolution
| Phase | Name | Status | Key deliverables |
|---|---|---|---|
| 35 | Deployment and DevEx | ⚠️ Planned | Docker, config tooling, release UX |
| 36 | AxiomQL Core | ⚠️ Planned | Alternative read query language over the same AST/executor |
| 37 | AxiomQL Write + DDL + Control | ⚠️ Planned | AxiomQL DML, DDL, control flow, maintenance |
| 38 | AxiomDB-Wasm | ⚠️ Planned | Browser runtime, OPFS backend, sync, live queries |
| 39 | Clustered index storage engine | 🔄 In progress | Inline PK rows, clustered internal/leaf pages, PK bookmarks in secondary indexes, logical clustered WAL/rollback, clustered crash recovery, clustered-aware CREATE TABLE |
Completed Phases — Summary
Phase 1 — Storage Engine
A generic storage layer with two implementations: MmapStorage for production disk
use and MemoryStorage for tests. Every higher-level component uses only the
StorageEngine trait — storage is pluggable. Pages are 16 KB with a 64-byte header
(magic, page type, CRC32c checksum, page_id, LSN, free pointers). Heap pages use a
slotted format: slots grow from the start, tuples grow from the end toward the center.
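The slotted layout can be sketched as follows. The page and header sizes match the description above, but the struct, slot width, and method names are illustrative rather than the real on-disk format:

```rust
// Minimal slotted heap page: fixed header, slot entries growing from the
// front, tuple bytes growing from the back toward the center.

const PAGE_SIZE: usize = 16 * 1024;
const HEADER_SIZE: usize = 64;
const SLOT_SIZE: usize = 4; // u16 offset + u16 length (illustrative)

struct HeapPage {
    buf: [u8; PAGE_SIZE],
    slot_count: u16,
    free_end: usize, // tuples grow downward from here
}

impl HeapPage {
    fn new() -> Self {
        Self { buf: [0; PAGE_SIZE], slot_count: 0, free_end: PAGE_SIZE }
    }

    fn free_space(&self) -> usize {
        self.free_end - (HEADER_SIZE + self.slot_count as usize * SLOT_SIZE)
    }

    /// Append a tuple at the back, record its (offset, len) in the next slot.
    fn insert(&mut self, tuple: &[u8]) -> Option<u16> {
        if self.free_space() < tuple.len() + SLOT_SIZE {
            return None; // page full
        }
        self.free_end -= tuple.len();
        self.buf[self.free_end..self.free_end + tuple.len()].copy_from_slice(tuple);
        let off = HEADER_SIZE + self.slot_count as usize * SLOT_SIZE;
        self.buf[off..off + 2].copy_from_slice(&(self.free_end as u16).to_le_bytes());
        self.buf[off + 2..off + 4].copy_from_slice(&(tuple.len() as u16).to_le_bytes());
        self.slot_count += 1;
        Some(self.slot_count - 1)
    }

    fn get(&self, slot: u16) -> &[u8] {
        let off = HEADER_SIZE + slot as usize * SLOT_SIZE;
        let pos = u16::from_le_bytes([self.buf[off], self.buf[off + 1]]) as usize;
        let len = u16::from_le_bytes([self.buf[off + 2], self.buf[off + 3]]) as usize;
        &self.buf[pos..pos + len]
    }
}

fn main() {
    let mut page = HeapPage::new();
    let s0 = page.insert(b"alice").unwrap();
    let s1 = page.insert(b"bob").unwrap();
    assert_eq!(page.get(s0), b"alice");
    assert_eq!(page.get(s1), b"bob");
}
```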
Phase 2 — B+ Tree CoW
A persistent, Copy-on-Write B+ Tree over StorageEngine. Keys up to 64 bytes;
ORDER_INTERNAL = 223, ORDER_LEAF = 217 (derived to fill exactly one 16 KB page).
Root is an AtomicU64 — readers are lock-free by design. Supports insert (with
recursive split), delete (with rebalance/redistribute/merge), and range scan via
RangeIter. Prefix compression for internal nodes in memory.
Phase 3 — WAL and Transactions ✅ 100% complete
Append-only Write-Ahead Log with binary entries, CRC32c checksums, and forward/backward
scan iterators. TxnManager coordinates BEGIN/COMMIT/ROLLBACK with snapshot assignment.
Five-step checkpoint protocol. Crash recovery state machine (five states). Catalog
bootstrap creates the three system tables on first open. CatalogReader provides
MVCC-consistent schema reads. Nine crash scenario tests with a post-recovery integrity
checker.
Phase 3 late additions (3.17–3.19):
- 3.17 WAL batch append — `record_insert_batch()` uses `WalWriter::reserve_lsns(N)` + `write_batch()` to write N Insert WAL entries in a single `write_all` call. Reduces BufWriter overhead from O(N rows) to O(1) for bulk inserts.
- 3.18 WAL PageWrite — `EntryType::PageWrite = 9`. One WAL entry per affected heap page instead of one per row. `new_value` holds the post-modification page bytes (16 KB) plus embedded slot IDs for crash recovery undo. For a 10K-row bulk insert: 42 WAL entries instead of 10,000 — 238× fewer serializations and a 30% smaller WAL file.
- 3.19 WAL Group Commit — `CommitCoordinator` batches DML commits from concurrent connections. DML commits write to the WAL BufWriter, register with the coordinator, and release the Database lock before awaiting fsync confirmation. A background Tokio task performs one `flush` + `fsync` per batch window (`group_commit_interval_ms`), then notifies all waiting connections. Enables near-linear concurrent write scaling.
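Stripped of the async machinery, the group-commit idea can be sketched as below. The real `CommitCoordinator` coordinates Tokio tasks across connections; this toy single-threaded version just shows that one fsync durably covers every commit LSN in the batch window:

```rust
// Sketch: commits register their LSN; one flush+fsync per batch window
// makes every registered LSN durable at once. Names are illustrative.

struct Coordinator {
    durable_lsn: u64,
    pending: Vec<u64>, // commit LSNs waiting for fsync
    fsync_count: u64,
}

impl Coordinator {
    fn new() -> Self {
        Self { durable_lsn: 0, pending: Vec::new(), fsync_count: 0 }
    }

    /// A connection registers its commit LSN, then would await notification.
    fn register(&mut self, commit_lsn: u64) {
        self.pending.push(commit_lsn);
    }

    /// Background task: one fsync covers the whole batch window.
    fn flush_batch(&mut self) {
        if let Some(&max) = self.pending.iter().max() {
            self.fsync_count += 1; // one flush + fdatasync regardless of batch size
            self.durable_lsn = self.durable_lsn.max(max);
            self.pending.clear(); // all waiters notified: their LSNs are durable
        }
    }

    fn is_durable(&self, lsn: u64) -> bool {
        lsn <= self.durable_lsn
    }
}

fn main() {
    let mut gc = Coordinator::new();
    for lsn in [101, 102, 103, 104] {
        gc.register(lsn); // four concurrent connections commit in one window
    }
    gc.flush_batch();
    assert_eq!(gc.fsync_count, 1); // 4 commits, 1 fsync
    assert!(gc.is_durable(104));
}
```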
Phase 4 — SQL Processing
SQL AST covering all DML (SELECT, INSERT, UPDATE, DELETE) and DDL (CREATE/DROP/ALTER
TABLE, CREATE/DROP INDEX). logos-based lexer with ~85 tokens, case-insensitive keywords,
zero-copy identifiers. Recursive descent parser with full expression precedence. Expression
evaluator with three-valued NULL logic (AND, OR, NOT, IS NULL, BETWEEN, LIKE, IN).
Semantic analyzer with BindContext, qualified/unqualified column resolution, ambiguity
detection, and subquery support. Row codec with null bitmap, u24 string lengths, and
O(n) encoded_len().
Near-Term Priorities
Phase 13 — Row-Level Writer Concurrency
The current implementation uses Arc<tokio::sync::RwLock<Database>>: reads can
overlap, but mutating statements are still serialized at whole-database scope.
Phase 13.7 removes that bottleneck with row-level locking. Phase 13.8 adds
deadlock detection, and 13.8b adds SELECT ... FOR UPDATE, NOWAIT, and
SKIP LOCKED.
Phase 5
Phase 5 is now complete. The last close was:
- 5.15 DSN parsing — AxiomDB-owned surfaces now accept typed DSNs: `AXIOMDB_URL` for server bootstrap plus `Db::open_dsn`, `AsyncDb::open_dsn`, and `axiomdb_open_dsn` for embedded mode. `mysql://` and `postgres://` are parse aliases only; the server still speaks MySQL wire only, and embedded mode still accepts only local-path DSNs.
Phase 5 also closed the recent runtime/perf subphases:
- 5.11c Explicit connection state machine — the MySQL server now has an explicit `CONNECTED → AUTH → IDLE → EXECUTING → CLOSING` transport lifecycle with a fixed auth timeout, `wait_timeout` vs `interactive_timeout` behavior, `net_write_timeout` for packet writes, and socket keepalive configured separately from SQL session state.
- 5.19a Executor decomposition — the SQL executor now lives in a responsibility-based `executor/` module tree instead of one monolithic file, which lowers the cost of later DML and planner work.
- 5.19 B+Tree batch delete — DELETE WHERE and the old-key half of UPDATE now stage exact encoded keys per index and remove them with one ordered `delete_many_in(...)` pass per tree instead of one `delete_in(...)` traversal per row.
- 5.19b Eval decomposition — the expression evaluator now lives under a responsibility-based `eval/` module tree with the same public API, which lowers the cost of future built-in and collation work without changing SQL behavior.
- 5.20 Stable-RID UPDATE fast path — UPDATE can now rewrite rows in the same heap slot when the new encoded row fits, preserve the `RecordId`, and skip unnecessary index maintenance for indexes whose logical key membership is unchanged.
- 5.21 Transactional INSERT staging — explicit transactions now buffer consecutive `INSERT ... VALUES` statements per table and flush them together on `COMMIT` or the next barrier statement, preserving savepoint semantics by flushing before the next statement savepoint whenever the batch cannot continue.
Phase 6 closing note — Integrity and recovery
Phase 6 now closes with startup index integrity verification:
- every catalog-visible index is compared against heap-visible rows after WAL recovery
- readable divergence is repaired automatically from heap contents
- unreadable index trees fail open with `IndexIntegrityFailure`
SQL REINDEX remains deferred to the later diagnostics / administration phases.
Phase 6 closing note — Indexed multi-row INSERT on indexed tables
Phase 6 also closes the remaining immediate multi-row VALUES debt on indexed tables:
- shared batch-apply helpers are now reused by both the `5.21` staging flushes and the immediate `INSERT ... VALUES (...), (... )` path
- PRIMARY KEY and secondary indexes no longer force a per-row fallback for multi-row VALUES statements
- same-statement UNIQUE detection remains strict because the immediate path does not reuse the staged `committed_empty` shortcut
- Index range scan — range predicate via `RangeIter`.
- Projection — evaluate SELECT expressions over rows from the scan.
- Filter — apply WHERE expression using the evaluator from Phase 4.17.
- Nested loop join — INNER JOIN, LEFT JOIN.
- Sort — ORDER BY with NULLS FIRST/LAST.
- Limit/Offset — LIMIT n OFFSET m.
- Hash aggregate — GROUP BY with COUNT, SUM, AVG, MIN, MAX.
- INSERT / UPDATE / DELETE — write path with WAL integration.
The executor will be a simple volcano-model interpreter in Phase 5. Vectorized execution (morsel-driven, SIMD) is planned for Phase 14.
AxiomQL — Alternative Query Language (Phases 36-37)
AxiomDB will support two query languages sharing one AST and executor:
SQL stays as the primary language with full wire protocol compatibility. Every ORM, client, and tool works without changes.
AxiomQL is an optional method-chain alternative designed to be learned in
minutes by any developer who already uses .filter().sort().take() in JavaScript,
Python, Rust, or C#:
users
.filter(active, age > 18)
.join(orders)
.group(country, total: count())
.sort(total.desc)
.take(10)
Both languages compile to the same Stmt AST — zero executor overhead, every SQL
feature automatically available in AxiomQL. Planned after Phase 8 (wire protocol).
| Phase | Scope |
|---|---|
| 36 | AxiomQL parser: SELECT, filter, join, group, subqueries, let bindings |
| 37 | AxiomQL write + DDL: insert, update, delete, create, transaction, proc |
Benchmarks
All benchmarks run on Apple M2 Pro (12 cores), 32 GB RAM, NVMe SSD, single-threaded, warm data (all pages in OS page cache unless noted). Criterion.rs is used for all micro-benchmarks; each measurement is the mean of at least 100 samples.
Reference values for MySQL 8 and PostgreSQL 15 are measured in-process (no network), without WAL for pure codec/parser operations. Operations that include WAL (INSERT, UPDATE) are directly comparable.
SQL Parser
| Benchmark | AxiomDB | sqlparser-rs | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|---|
| Simple SELECT (1 table) | 492 ns | 4.8 µs | ~500 ns | ~450 ns | ✅ parity with PG |
| Complex SELECT (multi-JOIN) | 2.7 µs | 46 µs | ~4.0 µs | ~3.5 µs | ✅ 1.3× faster than PG |
| CREATE TABLE | 1.1 µs | 14.5 µs | ~2.5 µs | ~2.0 µs | ✅ 1.8× faster than PG |
| Batch (100 statements) | 47 µs | — | ~90 µs | ~75 µs | ✅ 1.6× faster than PG |
vs sqlparser-rs: 9.8× faster on simple SELECT, 17× faster on complex SELECT.
The speed advantage comes from two decisions:
- logos DFA lexer — compiles token patterns to a Deterministic Finite Automaton at build time. Scanning runs in O(n) time with 1–3 CPU instructions per byte.
- Zero-copy tokens — `Ident` tokens are `&'src str` slices into the original input. No heap allocation occurs during lexing or AST construction.
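A hand-rolled illustration of the zero-copy idea — this is a stand-in for the logos-generated DFA, not the real lexer:

```rust
// Identifier tokens borrow &str slices from the input instead of allocating.

#[derive(Debug, PartialEq)]
enum Token<'src> {
    Select,
    Ident(&'src str), // borrowed slice — no heap allocation
}

fn lex(input: &str) -> Vec<Token<'_>> {
    input
        .split_whitespace()
        .map(|word| {
            if word.eq_ignore_ascii_case("SELECT") {
                Token::Select // case-insensitive keyword
            } else {
                Token::Ident(word)
            }
        })
        .collect()
}

fn main() {
    let sql = "select user_id";
    let tokens = lex(sql);
    assert_eq!(tokens, vec![Token::Select, Token::Ident("user_id")]);
}
```

The real lexer scans byte-by-byte with a build-time DFA rather than splitting on whitespace, but the ownership story is the same: token lifetimes are tied to the input string, so lexing allocates nothing.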
B+ Tree Index
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| Point lookup (1M rows) | 1.2M ops/s | ~830K ops/s | ~1.1M ops/s | 800K ops/s | 600K ops/s | ✅ |
| Range scan 10K rows | 0.61 ms | ~8 ms | ~5 ms | 45 ms | 60 ms | ✅ |
| Insert (sequential keys) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |
| Sequential scan 1M rows | 0.72 s | ~0.8 s | ~0.5 s | 0.8 s | 1.2 s | ✅ |
| Concurrent reads ×16 | linear | ~2× degradation | ~1.5× degradation | linear | <2× degradation | ✅ |
Why point lookup is fast: the CoW B+ Tree root is an AtomicU64. Readers load it
with Acquire and traverse 3–4 levels of 16 KB pages that are already in the OS page
cache. No mutex, no RWLock.
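A minimal sketch of that reader pattern (not the real tree code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// The root page id lives in an AtomicU64. Readers Acquire-load it and
// traverse an immutable tree version; a CoW writer installs a new root
// with a Release store after all new pages are written.

struct Tree {
    root: AtomicU64,
}

impl Tree {
    fn read_root(&self) -> u64 {
        // No mutex, no RwLock: one atomic load pins a consistent tree version.
        self.root.load(Ordering::Acquire)
    }

    fn publish_new_root(&self, new_root_page: u64) {
        // Writer side: CoW pages are durable first, then the root flips.
        self.root.store(new_root_page, Ordering::Release)
    }
}

fn main() {
    let tree = Arc::new(Tree { root: AtomicU64::new(7) });
    let reader = Arc::clone(&tree);
    assert_eq!(reader.read_root(), 7);
    tree.publish_new_root(42); // a CoW split produced a new root page
    assert_eq!(reader.read_root(), 42);
}
```

The Release store pairs with the Acquire load: any reader that sees the new root id is guaranteed to see the page contents written before it was published.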
Why range scan is very fast: RangeIter re-traverses from the root to locate
each successive leaf after exhausting the current one. With CoW, next_leaf pointers
cannot be maintained consistently (a split copies the leaf, leaving the previous leaf’s
pointer stale). Tree retraversal costs O(log n) per leaf boundary crossing — at 3–4
levels deep this is 3–5 page reads, all already in the OS page cache for sequential
workloads. The deferred next_leaf fast path (Phase 7) will reduce this to O(1) per
boundary once epoch-based reclamation is available.
SELECT ... WHERE pk = literal after 6.16
Phase 6.16 fixes the planner gap that still prevented single-table SELECT
from using the PRIMARY KEY B+Tree. The executor already supported IndexLookup
and IndexRange; the missing piece was planner eligibility plus a forced path
for PK equality.
Measured with:
python3 benches/comparison/local_bench.py --scenario select_pk --rows 5000 --table
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| `SELECT * FROM bench_users WHERE id = literal` | 12.7K lookups/s | 13.4K lookups/s | 11.1K lookups/s |
This closes the old “full scan on PK lookup” debt. The remaining gap is no longer planner-side; it is now in SQL/wire overhead after the PK B+Tree path is already active.
Row Codec
| Benchmark | Throughput | Notes |
|---|---|---|
| `encode_row` | 33M rows/s | 5-column mixed-type row |
| `decode_row` | 28M rows/s | Same layout |
| `encoded_len` | O(n), no alloc | Size computation without buffer allocation |
The codec encodes a null bitmap (1 bit per column, packed into bytes) followed by the column payloads in declaration order. Variable-length types use a 3-byte (u24) length prefix. Fixed-size types (integers, floats, DATE, TIMESTAMP, UUID) have no length prefix.
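A simplified sketch of this layout, reduced to string-or-null columns (the real codec covers all column types and the fixed-size no-prefix cases):

```rust
// Null bitmap (1 bit per column, packed), then payloads in declaration
// order; variable-length values get a 3-byte (u24) little-endian prefix.

fn encode_row(cols: &[Option<&str>]) -> Vec<u8> {
    let mut out = Vec::new();
    // Null bitmap: bit i set ⇒ column i is NULL.
    let mut bitmap = vec![0u8; (cols.len() + 7) / 8];
    for (i, c) in cols.iter().enumerate() {
        if c.is_none() {
            bitmap[i / 8] |= 1 << (i % 8);
        }
    }
    out.extend_from_slice(&bitmap);
    // Payloads: u24 length prefix + bytes for each non-null column.
    for c in cols.iter().flatten() {
        let len = c.len() as u32;
        out.extend_from_slice(&len.to_le_bytes()[..3]); // u24 prefix
        out.extend_from_slice(c.as_bytes());
    }
    out
}

/// Size without building the buffer — the encoded_len() idea.
fn encoded_len(cols: &[Option<&str>]) -> usize {
    (cols.len() + 7) / 8 + cols.iter().flatten().map(|c| 3 + c.len()).sum::<usize>()
}

fn main() {
    let row = [Some("alice"), None, Some("admin")];
    let bytes = encode_row(&row);
    assert_eq!(bytes.len(), encoded_len(&row));
    assert_eq!(bytes[0], 0b0000_0010); // only column 1 is NULL
}
```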
Expression Evaluator
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Verdict |
|---|---|---|---|---|
| Expr eval over 1K rows | 14.8M rows/s | ~8M rows/s | ~6M rows/s | ✅ 1.9× faster than MySQL |
The evaluator is a recursive interpreter over the Expr enum. Speed comes from
inlining the hot path (column reads, arithmetic, comparisons) and from the fact
that col_idx is resolved once by the semantic analyzer — no name lookup at eval time.
Performance Budget
The following thresholds are enforced before any phase is closed. A result below the “Max acceptable” column is a blocker.
| Operation | AxiomDB | Target | Max acceptable | Phase measured |
|---|---|---|---|---|
| Point lookup PK | 1.2M ops/s ✅ | 800K ops/s | 600K ops/s | 2 |
| Range scan 10K rows | 0.61 ms ✅ | 45 ms | 60 ms | 2 |
| B+ Tree INSERT (storage only) | 195K ops/s ✅ | 180K ops/s | 150K ops/s | 3 |
| INSERT end-to-end 10K batch (SchemaCache) | 36K ops/s ⚠️ | 180K ops/s | 150K ops/s | 4.16b |
| SELECT via wire protocol (autocommit) | 185 q/s ✅ | — | — | 5.14 |
| INSERT via wire protocol (autocommit) | 58 q/s | — | — | 5.14 |
| Sequential scan 1M rows | 0.72 s ✅ | 0.8 s | 1.2 s | 2 |
| Concurrent reads ×16 | linear ✅ | linear | <2× degradation | 2 |
| Parser — simple SELECT | 492 ns ✅ | 600 ns | 1 µs | 4 |
| Parser — complex SELECT | 2.7 µs ✅ | 3 µs | 6 µs | 4 |
| Row codec encode | 33M rows/s ✅ | — | — | 4 |
| Expr eval (scan 1K rows) | 14.8M rows/s ✅ | — | — | 4 |
Executor end-to-end (Phase 4.16b, MmapStorage + real WAL, full pipeline)
Measured with cargo bench --bench executor_e2e -p axiomdb-sql (Apple M2 Pro, NVMe,
release build). Pipeline: parse → analyze → execute → WAL → MmapStorage.
| Configuration | AxiomDB | Target (Phase 8) | Notes |
|---|---|---|---|
| INSERT 100 rows / 1 txn (no SchemaCache) | 2.8K ops/s | — | cold path, catalog scan |
| INSERT 1K rows / 1 txn (no SchemaCache) | 18.5K ops/s | — | amortization starts |
| INSERT 1K rows / 1 txn (SchemaCache) | 20.6K ops/s | — | +8% vs no cache |
| INSERT 10K rows / 1 txn (SchemaCache) | 36K ops/s | 180K ops/s | ⚠️ WAL bottleneck |
| INSERT autocommit (1 fsync/row) | 58 q/s | — | 1 fdatasync per statement (wire protocol, Phase 5.14) |
Root cause — WAL record_insert() dominates: each row write costs ~20 µs inside
record_insert() even without fsync. Parse + analyze cost per INSERT is ~1.5 µs total;
SchemaCache eliminates catalog heap scans but only improves throughput by 8% because WAL
overhead is already the dominant term. The 180K ops/s target is a Phase 8 goal: prepared
statements skip parse and analyze entirely, and a batch insert API will write one WAL entry
per batch rather than one per row.
Each inserted row currently produces its own WAL entry via record_insert(). This makes recovery straightforward — each row is an
independent, self-contained undo/redo unit — but costs ~20 µs/row at the WAL layer
regardless of fsync. The 36K ops/s ceiling at 10K batch size is a direct consequence of
this design. PostgreSQL and MySQL both offer bulk-load paths (COPY, LOAD DATA) that bypass
per-row WAL overhead; AxiomDB's equivalent is the Phase 8 batch insert API, which will
coalesce WAL entries and write them in a single sequential append.
B+ Tree storage-only INSERT (no SQL parsing, no WAL):
| Operation | AxiomDB | MySQL ~ | PostgreSQL ~ | Target | Max acceptable | Verdict |
|---|---|---|---|---|---|---|
| B+Tree INSERT (storage only) | 195K ops/s | ~150K ops/s | ~120K ops/s | 180K ops/s | 150K ops/s | ✅ |
The storage layer itself exceeds the 180K ops/s target. The gap between 195K (storage only) and 36K (full pipeline) isolates the overhead to the WAL record path, not the B+ Tree or the page allocator.
Run end-to-end benchmarks:
cargo bench --bench executor_e2e -p axiomdb-sql
# MySQL + PostgreSQL comparison (requires Docker):
./benches/comparison/setup.sh
python3 benches/comparison/bench_runner.py --rows 10000
./benches/comparison/teardown.sh
Phase 5.14 — Wire Protocol Throughput
Measured via the MySQL wire protocol (pymysql client, autocommit mode, 1 connection, localhost, Apple M2 Pro, NVMe SSD).
| Benchmark | AxiomDB | MySQL ~ | PostgreSQL ~ | Notes |
|---|---|---|---|---|
| COM_PING | 24,865/s | ~30K/s | ~25K/s | Pure protocol, no SQL engine |
| SET NAMES (intercepted) | 46,672/s | ~20K/s | — | Handled in protocol layer |
| SELECT 1 (autocommit) | 185 q/s | ~5K–15K q/s* | ~5K–12K q/s* | Full pipeline, read-only |
| INSERT (autocommit, 1 fsync/stmt) | 58 q/s | ~130–200 q/s* | ~100–160 q/s* | Full pipeline + fsync |
*MySQL/PostgreSQL figures are in-process estimates without network latency overhead. AxiomDB throughput measured over localhost with real round-trips; the gap reflects the current single-threaded autocommit path and will improve with Phase 5.13 plan cache and Phase 8 batch API.
Phase 5.14 fix — read-only WAL fsync eliminated:
Prior to Phase 5.14, every autocommit transaction called fdatasync on WAL commit,
including read-only queries such as SELECT. This cost 10–20 ms per SELECT, capping
throughput at ~56 q/s.
The fix: skip fdatasync (and the WAL flush) when the transaction has no DML operations
(undo_ops.is_empty()). Read-only transactions still flush buffered writes to the OS
(BufWriter::flush) so that concurrent readers see committed state, but they do not
wait for the fdatasync round-trip to persistent storage.
Before / after:
| Query | Before (5.13) | After (5.14) | Improvement |
|---|---|---|---|
| SELECT 1 (autocommit) | ~56 q/s | 185 q/s | 3.3× |
| INSERT (autocommit) | ~58 q/s | 58 q/s | no change (fsync required) |
A commit is acknowledged only after fdatasync. For DML transactions
this is correct — data must reach persistent storage before the client receives OK.
For read-only transactions there is nothing to persist: the transaction produced no WAL
records. Skipping fdatasync for undo_ops.is_empty() transactions
is therefore safe: crash recovery cannot lose data that was never written. PostgreSQL applies
the same principle — read-only transactions in PostgreSQL do not touch the WAL at all.
The OS-level flush (BufWriter::flush) is kept so that any WAL bytes written by
a concurrent writer are visible to the OS before the SELECT returns, preserving read-after-write
consistency within the same process.
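The decision reduces to one branch. This sketch mirrors the `undo_ops` check described above, with the actual I/O calls simulated as enum variants:

```rust
// Flush buffered WAL bytes always; pay for fdatasync only when the
// transaction actually produced WAL records.

struct Txn {
    undo_ops: Vec<String>, // empty ⇒ read-only transaction
}

#[derive(Debug, PartialEq)]
enum CommitIo {
    FlushOnly,     // BufWriter::flush — OS-visible, no durability wait
    FlushAndFsync, // flush + fdatasync — durable before the OK packet
}

fn commit_io(txn: &Txn) -> CommitIo {
    if txn.undo_ops.is_empty() {
        // Nothing was written: crash recovery cannot lose what never existed.
        CommitIo::FlushOnly
    } else {
        CommitIo::FlushAndFsync
    }
}

fn main() {
    let select = Txn { undo_ops: vec![] };
    let insert = Txn { undo_ops: vec!["Insert{page: 3, slot: 1}".into()] };
    assert_eq!(commit_io(&select), CommitIo::FlushOnly);
    assert_eq!(commit_io(&insert), CommitIo::FlushAndFsync);
}
```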
Bottleneck analysis:
- SELECT 185 q/s: each `COM_QUERY` runs a full parse + analyze cycle (~1.5 µs) plus one wire protocol round-trip (~40 µs on localhost). The dominant cost is the round-trip. For prepared statements (`COM_STMT_EXECUTE`), the Phase 5.13 plan cache eliminates the parse/analyze step entirely — the cached AST is reused and only a ~1 µs parameter substitution pass runs before execution. The remaining bottleneck for higher throughput is WAL transaction overhead per statement (BEGIN/COMMIT I/O); this will be addressed by Phase 6 indexed reads (eliminating full-table scans) and the Phase 8 batch API.
- INSERT 58 q/s: one `fdatasync` per autocommit statement is required for durability.
Phase 5.21 — Transactional INSERT staging
Measured with python3 benches/comparison/local_bench.py --scenario insert --rows 50000 --table
against a release AxiomDB server and local MariaDB/MySQL instances on the same
machine. Workload: 50,000 separate one-row INSERT statements inside one
explicit transaction.
| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
| `insert` (single-row INSERTs in 1 txn) | 28.0K rows/s | 26.7K rows/s | 23.9K rows/s | one BEGIN, 50K INSERT statements, one COMMIT |
What changed in 5.21:
- the session now buffers consecutive eligible `INSERT ... VALUES` rows for the same table instead of writing heap/WAL immediately
- barriers such as `SELECT`, `UPDATE`, `DELETE`, DDL, `COMMIT`, a table switch, or ineligible INSERT shapes force a flush
- the flush uses `insert_rows_batch_with_ctx(...)` plus grouped post-heap index maintenance, persisting each changed index root once per flush
This resembles PostgreSQL's `heap_multi_insert()` and DuckDB's appender, but keeps SQL semantics intact by flushing before the next statement savepoint whenever the batch cannot continue.
This is deliberately not the same as autocommit group commit. The benchmark
already uses one explicit transaction, so 5.21 attacks per-statement heap/WAL/index
work rather than fsync batching across multiple commits.
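The staging rule can be sketched as a small per-session state machine; the statement shapes and names below are simplified for illustration:

```rust
// Consecutive eligible single-table INSERTs buffer in the session;
// any barrier statement (or a table switch) flushes the batch first.

enum Stmt {
    InsertValues { table: String, row: Vec<i64> },
    Other(String), // SELECT / UPDATE / DDL / COMMIT — all barriers here
}

#[derive(Default)]
struct Session {
    staged_table: Option<String>,
    staged_rows: Vec<Vec<i64>>,
    flushes: Vec<usize>, // rows per batch flush, for inspection
}

impl Session {
    fn flush(&mut self) {
        if !self.staged_rows.is_empty() {
            // Real engine: insert_rows_batch_with_ctx + grouped index maintenance.
            self.flushes.push(self.staged_rows.len());
            self.staged_rows.clear();
            self.staged_table = None;
        }
    }

    fn execute(&mut self, stmt: Stmt) {
        match stmt {
            Stmt::InsertValues { table, row } => {
                if self.staged_table.as_deref() != Some(table.as_str()) {
                    self.flush(); // table switch is a barrier too
                    self.staged_table = Some(table);
                }
                self.staged_rows.push(row);
            }
            Stmt::Other(_) => self.flush(), // barrier: flush before running it
        }
    }
}

fn main() {
    let mut s = Session::default();
    for i in 0..3 {
        s.execute(Stmt::InsertValues { table: "users".into(), row: vec![i] });
    }
    s.execute(Stmt::Other("COMMIT".into()));
    assert_eq!(s.flushes, vec![3]); // three INSERTs became one batched flush
}
```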
Phase 6.19 — WAL fsync pipeline
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_autocommit --rows 1000 --table --engines axiomdb
Workload: one INSERT per transaction over the MySQL wire.
| Benchmark | AxiomDB | Target | Status |
|---|---|---|---|
| `insert_autocommit` | 224 ops/s | >= 5,000 ops/s | ❌ |
What changed in 6.19:
- the old timer-based `CommitCoordinator` and its config knobs were removed
- server DML commits now hand deferred durability to an always-on leader-based `FsyncPipeline`
- queued followers can piggyback on a leader fsync when their `commit_lsn` is already covered
What the benchmark taught us:
- the implementation is correct and wire-visible semantics remain intact
- but the target workload is sequential request/response autocommit
- the handler still waits for durability before it sends `OK`
- therefore the next statement cannot arrive while the current fsync is in flight, so single-connection piggyback never materializes
6.19 is closed as an implementation subphase, but this benchmark remains a
documented performance gap rather than a solved target.
Phase 6.18 — Indexed multi-row INSERT batch path
Measured with:
python3 benches/comparison/local_bench.py --scenario insert_multi_values --rows 5000 --table
Workload: multi-row INSERT ... VALUES (...), (... ) statements against the
benchmark schema with PRIMARY KEY (id).
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB |
|---|---|---|---|
| `insert_multi_values` on PK table | 160,581 rows/s | 259,854 rows/s | 321,002 rows/s |
What changed in 6.18:
- the immediate multi-row `VALUES` path no longer checks `secondary_indexes.is_empty()` before using grouped heap writes
- grouped heap/index apply was extracted into shared helpers reused by both:
  - the transactional staging flush from `5.21`
  - the immediate `INSERT ... VALUES (...), (... )` path
- the immediate path keeps strict UNIQUE semantics by not reusing the staged `committed_empty` shortcut, because same-statement duplicate keys must still fail without leaking partial rows
PostgreSQL's `heap_multi_insert()` and DuckDB's appender both separate row staging from physical write. AxiomDB borrows the grouped physical apply idea, but rejects a blind bulk-load shortcut on the immediate path: duplicate keys inside one SQL statement must still be rejected before any partial batch becomes visible.
Phase 6.20 — UPDATE apply fast path
Measured with python3 benches/comparison/local_bench.py --scenario update_range --rows 5000 --table
against a release AxiomDB server and local MariaDB/MySQL instances on the same
machine. Workload: UPDATE bench_users SET score = score + 1 WHERE id BETWEEN ...
on a PK-indexed table.
| Benchmark | MariaDB 12.1 | MySQL 8.0 | AxiomDB | Notes |
|---|---|---|---|---|
| `update_range` | 618K rows/s | 291K rows/s | 369.9K rows/s | PK range UPDATE now stays on a batched read/apply path end-to-end |
What changed in 6.20:
- `IndexLookup`/`IndexRange` candidate rows are fetched through `read_rows_batch(...)` instead of one heap read per RID
- no-op UPDATE rows are filtered before heap/index mutation
- stable-RID rows batch the `UpdateInPlace` WAL append with `reserve_lsns(...) + write_batch(...)`
- UPDATE index maintenance now uses grouped delete+insert with one root persistence write per affected index
- both ctx and non-ctx UPDATE paths share a statement-level index bailout
This closes the dominant apply-side debt left behind after 6.17. The benchmark
improves by 4.3× over the 6.17 result (85.2K rows/s) and now beats the
documented local MySQL result on the same workload.
Phase 5.19 / 5.20 — DELETE WHERE and UPDATE Write Paths
Measured with python3 benches/comparison/local_bench.py --scenario all --rows 50000 --table
on the same Apple M2 Pro machine. The benchmark uses the MySQL wire protocol and a
bench_users table with PRIMARY KEY (id).
| Operation | MariaDB 12.1 | MySQL 8.0 | AxiomDB | PostgreSQL 16 |
|---|---|---|---|---|
| `DELETE WHERE id > 25000` | 652K rows/s | 662K rows/s | 1.13M rows/s | 3.76M rows/s |
| `UPDATE ... WHERE active = TRUE` | 662K rows/s | 404K rows/s | 648K rows/s | 270K rows/s |
5.19 removed the old per-row delete_in(...) loop by batching exact encoded keys
per index through delete_many_in(...). 5.20 finished the UPDATE recovery by
preserving the original RID whenever the rewritten row still fits in the same slot.
For UPDATE, the before/after delta is the important signal:
- Post-`5.19` / pre-`5.20`: 52.9K rows/s
- Post-`5.20`: 648K rows/s
That is a ~12.2× improvement on the same workload.
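The 5.20 stable-RID decision reduces to a single size check per row. The sketch below uses hypothetical names (`choose_update_path`, `slot_capacity`); the real engine's slot layout differs, but the branch is the crux: an in-place rewrite keeps the RID and spares every secondary index.

```rust
/// Where a rewritten row can land. Names are illustrative.
#[derive(Debug, PartialEq)]
enum UpdatePath {
    InPlace,          // same RID: secondary index entries stay valid
    DeleteThenInsert, // new RID: every index entry must be rewritten
}

/// Preserve the original RID whenever the rewritten row still fits
/// in the slot it already occupies.
fn choose_update_path(slot_capacity: usize, new_row_len: usize) -> UpdatePath {
    if new_row_len <= slot_capacity {
        UpdatePath::InPlace
    } else {
        UpdatePath::DeleteThenInsert
    }
}

fn main() {
    // A widened row that still fits keeps its RID and skips index churn.
    assert_eq!(choose_update_path(128, 96), UpdatePath::InPlace);
    // A row that outgrows its slot must move, invalidating its RID.
    assert_eq!(choose_update_path(128, 200), UpdatePath::DeleteThenInsert);
}
```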
Phase 5.13 — Prepared Statement Plan Cache
Phase 5.13 introduces an AST-level plan cache for prepared statements. The full parse +
analyze pipeline runs once at COM_STMT_PREPARE time; each subsequent
COM_STMT_EXECUTE performs only a tree walk to substitute parameter values (~1 µs)
and then calls execute_stmt() directly.
| Path | Parse + Analyze | Param substitution | Total SQL overhead |
|---|---|---|---|
| COM_QUERY (text protocol) | ~1.5 µs per call | — | ~1.5 µs |
| COM_STMT_EXECUTE before 5.13 | ~1.5 µs per call (re-parse) | string replace | ~1.5 µs |
| COM_STMT_EXECUTE after 5.13 | 0 (cached) | ~1 µs AST walk | ~1 µs |
The ~0.5 µs saving per execute is meaningful for high-frequency statement patterns (e.g., ORM-generated queries that re-execute the same SELECT or INSERT with different parameters on every request).
Remaining bottleneck: the dominant cost per COM_STMT_EXECUTE is now the WAL
transaction overhead (BEGIN/COMMIT I/O) rather than parse/analyze. For read-only
prepared statements, Phase 6 indexed reads will eliminate full-table scans, reducing
the per-query execution cost. For write statements, the Phase 8 batch API will coalesce
WAL entries, targeting the 180K ops/s budget.
The cache stores the fully analyzed `Stmt` (AST with resolved column indices)
rather than the original SQL string. This means each execute avoids both lexing and
semantic analysis, not just parsing. The trade-off is that the cached AST must be
cloned before parameter substitution to avoid mutating shared state — a shallow clone
of the expression tree is ~200 ns, well below the ~1.5 µs that parse + analyze would
cost. MySQL and PostgreSQL cache parsed + planned query trees for the same reason.
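A minimal sketch of the clone-then-substitute pattern, assuming toy `Expr`/`Stmt` types — the real AST is far richer, and the ~200 ns clone figure above applies to it, not to this toy:

```rust
use std::collections::HashMap;

/// Toy stand-ins for the analyzed AST. Real AxiomDB types differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Expr {
    Param(u16), // placeholder to be filled at execute time
    Int(i64),   // literal after substitution
    Col(usize), // resolved column index (unused in this tiny demo)
}

#[derive(Clone)]
struct Stmt {
    filter: Expr,
}

/// Cache keyed by statement id, as COM_STMT_PREPARE would assign.
struct PlanCache {
    plans: HashMap<u32, Stmt>,
}

impl PlanCache {
    fn prepare(&mut self, id: u32, stmt: Stmt) {
        self.plans.insert(id, stmt);
    }

    /// Clone the cached AST, then substitute parameters in the clone
    /// so the shared plan is never mutated.
    fn execute(&self, id: u32, params: &[i64]) -> Option<Stmt> {
        let mut stmt = self.plans.get(&id)?.clone();
        substitute(&mut stmt.filter, params);
        Some(stmt)
    }
}

fn substitute(e: &mut Expr, params: &[i64]) {
    if let Expr::Param(i) = *e {
        *e = Expr::Int(params[i as usize]);
    }
}

fn main() {
    let mut cache = PlanCache { plans: HashMap::new() };
    // COM_STMT_PREPARE: analyze once, cache the resolved AST.
    cache.prepare(1, Stmt { filter: Expr::Param(0) });
    // COM_STMT_EXECUTE: clone + substitute, no re-parse.
    let plan = cache.execute(1, &[42]).unwrap();
    assert_eq!(plan.filter, Expr::Int(42));
}
```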
Running Benchmarks Locally
```shell
# B+ Tree
cargo bench --bench btree -p axiomdb-index

# Storage engine
cargo bench --bench storage -p axiomdb-storage

# SQL parser
cargo bench --bench parser -p axiomdb-sql

# All benchmarks
cargo bench --workspace

# Compare before/after a change
cargo bench -- --save-baseline before
# ... make change ...
cargo bench -- --baseline before

# Detailed comparison with critcmp
cargo install critcmp
critcmp before after
```
Benchmarks use Criterion.rs and emit JSON results to target/criterion/. Each
run reports mean, standard deviation, min, max, and throughput (ops/s or bytes/s
depending on the benchmark).
Design Decisions
This page documents the most consequential architectural choices made during AxiomDB’s design. Each entry explains the alternatives considered, the reasoning, and the trade-offs accepted.
Query Languages
SQL + AxiomQL dual-language strategy
| Aspect | Decision |
|---|---|
| Chosen | Two query languages sharing one AST and executor |
| Alternatives | SQL only; AxiomQL only; SQL-to-AxiomQL transpiler |
| Phase | Phase 12+ (post wire protocol) |
SQL is the primary language. Full MySQL/PostgreSQL wire protocol compatibility. All ORMs, clients, and tools work without changes. Nothing breaks for anyone.
AxiomQL is an optional alternative — a method-chain query language for developers who prefer modern, readable syntax. It compiles to the same Stmt AST as SQL, so there is zero executor overhead and every SQL feature is automatically available in AxiomQL.
SQL ──────┐
├──► AST ──► Optimizer ──► Executor
AxiomQL ───┘
AxiomQL syntax reads top-to-bottom in the logical order of execution:
users
.filter(active, age > 18)
.join(orders)
.group(country, total: count())
.sort(total.desc)
.take(10)
This is already familiar to any developer who uses .filter().map().sort() in JavaScript, Python, Rust, or C#. The learning curve is ~10 minutes.
Why not SQL-only: SQL’s evaluation order (SELECT before FROM, HAVING separate from WHERE) is a 50-year-old quirk that confuses new users. AxiomQL removes the confusion without removing SQL.
Why not AxiomQL-only: Breaking compatibility with every MySQL client, ORM, and tool in existence would be unacceptable. SQL stays.
No existing database has this combination: ORMs like ActiveRecord and Eloquent are application-layer libraries, not native DB languages. PRQL compiles to SQL externally. EdgeQL is native but a different syntax family. AxiomQL would be the first native method-chain language that coexists with SQL in the same engine.
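A toy Rust builder shows the shape of "two front-ends, one plan": both a method chain and a SQL parser could lower into the same plan value. All names here are illustrative, not AxiomQL's actual compiler.

```rust
/// The shared logical plan both front-ends would lower into.
#[derive(Debug, Default, PartialEq)]
struct Plan {
    table: String,
    filters: Vec<String>,
    limit: Option<usize>,
}

/// Method-chain front-end: each call records one clause.
struct Query(Plan);

impl Query {
    fn table(name: &str) -> Self {
        Query(Plan { table: name.into(), ..Default::default() })
    }
    fn filter(mut self, pred: &str) -> Self {
        self.0.filters.push(pred.into());
        self
    }
    fn take(mut self, n: usize) -> Self {
        self.0.limit = Some(n);
        self
    }
    fn build(self) -> Plan {
        self.0
    }
}

fn main() {
    // users.filter(age > 18).take(10) and
    // SELECT * FROM users WHERE age > 18 LIMIT 10
    // would both lower to this same Plan.
    let p = Query::table("users").filter("age > 18").take(10).build();
    assert_eq!(p.table, "users");
    assert_eq!(p.limit, Some(10));
}
```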
Storage
mmap over a Custom Buffer Pool
| Aspect | Decision |
|---|---|
| Chosen | memmap2::MmapMut — OS-managed page cache |
| Alternatives | Custom buffer pool (like InnoDB), io_uring direct I/O |
| Phase | Phase 1 (Storage Engine) |
Why mmap:
- The OS page cache provides LRU eviction, readahead prefetching, and dirty page write-back for free. Implementing these correctly in user space takes months of engineering work.
- Pages returned by `read_page()` are `&Page` references directly into the mapped memory — zero copy from kernel to application.
- MySQL InnoDB maintains a separate buffer pool on top of the OS page cache. The same physical page lives in RAM twice (once in the kernel page cache, once in the buffer pool). mmap eliminates the second copy.
- `msync(MS_SYNC)` provides the same durability guarantee as `fsync` for WAL and checkpoint flushes.
Trade-offs accepted:
- No fine-grained control over eviction policy (OS uses LRU; a custom pool could use clock-sweep with hot/cold zones).
- On 32-bit systems, mmap is limited by the address space. Not a concern for a modern 64-bit server database.
- mmap I/O errors manifest as `SIGBUS` rather than `Err(...)`. These are handled with a signal handler that converts `SIGBUS` to `DbError::Io`.
16 KB Page Size
| Aspect | Decision |
|---|---|
| Chosen | 16,384 bytes (16 KB) |
| Alternatives | 4 KB (SQLite), 8 KB (PostgreSQL), 8 KB (original db.md spec) |
| Phase | Phase 1 |
Why 16 KB:
- The B+ Tree ORDER constants (ORDER_INTERNAL = 223, ORDER_LEAF = 217) yield a highly efficient fan-out with 16 KB pages. At 4 KB, the order would be ~54 for internal nodes — requiring 4× more page reads for the same number of keys.
- At 16 KB, a tree covering 1 billion rows has depth 4. At 4 KB, depth 5 (25% more I/O for every lookup).
- OS readahead typically prefetches 128–512 KB, making 16 KB the sweet spot: small enough that random access is not wasteful, large enough for sequential workloads.
- 64-byte header leaves 16,320 bytes for the body — a natural fit for the `bytemuck::Pod` structs that avoid alignment issues.
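The depth claim can be checked with a small capacity calculation, using the leaf and internal orders quoted above (217 and 223):

```rust
/// Minimum B+ Tree depth needed to address `rows` keys, given a leaf
/// holding `leaf_order` rows and internal nodes fanning out by
/// `internal_order`.
fn min_depth(leaf_order: u64, internal_order: u64, rows: u64) -> u32 {
    let mut depth = 1u32;
    let mut capacity = leaf_order as u128;
    while capacity < rows as u128 {
        capacity *= internal_order as u128; // add one internal level
        depth += 1;
    }
    depth
}

fn main() {
    // At 16 KB pages (ORDER_LEAF = 217, ORDER_INTERNAL = 223),
    // a billion-row tree needs only depth 4:
    // 217 * 223^3 ≈ 2.4 billion addressable rows.
    assert_eq!(min_depth(217, 223, 1_000_000_000), 4);
}
```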
Indexing
Copy-on-Write B+ Tree
| Aspect | Decision |
|---|---|
| Chosen | CoW B+ Tree with AtomicU64 root swap |
| Alternatives | Traditional B+ Tree with read-write locks; LSM-tree (like RocksDB); Fractal tree |
| Phase | Phase 2 (B+ Tree) |
Why CoW B+ Tree:
- Readers are completely lock-free. A `SELECT` on a billion-row table never blocks any concurrent `INSERT`, `UPDATE`, or `DELETE`.
- MVCC is “built in” — readers hold a pointer to the old root and see a consistent snapshot of the tree, exactly as MVCC requires.
- No deadlocks are possible during tree traversal (locks are never held during reads).
- Writes amplify by O(log n) page copies, but at depth 4 this is 4 × 16 KB = 64 KB per insert — acceptable for the target workload (OLTP, not write-heavy OLAP).
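The root-swap mechanism is small enough to sketch with std atomics. This is a simplified model — safe reclamation of pages reachable only from old roots is elided:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// The live root page id. Readers load it once and then traverse a
/// frozen snapshot; writers publish a new root with a single store.
struct Tree {
    root: AtomicU64,
}

impl Tree {
    /// Lock-free snapshot: whatever root we load stays valid, because
    /// CoW never mutates pages reachable from an old root.
    fn snapshot_root(&self) -> u64 {
        self.root.load(Ordering::Acquire)
    }

    /// Writer path: build the new (copied) pages first, then make the
    /// whole tree visible atomically.
    fn publish_root(&self, new_root: u64) {
        self.root.store(new_root, Ordering::Release);
    }
}

fn main() {
    let t = Tree { root: AtomicU64::new(1) };
    let snap = t.snapshot_root(); // a reader pins root 1
    t.publish_root(2);            // a writer publishes a new tree
    assert_eq!(snap, 1);          // the reader's snapshot is unaffected
    assert_eq!(t.snapshot_root(), 2); // new readers see the new root
}
```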
Why not LSM:
- LSM-trees have superior write throughput (sequential I/O only) but inferior read performance (must check multiple levels). AxiomDB’s target is OLTP with read-heavy workloads. A B+ Tree point lookup is O(log n) I/Os; an LSM lookup is O(L) compaction levels, each potentially requiring a disk seek.
- Compaction in LSM introduces unpredictable write amplification spikes that are difficult to tune for latency-sensitive OLTP.
next_leaf Not Used in Range Scans
| Aspect | Decision |
|---|---|
| Chosen | Re-traverse from root to find the next leaf on each boundary crossing |
| Alternatives | Keep the next_leaf linked list consistent under CoW |
| Phase | Phase 2 |
Why: Under CoW, next_leaf pointers in old leaf pages point to other old pages
that may have been freed. Maintaining a consistent linked list under CoW requires
copying the previous leaf on every insert near a boundary — but the previous leaf’s
page_id is not known during a top-down write path without additional bookkeeping.
The cost of the adopted solution (O(log n) per leaf boundary) is acceptable: for a 10,000-row range scan across ~47 leaves (217 rows/leaf), there are 46 boundary crossings, each costing 4 page reads = 184 extra page reads. At a measured scan time of 0.61 ms for 10,000 rows, this is within the 45 ms budget by a factor of 73.
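The arithmetic in this paragraph, made executable:

```rust
/// Extra page reads incurred by re-traversing from the root at each
/// leaf boundary during a range scan.
fn extra_reads(rows: u64, rows_per_leaf: u64, depth: u64) -> u64 {
    // Ceiling division: 10,000 rows / 217 rows-per-leaf = 47 leaves.
    let leaves = (rows + rows_per_leaf - 1) / rows_per_leaf;
    // 46 boundary crossings, each a full root-to-leaf descent.
    (leaves - 1) * depth
}

fn main() {
    // 46 crossings × 4 page reads = 184 extra reads, as in the text.
    assert_eq!(extra_reads(10_000, 217, 4), 184);
}
```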
Durability
WAL Without Double-Write Buffer
| Aspect | Decision |
|---|---|
| Chosen | WAL with per-page CRC32c; no double-write buffer |
| Alternatives | Double-write buffer (MySQL InnoDB); full page WAL images (PostgreSQL) |
| Phase | Phase 3 (WAL) |
Why no double-write:
- MySQL writes each page twice: once to the doublewrite buffer and once to the actual position. The doublewrite buffer protects against torn writes (partial page writes due to power failure mid-write).
- AxiomDB protects against torn writes with a CRC32c checksum per page. If a page has an invalid checksum on startup, it is reconstructed from the WAL. This requires the WAL to contain the information needed for reconstruction — which it does (the WAL records the full new_value for each UPDATE/INSERT).
- Eliminating the double-write buffer halves the disk writes for every dirty page flush.
Trade-off: Recovery requires reading more WAL data. If many pages are corrupted (e.g., a full power failure after a long write batch), recovery replays more WAL entries. In practice, with modern UPS and filesystem journaling, full-file corruption is rare. The WAL’s CRC32c catches partial writes reliably.
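A self-contained illustration of the torn-write check: a bitwise software CRC32c (Castagnoli polynomial, reflected form `0x82F63B78`) verifying an intact page and flagging a simulated partial write. A real engine would typically use the hardware-accelerated instruction instead.

```rust
/// Software CRC32c — a slow but correct stand-in for the
/// SSE4.2/ARMv8-accelerated version a storage engine would use.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78 // reflected Castagnoli poly
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

fn main() {
    let mut page = vec![0xABu8; 64]; // toy "page" body
    let stored = crc32c(&page);
    assert_eq!(crc32c(&page), stored); // intact page verifies

    page[10] ^= 0xFF; // simulate a torn / partial write
    // Checksum mismatch on startup => reconstruct the page from the WAL.
    assert_ne!(crc32c(&page), stored);
}
```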
Physical WAL (not Logical WAL)
| Aspect | Decision |
|---|---|
| Chosen | Physical WAL: records (page_id, slot_id, old_bytes, new_bytes) |
| Alternatives | Logical WAL: records SQL-level operations (INSERT INTO t VALUES…) |
| Phase | Phase 3 |
Why physical:
- Recovery is redo-only: replay each committed WAL entry at its exact physical location. No UNDO pass required (uncommitted changes are simply ignored).
- Physical location (page_id, slot_id) allows direct seek to the affected page — O(1) per WAL entry, not O(log n) B+ Tree traversal.
- The WAL key encodes `page_id:8 + slot_id:2` in 10 bytes, making the physical location self-contained in the WAL record.
Trade-off: Physical WAL entries are larger than logical ones (they contain the full encoded row bytes, not a SQL expression). For a row with 100 bytes of data, the WAL entry is ~100 + 43 bytes overhead = ~143 bytes. A logical WAL entry might be smaller for simple inserts. However, the simplicity and speed of redo-only physical recovery outweighs the size difference.
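A sketch of the 10-byte key layout. Only the `page_id:8 + slot_id:2` split comes from the text; the big-endian choice here is an assumption for illustration.

```rust
/// 10-byte physical WAL key: page_id (8 bytes) then slot_id (2 bytes).
/// Big-endian is assumed here so keys sort by page, then slot.
fn encode_key(page_id: u64, slot_id: u16) -> [u8; 10] {
    let mut key = [0u8; 10];
    key[..8].copy_from_slice(&page_id.to_be_bytes());
    key[8..].copy_from_slice(&slot_id.to_be_bytes());
    key
}

fn decode_key(key: &[u8; 10]) -> (u64, u16) {
    let page_id = u64::from_be_bytes(key[..8].try_into().unwrap());
    let slot_id = u16::from_be_bytes(key[8..].try_into().unwrap());
    (page_id, slot_id)
}

fn main() {
    let key = encode_key(42, 7);
    assert_eq!(key.len(), 10); // self-contained physical location
    assert_eq!(decode_key(&key), (42, 7)); // round-trips losslessly
}
```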
SQL Processing
logos for Lexing
| Aspect | Decision |
|---|---|
| Chosen | logos crate — compiled DFA |
| Alternatives | nom combinators; pest PEG; hand-written lexer; lalrpop |
| Phase | Phase 4.2 (SQL Lexer) |
Why logos:
- logos compiles all token patterns (keywords, identifiers, literals) into a single DFA at build time. Runtime cost per character is a table lookup — 1–3 CPU instructions.
- The `ignore(ascii_case)` attribute makes keyword matching case-insensitive with no runtime cost (the DFA is built with both cases folded).
- Zero-copy: `Ident(&'src str)` slices into the input without heap allocation.
- Measured throughput: 9–17× faster than sqlparser-rs for the same inputs.
nom is an excellent choice for context-free parsing with backtracking but is over-engineered for a lexer: a lexer is a regular language (no backtracking needed), and DFA is the optimal algorithm for it.
Zero-Copy Tokens
| Aspect | Decision |
|---|---|
| Chosen | Token::Ident(&'src str) — lifetime-tied reference into the input |
| Alternatives | Token::Ident(String) — owned heap allocation; Token::Ident(Arc<str>) |
| Phase | Phase 4.2 |
Why zero-copy:
- Heap allocation per identifier would cost ~30 ns on modern hardware (involving a `malloc` call). For a query with 20 identifiers, that is 600 ns of allocation overhead.
- At 2M queries/s (the target throughput), 600 ns per query consumes 1.2 s per second of CPU time in allocations — impossible to sustain.
- Zero-copy tokens require the input string to outlive the token stream, which is a natural constraint: the input is always available until the query finishes.
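The lifetime-tied token shape, with a toy whitespace splitter standing in for the logos-generated DFA:

```rust
/// Lifetime-tied tokens: each variant borrows from the input string
/// instead of allocating an owned String.
#[derive(Debug, PartialEq)]
enum Token<'src> {
    Ident(&'src str),
    Number(&'src str),
}

/// Toy lexer: splits on whitespace and classifies each word. The real
/// lexer is a compiled DFA; only the zero-copy output shape matters here.
fn lex(input: &str) -> Vec<Token<'_>> {
    input
        .split_whitespace()
        .map(|w| {
            if w.chars().all(|c| c.is_ascii_digit()) {
                Token::Number(w)
            } else {
                Token::Ident(w)
            }
        })
        .collect()
}

fn main() {
    let sql = "age 18";
    let tokens = lex(sql);
    // `Ident("age")` is a borrowed slice of `sql`, not a heap copy —
    // the token stream cannot outlive the input, by construction.
    assert_eq!(tokens, vec![Token::Ident("age"), Token::Number("18")]);
}
```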
MVCC Implementation
RowHeader in Heap Pages (not Undo Tablespace)
| Aspect | Decision |
|---|---|
| Chosen | MVCC metadata (xmin, xmax, deleted) in each heap row |
| Alternatives | Separate undo tablespace (MySQL InnoDB); version chain in B+ Tree (PostgreSQL MVCC heap) |
| Phase | Phase 3 (TxnManager) |
Why inline RowHeader:
- A historical row version is visible in its original heap location. No additional I/O is needed to read old versions — they are in the same page as the current version.
- MySQL’s undo tablespace (`ibdata1`) requires additional I/O for reads that need old row versions (the reader follows a pointer chain from the clustered index into the undo tablespace).
- Inline metadata is simpler to implement and audit.
Trade-offs:
- Dead rows occupy space in the heap until `VACUUM` (Phase 9) cleans them up.
- The `RowHeader` adds 24 bytes overhead per row. For a table with 50-byte average rows, this is 32% overhead. Acceptable for the generality it provides.
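A reduced visibility check over an inline header. Real MVCC also consults commit state; this sketch assumes every txn id at or below the snapshot has committed.

```rust
/// Inline MVCC metadata, sketched. Field layout is illustrative,
/// not AxiomDB's actual 24-byte header.
struct RowHeader {
    xmin: u64,         // txn that created this version
    xmax: Option<u64>, // txn that deleted it, if any
}

/// A version is visible to a snapshot if it was created at or before
/// the snapshot and not yet deleted as of the snapshot.
fn visible(row: &RowHeader, snapshot_txn: u64) -> bool {
    row.xmin <= snapshot_txn && row.xmax.map_or(true, |x| x > snapshot_txn)
}

fn main() {
    let row = RowHeader { xmin: 5, xmax: Some(9) };
    assert!(visible(&row, 7));   // created before, deleted after: visible
    assert!(!visible(&row, 10)); // already deleted at this snapshot
    assert!(!visible(&row, 4));  // not yet created
}
```

Because the header sits next to the row bytes, this check needs no extra I/O — exactly the property the inline design buys.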
Collation
UCA Root as Default Collation
| Aspect | Decision |
|---|---|
| Chosen | Unicode Collation Algorithm (UCA) root for string comparison |
| Alternatives | ASCII byte order; locale-specific collation; C locale (PostgreSQL default) |
| Phase | Phase 4 (Types) |
Why UCA root:
- ASCII byte order (`strcmp`) gives incorrect ordering for most non-English text: ‘ä’ sorts after ‘z’ in byte order, but should sort near ‘a’.
- UCA root is locale-neutral (deterministic across any server environment) while still correct for most languages.
- MySQL’s default collation (utf8mb4_general_ci) is not standards-compliant.
- UCA root is implemented by the `icu` crate — the same algorithm modern browsers use for `Intl.Collator`.
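The mis-ordering is easy to demonstrate with plain byte comparison. (A UCA collator — e.g. the `icu` crate's, not used here to keep the snippet dependency-free — would instead place ‘ä’ near ‘a’.)

```rust
fn main() {
    // Byte-wise comparison of UTF-8 puts 'ä' (0xC3 0xA4)
    // after 'z' (0x7A), which is wrong for human-facing ordering.
    assert!("ä".as_bytes() > "z".as_bytes());
}
```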
WAL Optimization
Per-Page WAL Entries (PageWrite) vs Per-Row WAL Entries
| Aspect | Decision |
|---|---|
| Chosen | EntryType::PageWrite = 9: one WAL entry per heap page for bulk inserts |
| Alternatives | Per-row Insert entries (original approach); full redo log (PostgreSQL WAL) |
| Phase | Phase 3.18 |
Why per-page:
For bulk inserts (INSERT INTO t VALUES (r1),(r2),...), the per-row approach writes one
WAL entry per row: 10,000 rows = 10,000 serialize_into() calls + 10,000 CRC32c
computations. Per-page replaces these with ~42 entries (one per 16 KB page, holding ~240
rows each) — 238× fewer serializations and a 30% smaller WAL file.
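The entry-count arithmetic from this paragraph, made executable:

```rust
/// WAL entry counts for a bulk insert: one entry per row vs one
/// PageWrite entry per filled heap page.
fn wal_entries(rows: u64, rows_per_page: u64) -> (u64, u64) {
    let per_row = rows;
    // Ceiling division: one PageWrite entry per touched page.
    let per_page = (rows + rows_per_page - 1) / rows_per_page;
    (per_row, per_page)
}

fn main() {
    let (per_row, per_page) = wal_entries(10_000, 240);
    assert_eq!(per_row, 10_000);
    assert_eq!(per_page, 42);            // ~42 PageWrite entries
    assert_eq!(per_row / per_page, 238); // ≈238× fewer serializations
}
```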
The PageWrite entry format stores:
- The full post-modification page bytes (`new_value[0..PAGE_SIZE]`) — available for future REDO-based power-failure recovery (Phase 3.8b).
- The inserted slot IDs (`new_value[PAGE_SIZE+2..]`) — used by crash recovery to undo uncommitted `PageWrite` entries by marking each slot dead, identical in effect to undoing N individual `Insert` entries.
Trade-offs accepted:
- Each `PageWrite` entry is ~16 KB vs ~100 bytes for an `Insert` entry. For sparse inserts (a few rows per page), `PageWrite` is larger. The optimization only applies to `insert_rows_batch()` (multi-row INSERT) — single-row inserts still use `Insert` entries.
- Crash recovery must parse the embedded slot list instead of simply reading a single physical location. The parsing is O(num_slots) per entry — still O(N) total, the same asymptotic cost.
Why not a full redo log (like PostgreSQL WAL):
PostgreSQL writes a physical page image + logical redo records for every page modification.
Our PageWrite is a simplified version: we write only the post-image (for bulk inserts)
and rely on the existing in-memory undo log for rollback. Full redo would require per-page
LSNs and a replay pass on startup — reserved for Phase 3.8b.
An alternative considered was to identify the slots to undo by scanning heap pages for rows where `txn_id_created == crashed_txn_id`. We rejected this because it requires reading the page from storage during the crash-recovery scan — before the undo phase even begins. Embedding the slot IDs in the `PageWrite` entry keeps crash recovery a pure WAL read pass: no storage I/O is needed to determine what to undo.
Content-Addressed BLOB Storage (Planned Phase 6)
| Aspect | Decision |
|---|---|
| Planned | SHA-256 content address as the BLOB key in a dedicated BLOB store |
| Alternatives | Inline BLOB in the heap (PostgreSQL TOAST); external file reference |
| Phase | Phase 6 |
Why content-addressed:
- Two rows storing the same attachment (e.g., a company logo in every invoice) share exactly one copy on disk. Deduplication is automatic and requires no extra schema.
- The BLOB store is append-only with immutable entries — no locking on BLOB reads.
- Deletion is handled by reference counting: when the last row referencing a BLOB is deleted, the BLOB can be garbage collected.
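A sketch of the planned design: content address as key, refcounted deduplication. std's `DefaultHasher` stands in for SHA-256 so the example is self-contained; a real store must use a cryptographic hash to make collisions negligible.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content-addressed store sketch: address -> (bytes, refcount).
struct BlobStore {
    blobs: HashMap<u64, (Vec<u8>, u64)>,
}

impl BlobStore {
    /// Derive the address from the content itself. SHA-256 in the
    /// real design; DefaultHasher here for a dependency-free sketch.
    fn address(data: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        h.finish()
    }

    /// Insert dedupes: identical content bumps a refcount instead of
    /// storing a second physical copy.
    fn insert(&mut self, data: &[u8]) -> u64 {
        let addr = Self::address(data);
        self.blobs
            .entry(addr)
            .and_modify(|(_, rc)| *rc += 1)
            .or_insert_with(|| (data.to_vec(), 1));
        addr
    }

    /// Release garbage-collects the blob once the last reference is gone.
    fn release(&mut self, addr: u64) {
        if let Some((_, rc)) = self.blobs.get_mut(&addr) {
            *rc -= 1;
            if *rc == 0 {
                self.blobs.remove(&addr);
            }
        }
    }
}

fn main() {
    let mut store = BlobStore { blobs: HashMap::new() };
    let a = store.insert(b"company logo bytes");
    let b = store.insert(b"company logo bytes"); // same attachment again
    assert_eq!(a, b);                 // same content => same address
    assert_eq!(store.blobs.len(), 1); // exactly one copy on disk
    store.release(a);
    store.release(b);
    assert!(store.blobs.is_empty()); // GC'd after the last reference
}
```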