SQL Executor

The executor is the component that interprets an analyzed Stmt (all column references resolved to col_idx by the semantic analyzer) and drives it to completion, returning a QueryResult. It is the highest-level component in the query pipeline.

Since subphase 5.19a, the executor no longer lives in a single source file. It is organized under crates/axiomdb-sql/src/executor/ with mod.rs as the stable facade and responsibility-based source files behind it.

Source Layout

| File | Responsibility |
|---|---|
| executor/mod.rs | public facade, statement dispatch, thread-local last-insert-id |
| executor/shared.rs | helpers shared across multiple statement families |
| executor/select.rs | SELECT entrypoints, projection, ORDER BY/LIMIT wiring |
| executor/select_core.rs | core SELECT evaluation loop (heap + clustered paths) |
| executor/select_ctx.rs | SelectCtx builder and scan context |
| executor/select_helpers.rs | projection, column resolution, row materialization |
| executor/select_joins_ctx.rs | join context and nested-loop join helpers |
| executor/aggregate.rs | GROUP BY, aggregates, DISTINCT/group-key helpers |
| executor/insert.rs | INSERT and INSERT … SELECT paths |
| executor/insert_heap_ctx.rs | heap INSERT context and batch path |
| executor/insert_helpers.rs | AUTO_INCREMENT, constraint checks, default values |
| executor/update_ctx.rs | UPDATE context builder |
| executor/update_candidates.rs | candidate row collection for heap/clustered UPDATE |
| executor/update_entry.rs | single-row UPDATE application |
| executor/update_clustered.rs | clustered UPDATE orchestration |
| executor/update_clustered_helpers.rs | clustered UPDATE helpers (patch, zero-alloc fast path) |
| executor/update_fused_range.rs | fused clustered scan+patch for range UPDATE |
| executor/delete.rs | DELETE execution and candidate collection |
| executor/bulk_empty.rs | shared bulk-empty helpers for DELETE/TRUNCATE |
| executor/staging.rs | transactional INSERT staging flushes and barrier handling |

⚙️
⚙️
Design Decision — File-Level Split First The first goal of `5.19a` was to make the executor readable and changeable without changing SQL behavior. The split preserves the existing facade and keeps later DML optimizations isolated from unrelated SELECT and DDL code.

Integration Test Layout

The executor integration coverage no longer sits in one giant test binary. The current axiomdb-sql/tests/ layout is responsibility-based, mirroring the module split in src/executor/.

| Binary | Main responsibility |
|---|---|
| integration_executor | CRUD base and simple transaction behavior |
| integration_executor_joins | JOINs and aggregate execution |
| integration_executor_query | ORDER BY, LIMIT, DISTINCT, CASE, INSERT ... SELECT, AUTO_INCREMENT |
| integration_executor_ddl | SHOW, DESCRIBE, TRUNCATE, ALTER TABLE |
| integration_executor_ctx | base SessionContext execution and strict_mode |
| integration_executor_ctx_group | ctx-path sorted group-by |
| integration_executor_ctx_limit | ctx-path LIMIT / OFFSET coercion |
| integration_executor_ctx_on_error | ctx-path on_error behavior |
| integration_executor_sql | broader SQL semantics outside the ctx path |
| integration_delete_apply | bulk and indexed DELETE apply paths |
| integration_insert_staging | transactional INSERT staging |
| integration_namespacing | database catalog behavior: CREATE/DROP DATABASE, USE, SHOW DATABASES |
| integration_namespacing_cross_db | explicit database.schema.table resolution and cross-db DML/DDL |
| integration_namespacing_schema | schema namespacing, search_path, and schema-aware SHOW TABLES |

Shared helpers live in crates/axiomdb-sql/tests/common/mod.rs.

The day-to-day workflow is intentionally narrow:

  • start with the smallest binary that matches the code path you changed
  • add directly related binaries only when the change touches shared helpers or a nearby execution path
  • use cargo test -p axiomdb-sql --tests as the crate-level confidence gate, not as the default inner-loop command
  • if a new behavior belongs to an existing themed binary, add the test there instead of creating a new binary immediately
cargo test -p axiomdb-sql --test integration_executor_query
cargo test -p axiomdb-sql --test integration_executor_query test_insert_select_aggregation -- --exact
⚙️
Design Decision — Smallest Relevant Test First The test split follows the same idea as the executor source split: isolate one execution path per binary so everyday validation can run only the code that actually changed, while still keeping closely related behavior together in the same harness.

UPDATE Apply Fast Path (6.20)

6.17 fixed indexed UPDATE discovery, but the default update_range benchmark was still paying most of its cost after rows had already been found. 6.20 removes that apply-side overhead in four steps:

  1. IndexLookup / IndexRange candidates are decoded through TableEngine::read_rows_batch(...), which groups RecordIds by page_id and restores the original RID order after each page is read once.
  2. UPDATE evaluates the new row image before touching the heap and drops rows whose new_values == old_values.
  3. Stable-RID rewrites accumulate (key, old_tuple_image, new_tuple_image, page_id, slot_id) and emit their normal UpdateInPlace WAL records through one record_update_in_place_batch(...) call.
  4. If any index really is affected, UPDATE now does one grouped delete pass, one grouped insert pass, and one final root persistence write per index.

The coarse executor bailout is statement-level:

if all physically changed rows keep the same RID
   and no SET column overlaps any index key column
   and no SET column overlaps any partial-index predicate dependency:
    skip index maintenance for the statement

This covers the common PK-only case in local_bench.py: an UPDATE that sets score over a WHERE id BETWEEN ... range.
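The statement-level bailout above can be sketched as a pure predicate over the statement's SET columns. The helper names here (can_skip_index_maintenance, set_overlaps) are illustrative, not the actual function names in executor/:

```rust
// Illustrative sketch of the 6.20 statement-level bailout: skip index
// maintenance only when no changed row moved and no SET column can affect
// any index key or partial-index predicate.
fn set_overlaps(set_cols: &[usize], other_cols: &[usize]) -> bool {
    set_cols.iter().any(|c| other_cols.contains(c))
}

fn can_skip_index_maintenance(
    all_rids_stable: bool,        // every physically changed row kept its RID
    set_cols: &[usize],           // columns written by SET
    index_key_cols: &[Vec<usize>],   // key columns per index
    predicate_deps: &[Vec<usize>],   // partial-index predicate columns per index
) -> bool {
    all_rids_stable
        && index_key_cols.iter().all(|k| !set_overlaps(set_cols, k))
        && predicate_deps.iter().all(|d| !set_overlaps(set_cols, d))
}

fn main() {
    // UPDATE setting a non-key column with an index on column 0: skip.
    assert!(can_skip_index_maintenance(true, &[2], &[vec![0]], &[vec![]]));
    // SET column overlaps an index key column: must maintain the index.
    assert!(!can_skip_index_maintenance(true, &[0], &[vec![0]], &[vec![]]));
}
```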

⚙️
Borrowed Modified-Attrs Rule PostgreSQL HOT, SQLite's indexColumnIsBeingUpdated(), and MariaDB's clustered-vs-secondary UPDATE split all ask the same question: did any index-relevant attribute actually change? `6.20` adapts that rule directly, without adding HOT chains or change buffering.
🚀
Performance Advantage On the release `update_range` benchmark, `6.20` raises AxiomDB from 85.2K to 369.9K rows/s by removing per-row heap reads, per-row no-op rewrites, and per-row `UpdateInPlace` append overhead from the default PK-only path.

Entry Point

pub fn execute(
    stmt: Stmt,
    storage: &mut dyn StorageEngine,
    txn: &mut TxnManager,
) -> Result<QueryResult, DbError>

When no transaction is active, execute wraps the statement in an implicit BEGIN / COMMIT (autocommit mode). Transaction control statements (BEGIN, COMMIT, ROLLBACK) bypass autocommit and operate on TxnManager directly.

All reads use txn.active_snapshot()? — a snapshot fixed at BEGIN — so that writes made earlier in the same transaction are visible (read-your-own-writes).

Transactional INSERT staging (Phase 5.21)

5.21 adds a statement-boundary staging path for consecutive INSERT ... VALUES statements inside one explicit transaction.

Data structure

SessionContext now owns:

PendingInsertBatch {
    table_id: u32,
    table_def: TableDef,
    columns: Vec<ColumnDef>,
    indexes: Vec<IndexDef>,
    compiled_preds: Vec<Option<Expr>>,
    rows: Vec<Vec<Value>>,
    unique_seen: HashMap<u32, HashSet<Vec<u8>>>,
}

The buffer exists only while the connection is inside an explicit transaction. Autocommit-wrapped single statements do not use it.

Enqueue path

For every eligible INSERT row, executor/insert.rs does all logical work up front:

  1. evaluate expressions
  2. expand omitted columns
  3. assign AUTO_INCREMENT if needed
  4. run CHECK constraints
  5. run FK child validation
  6. reject duplicate UNIQUE / PK keys against:
    • committed index state
    • unique_seen inside the current batch
  7. append the fully materialized row to PendingInsertBatch.rows

No heap write or WAL append happens yet.
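Step 6's intra-batch duplicate check against unique_seen can be sketched as follows. The function name and String error are illustrative; the real path rejects with a DbError::UniqueViolation using the index layer's key encoding:

```rust
use std::collections::{HashMap, HashSet};

// Sketch: per-index set of key bytes already staged in this batch.
// A key that was already staged is a duplicate even though nothing
// has been written to the heap or WAL yet.
fn check_and_record_unique(
    unique_seen: &mut HashMap<u32, HashSet<Vec<u8>>>,
    index_id: u32,
    key_bytes: Vec<u8>,
) -> Result<(), String> {
    let seen = unique_seen.entry(index_id).or_default();
    if !seen.insert(key_bytes) {
        return Err("UniqueViolation".to_string()); // duplicate inside the batch
    }
    Ok(())
}

fn main() {
    let mut seen = HashMap::new();
    assert!(check_and_record_unique(&mut seen, 1, b"k1".to_vec()).is_ok());
    assert!(check_and_record_unique(&mut seen, 1, b"k1".to_vec()).is_err());
    // The same key bytes under a different unique index are unrelated.
    assert!(check_and_record_unique(&mut seen, 2, b"k1".to_vec()).is_ok());
}
```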

Flush barriers

The batch is flushed before:

  • SELECT
  • UPDATE
  • DELETE
  • DDL
  • COMMIT
  • table switch to another INSERT target
  • any ineligible INSERT shape

ROLLBACK discards the batch without heap or WAL writes.

Savepoint ordering invariant

When a transaction uses statement-level savepoints (rollback_statement, savepoint, ignore), the executor must flush staged rows before taking the next statement savepoint if the current statement cannot continue the batch.

Without that ordering, a failing statement after a table switch could roll back rows that logically belonged to earlier successful INSERT statements.

⚙️
Design Decision — Flush Before Savepoint The first `5.21` implementation flushed a table-switch batch inside the next INSERT handler, which placed the flush after the statement savepoint and let a later duplicate-key error roll it back. The final design moves that decision to the statement boundary so savepoint semantics stay identical to pre-staging behavior.

Flush algorithm

executor/staging.rs performs:

  1. TableEngine::insert_rows_batch_with_ctx(...)
  2. batch_insert_into_indexes(...)
  3. one CatalogWriter::update_index_root(...) per changed index
  4. stats update

The current design still inserts index entries row-by-row inside the flush. That cost is explicit and remains the next insert-side optimization candidate if future profiling shows it dominates after staging.


ClusteredInsertBatch (Phase 40.1)

Phase 40.1 extends the PendingInsertBatch pattern to clustered (primary-key ordered) tables, eliminating the per-row CoW B-tree overhead that made clustered inserts 2× slower than heap inserts inside explicit transactions.

Root cause of the pre-40.1 gap

Before 40.1, every clustered INSERT inside an explicit transaction called apply_clustered_insert_rows immediately, which performs:

  1. storage.read_page(root) — 16 KB page read
  2. storage.write_page(new_root, page) — 16 KB CoW page write
  3. WAL append
  4. Secondary index write

For N = 50 000 rows that is 100 000 storage operations just for the base tree.

Data structures

// session.rs
pub struct StagedClusteredRow {
    pub values: Vec<Value>,
    pub encoded_row: Vec<u8>,
    pub primary_key_values: Vec<Value>,
    pub primary_key_bytes: Vec<u8>,
}

pub struct ClusteredInsertBatch {
    pub table_id: u32,
    pub table_def: TableDef,
    pub primary_idx: IndexDef,
    pub secondary_indexes: Vec<IndexDef>,
    pub secondary_layouts: Vec<ClusteredSecondaryLayout>,
    pub compiled_preds: Vec<Option<Expr>>,
    pub rows: Vec<StagedClusteredRow>,
    pub staged_pks: HashSet<Vec<u8>>,   // O(1) intra-batch PK dedup
}

StagedClusteredRow is structurally identical to PreparedClusteredInsertRow (defined in clustered_table.rs) but lives in session.rs to avoid a circular dependency: clustered_table.rs imports SessionContext.

Enqueue path (enqueue_clustered_insert_ctx)

For each row in the VALUES list:

  1. Evaluate expressions, expand columns, assign AUTO_INCREMENT.
  2. Validate CHECK constraints and FK child references.
  3. Encode via prepare_row_with_ctx (coerce + PK extract + row codec).
  4. Check staged_pks — return UniqueViolation and discard batch on intra-batch PK duplicate.
  5. Push StagedClusteredRow and insert PK bytes into staged_pks.

Committed-data PK duplicates are caught at flush time by lookup_physical inside apply_clustered_insert_rows (same as the pre-40.1 single-statement path).

Flush path (flush_clustered_insert_batch)

1. Sort staged rows ascending by pk_bytes
   → enables append-biased detection in apply_clustered_insert_rows
2. Convert StagedClusteredRow → PreparedClusteredInsertRow (field move)
3. Call apply_clustered_insert_rows (existing function):
     a. detect append-biased pattern (all PKs increasing)
     b. loop: try_insert_rightmost_leaf_batch → fast O(leaf_capacity) write
        fallback: single-row clustered_tree::insert
     c. WAL record_clustered_insert per row
     d. maintain_clustered_secondary_inserts per row
     e. persist changed roots
4. ctx.stats.on_rows_changed + ctx.invalidate_all
🚀
55.9K rows/s vs MySQL 8.0 InnoDB 35K rows/s For 50K sequential PK rows in one explicit transaction, AxiomDB 40.1 delivers 55.9K rows/s — +59% faster than MySQL 8.0 InnoDB's ~35K rows/s. MySQL's buffer pool amortizes page writes across multiple connections; AxiomDB's batch achieves the same effect for a single connection by deferring all writes to flush time and using try_insert_rightmost_leaf_batch to fill each leaf page once.

Barrier detection

should_flush_clustered_batch_before_stmt returns false only when the next statement is a VALUES INSERT into the same clustered table (batch continues). For all other statements, the batch is flushed before dispatch. This mirrors the existing should_flush_pending_inserts_before_stmt logic for heap tables.

ROLLBACK discards the batch via discard_clustered_insert_batch() — no storage writes, no WAL entries, no undo needed.


CREATE INDEX on clustered tables (Phase 40.1b)

execute_create_index (ddl.rs) now handles both heap and clustered tables with a single function. The dispatch happens after the B-Tree root page is allocated:

if table_def.is_clustered() {
    primary_idx  ← CatalogReader::list_indexes → find(is_primary)
    preview_def  ← IndexDef { columns, is_unique, fillfactor, root_page_id, … }
    layout       ← ClusteredSecondaryLayout::derive(&preview_def, &primary_idx)
    rows         ← scan_clustered_table(storage, &table_def, &col_defs, snap)
    for row in rows:
        if partial predicate → skip non-matching rows
        entry = layout.entry_from_row(row)  → physical_key for bloom
        layout.insert_row(storage, &root_pid, row)  → uniqueness + B-Tree insert
} else {
    rows ← scan_table(…)
    for row in rows: encode_index_key + BTree::insert_in
}
// step 8: stats bootstrap uses same `rows` Vec — no extra I/O

The ClusteredSecondaryLayout encodes the physical key as secondary_cols ++ suffix_primary_cols — exactly the format used by runtime INSERT/UPDATE/DELETE, so a clustered secondary index built by CREATE INDEX is byte-for-byte compatible with those written by the DML executors.

entry_from_row is called once per row to collect the physical key for the bloom filter, and insert_row calls it again internally for the B-Tree write. This is acceptable overhead during a DDL operation (O(n) with constant factor ≈2).

NULL handling

ClusteredSecondaryLayout::entry_from_row returns None when any secondary column is NULL. Both the bloom key collection and the B-Tree insert are skipped in that case, consistent with the runtime INSERT path and SQL standard NULL semantics for indexes.

Uniqueness enforcement

insert_row delegates uniqueness to ensure_unique_logical_key_absent, the same function used at runtime. If an existing row already carries that logical key, the build fails with DbError::UniqueViolation before the catalog entry is written.


Query Pipeline

SQL string
  → tokenize()         logos DFA, ~85 tokens, zero-copy &str
  → parse()            recursive descent, produces Stmt with col_idx = 0
  → analyze()          BindContext resolves every col_idx
  → execute()          dispatches to per-statement handler
      ├── scan_table   HeapChain::scan_visible + decode_row
      ├── filter       eval(WHERE, &row) + is_truthy
      ├── join         nested-loop, apply_join
      ├── aggregate    hash-based GroupState
      ├── sort         apply_order_by, compare_sort_values
      ├── deduplicate  apply_distinct, value_to_key_bytes
      ├── project      project_row / project_grouped_row
      └── paginate     apply_limit_offset
  → QueryResult::Rows / Affected / Empty

JOIN — Nested Loop

Phase 4 implements nested-loop joins. All tables are pre-scanned once before any loop begins — scanning inside the inner loop would re-read the same data O(n) times and could see partially-inserted rows.

Algorithm

scanned[0] = scan(FROM table)
scanned[1] = scan(JOIN[0] table)
...

combined_rows = scanned[0]
for each JoinClause in stmt.joins:
    combined_rows = apply_join(combined_rows, scanned[i+1], join_type, ON/USING)

apply_join per type

| Join type | Behavior |
|---|---|
| INNER / CROSS | Emit combined row for each pair where ON is truthy |
| LEFT | Emit all left rows; unmatched left → right side padded with NULL |
| RIGHT | Emit all right rows; unmatched right → left side padded with NULL; uses a matched_right: Vec<bool> bitset |
| FULL | NotImplemented — Phase 4.8+ |

USING condition

USING(col_name) is resolved at execution time using left_schema: Vec<(name, col_idx)>, accumulated across all join stages. The condition combined[left_idx] == combined[right_idx] uses SQL equality — NULL = NULL returns UNKNOWN (false), so NULLs never match in USING.

⚙️
Design Decision — Pre-scan Before Loop All tables are scanned once before the nested-loop begins. This is the primary anti-pattern to avoid: scanning inside the inner loop re-reads data O(n) times and, for LEFT/RIGHT joins that modify the heap, can observe partially-inserted rows. Pre-scanning also enables the RIGHT JOIN bitset pattern, which requires knowing the total right-side row count upfront.
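The RIGHT JOIN bitset pattern described above can be sketched in a self-contained form. This simplifies the ON condition to integer key equality; the real apply_join evaluates the ON expression per combined row pair:

```rust
// Minimal nested-loop RIGHT JOIN sketch: both sides are pre-scanned slices,
// and matched_right requires knowing the total right-side row count upfront.
// NULL padding on the unmatched left side is modeled as None.
fn right_join(
    left: &[(i64, &'static str)],
    right: &[(i64, &'static str)],
) -> Vec<(Option<&'static str>, &'static str)> {
    let mut matched_right = vec![false; right.len()];
    let mut out = Vec::new();
    for l in left {
        for (i, r) in right.iter().enumerate() {
            if l.0 == r.0 {
                matched_right[i] = true;       // remember this right row matched
                out.push((Some(l.1), r.1));
            }
        }
    }
    for (i, r) in right.iter().enumerate() {
        if !matched_right[i] {
            out.push((None, r.1));             // unmatched right → left NULL-padded
        }
    }
    out
}

fn main() {
    let left = [(1, "a")];
    let right = [(1, "x"), (2, "y")];
    assert_eq!(right_join(&left, &right), vec![(Some("a"), "x"), (None, "y")]);
}
```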

GROUP BY — Strategy Selection (Phase 4.9b)

The executor selects between two GROUP BY execution strategies at runtime:

| Strategy | When selected | Behavior |
|---|---|---|
| Hash | Default; JOINs; derived tables; plain scans | HashMap per group key; O(k) memory |
| Sorted { presorted: true } | Single-table ctx path + compatible B-Tree index | Stream adjacent equal groups; O(1) memory |

enum GroupByStrategy {
    Hash,
    Sorted { presorted: bool },
}

Strategy selection (choose_group_by_strategy_ctx) is only active on the single-table ctx path (execute_with_ctx). All JOIN, derived-table, and non-ctx paths use Hash.

Prefix Match Rule

The sorted strategy is selected when all four conditions hold:

  1. Access method is IndexLookup, IndexRange, or IndexOnlyScan.
  2. Every GROUP BY expression is a plain Expr::Column (no function calls, no aliases).
  3. The column references match the leading key prefix of the chosen index in the same order.
  4. The prefix length ≤ number of index columns.

Examples (index (region, dept)):

| GROUP BY | Result |
|---|---|
| region, dept | ✅ Sorted |
| region | ✅ Sorted (prefix) |
| dept, region | ❌ Hash (wrong order) |
| LOWER(region) | ❌ Hash (computed expression) |

This is correct because BTree::range_in guarantees rows arrive in key order, and equal leading prefixes are contiguous even with extra suffix columns or RID suffixes on non-unique indexes.

⚙️
Design Decision — Borrowed from PostgreSQL + DuckDB PostgreSQL keeps GroupAggregate (sorted) and HashAggregate as separate strategies selected at planning time (pathnodes.h). DuckDB selects the aggregation strategy at physical plan time based on input guarantees. AxiomDB borrows the two-strategy concept but selects at execution time using the already-chosen access method — no separate planner pass needed.

GROUP BY — Hash Aggregation

GROUP BY uses a single-pass hash aggregation strategy: one scan through the filtered rows, accumulating aggregate state per group key.

Specialized Hash Tables (subphase 39.21)

Two hash table types avoid generic dispatch overhead:

  • GroupTablePrimitive — single-column GROUP BY on integer-like values (INT, BIGINT, DOUBLE, Bool). Maps i64 → GroupEntry via hashbrown::HashMap<i64, usize>. No key serialization needed; comparison is a single integer equality check.

  • GroupTableGeneric — multi-column GROUP BY, TEXT columns, mixed types, and the global no-GROUP-BY case. Serializes group keys into a Vec<u8> reused across rows (zero allocation when capacity fits), maps &[u8] → GroupEntry via hashbrown::HashMap<Box<[u8]>, usize>.

Both tables store entries in a Vec<GroupEntry> and use the hash maps as index structures. This keeps entries contiguous in memory and avoids pointer chasing during the accumulation loop.

🚀
Performance Advantage — hashbrown vs std HashMap hashbrown (the same table that backs Rust's std::collections::HashMap) uses SIMD-accelerated quadratic probing (SSE2/NEON). For a 62-group workload over 50K rows, this cuts probe overhead by ~30% vs a naïve open-addressing table. The specialized GroupTablePrimitive path avoids serialization entirely, reducing per-row work to one integer hash + one equality check.

Group Key Serialization

Value contains f64 which does not implement Hash in Rust. AxiomDB uses a custom self-describing byte serialization instead of the row codec:

value_to_key_bytes(Value::Null)        → [0x00]
value_to_key_bytes(Value::Int(n))      → [0x02, n as 4 LE bytes]
value_to_key_bytes(Value::Text(s))     → [0x06, len as 4 LE bytes, UTF-8 bytes]
...

Two NULL values produce identical bytes [0x00] → they form one group. This matches SQL GROUP BY semantics: NULLs are considered equal for grouping (unlike NULL = NULL in comparisons, which is UNKNOWN).

The group key for a multi-column GROUP BY is the concatenation of all column serializations. The key_buf: Vec<u8> is allocated once before the scan loop and reused (with clear() + extend_from_slice) for every row, so multi-column GROUP BY does not allocate per row for the probe step.
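A runnable sketch of this serialization, restricted to the three variants listed above; the tag bytes and widths follow the examples (0x00 for NULL, 0x02 for Int, 0x06 for Text), and the buffer-reuse pattern mirrors key_buf:

```rust
// Minimal stand-in for Value, covering only the variants shown above.
enum Value {
    Null,
    Int(i32),
    Text(String),
}

// Self-describing key serialization: a tag byte followed by a
// fixed- or length-prefixed payload, appended into a reusable buffer.
fn value_to_key_bytes(v: &Value, out: &mut Vec<u8>) {
    match v {
        Value::Null => out.push(0x00),
        Value::Int(n) => {
            out.push(0x02);
            out.extend_from_slice(&n.to_le_bytes()); // 4 LE bytes
        }
        Value::Text(s) => {
            out.push(0x06);
            out.extend_from_slice(&(s.len() as u32).to_le_bytes()); // len, 4 LE bytes
            out.extend_from_slice(s.as_bytes());
        }
    }
}

fn main() {
    let mut buf = Vec::new();
    value_to_key_bytes(&Value::Null, &mut buf);
    assert_eq!(buf, [0x00]); // two NULLs serialize identically → one group

    buf.clear(); // reused across rows, like key_buf in the scan loop
    value_to_key_bytes(&Value::Int(7), &mut buf);
    assert_eq!(buf, [0x02, 7, 0, 0, 0]);

    buf.clear();
    value_to_key_bytes(&Value::Text("ab".to_string()), &mut buf);
    assert_eq!(buf, [0x06, 2, 0, 0, 0, b'a', b'b']);
}
```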

GroupEntry

Each unique group key maps to a GroupEntry:

struct GroupEntry {
    key_values: Vec<Value>,        // GROUP BY expression results (for output)
    non_agg_col_values: Vec<Value>, // non-aggregate SELECT cols (for HAVING/output)
    accumulators: Vec<AggAccumulator>,
}

non_agg_col_values is a sparse slice: only columns referenced by non-aggregate SELECT items or HAVING expressions are stored. Their indices are pre-computed once (compute_non_agg_col_indices) before the scan loop and reused for every group.

⚙️
Design Decision — non_agg_col_values vs representative_row Earlier versions stored representative_row: Row — the full first source row per group — to resolve HAVING column references. This costs one full Vec<Value> clone per group, regardless of how many columns HAVING actually needs. non_agg_col_values stores only the columns referenced by non-aggregate SELECT items and HAVING, computed once before the scan loop. For a 6-column table where HAVING references 1 column, this reduces per-group memory by ~83%.

Aggregate Accumulators

| Aggregate | Accumulator | NULL behavior |
|---|---|---|
| COUNT(*) | u64 counter | Increments for every row |
| COUNT(col) | u64 counter | Skips rows where col is NULL |
| SUM(col) | Option<Value> | Skips NULL; None if all rows are NULL |
| MIN(col) | Option<Value> | Skips NULL; tracks running minimum |
| MAX(col) | Option<Value> | Skips NULL; tracks running maximum |
| AVG(col) | (sum: Value, count: u64) | Skips NULL; final = sum / count as Real |

AVG always returns Real (SQL standard), even for integer columns. This avoids integer truncation (MySQL-style AVG(INT) returns DECIMAL but truncates in many contexts). AVG of all-NULL rows returns NULL.

Fast-path arithmetic (value_agg_add): For SUM, MIN, MAX, and COUNT, the accumulator is updated via direct arithmetic on Value variants, bypassing eval(). This eliminates the expression evaluator overhead for the innermost loop of the aggregate scan.

Ungrouped Aggregates

SELECT COUNT(*) FROM t (no GROUP BY) is handled as a single-group query with an empty key. Even on an empty table, the executor emits exactly one output row: 0 for COUNT(*), NULL for SUM/MIN/MAX/AVG. This matches the SQL standard and every major database.

Column Decode Mask

Before scanning, collect_expr_columns walks all expressions in SELECT items, WHERE, GROUP BY, HAVING, and ORDER BY to build a Vec<bool> mask indexed by column position. Only columns with mask[i] == true are decoded from the row bytes. For a SELECT age, AVG(score) FROM users GROUP BY age query on a 6-column table, this skips decoding name and email (TEXT fields) entirely.

The mask is forwarded to scan_clustered_table_masked as Option<&[bool]> and passed into decode_row_masked at the codec level, which skips variable-length fields that are not needed.
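The mask construction reduces to marking referenced column positions. collect_expr_columns is the real collector that walks expressions; build_decode_mask below is an illustrative stand-in taking already-collected indices:

```rust
// Illustrative: build the Vec<bool> decode mask from the set of column
// indices referenced anywhere in SELECT / WHERE / GROUP BY / HAVING / ORDER BY.
fn build_decode_mask(num_cols: usize, referenced: &[usize]) -> Vec<bool> {
    let mut mask = vec![false; num_cols];
    for &c in referenced {
        mask[c] = true; // this column must be decoded from the row bytes
    }
    mask
}

fn main() {
    // SELECT age, AVG(score) ... GROUP BY age on a 6-column table where
    // age and score sit at (hypothetical) positions 1 and 4: the remaining
    // columns, including TEXT fields, are never decoded.
    let mask = build_decode_mask(6, &[1, 4]);
    assert_eq!(mask, [false, true, false, false, true, false]);
}
```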


GROUP BY — Sorted Streaming Executor (Phase 4.9b)

The sorted executor replaces the hash table with a single linear pass over pre-ordered rows, accumulating state for the current group and emitting it when the key changes.

Algorithm

rows_with_keys = [(row, eval(group_by exprs, row)) for row in combined_rows]

if !presorted:
    stable_sort rows_with_keys by compare_group_key_lists

current_key   = rows_with_keys[0].key_values
current_accumulators = AggAccumulator::new() for each aggregate
update accumulators with rows_with_keys[0].row

for next in rows_with_keys[1..]:
    if group_keys_equal(current_key, next.key_values):
        update accumulators with next.row
    else:
        finalize → apply HAVING → emit output row
        reset: current_key = next.key_values, new accumulators, update

finalize last group

Key Comparison

fn compare_group_key_lists(a: &[Value], b: &[Value]) -> Ordering
fn group_keys_equal(a: &[Value], b: &[Value]) -> bool

Uses compare_values_null_last so NULL == NULL for grouping (consistent with the hash path’s serialization). Comparison is left-to-right: returns the first non-Equal ordering.
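The left-to-right comparison with NULL-equal grouping semantics can be sketched with Option<i64> standing in for Value (None as NULL, sorted last, matching compare_values_null_last):

```rust
use std::cmp::Ordering;

// Sketch: lexicographic comparison of group key lists. NULLs compare
// equal to each other for grouping and sort after non-NULL values.
fn compare_group_key_lists(a: &[Option<i64>], b: &[Option<i64>]) -> Ordering {
    for (x, y) in a.iter().zip(b) {
        let ord = match (x, y) {
            (None, None) => Ordering::Equal,      // NULL == NULL for grouping
            (None, Some(_)) => Ordering::Greater, // NULLs last
            (Some(_), None) => Ordering::Less,
            (Some(x), Some(y)) => x.cmp(y),
        };
        if ord != Ordering::Equal {
            return ord; // first non-Equal ordering wins
        }
    }
    Ordering::Equal
}

fn group_keys_equal(a: &[Option<i64>], b: &[Option<i64>]) -> bool {
    compare_group_key_lists(a, b) == Ordering::Equal
}

fn main() {
    assert!(group_keys_equal(&[Some(1), None], &[Some(1), None]));
    assert_eq!(compare_group_key_lists(&[Some(1)], &[Some(2)]), Ordering::Less);
    assert_eq!(compare_group_key_lists(&[Some(1)], &[None]), Ordering::Less);
}
```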

Shared Aggregate Machinery

Both hash and sorted executors reuse the same:

  • AggAccumulator (state, update, finalize)
  • eval_with_aggs (HAVING evaluation)
  • project_grouped_row (output projection)
  • build_grouped_column_meta (column metadata)
  • GROUP_CONCAT handling
  • Post-group DISTINCT / ORDER BY / LIMIT
🚀
Memory Advantage When the index already orders rows by the GROUP BY key prefix, the sorted executor uses O(1) accumulator memory (one group at a time) instead of O(k) where k = distinct groups. For a high-cardinality column with many distinct values, this eliminates the entire hash table allocation.

ORDER BY — Multi-Column Sort

ORDER BY is applied after scan + filter + aggregation but before projection for non-GROUP BY queries. For GROUP BY queries, it is applied to the projected output rows after remap_order_by_for_grouped rewrites column references.

ORDER BY in GROUP BY Context — Expression Remapping

Grouped output rows are indexed by SELECT output position: position 0 = first SELECT item, position 1 = second, etc. ORDER BY expressions, however, are analyzed against the source schema where Expr::Column { col_idx } refers to the original table column.

remap_order_by_for_grouped fixes this mismatch before calling apply_order_by:

remap_order_by_for_grouped(order_by, select_items):
  for each ORDER BY item:
    rewrite expr via remap_expr_for_grouped(expr, select_items)

remap_expr_for_grouped(expr, select_items):
  if expr == select_items[pos].expr (structural PartialEq):
    return Column { col_idx: pos }   // output position
  match expr:
    BinaryOp → recurse into left, right
    UnaryOp  → recurse into operand
    IsNull   → recurse into inner
    Between  → recurse into expr, low, high
    Function → recurse into args
    other    → return unchanged

This means ORDER BY dept (where dept is Expr::Column{col_idx:1} in the source) becomes Expr::Column{col_idx:0} when the SELECT is SELECT dept, COUNT(*), correctly indexing into the projected output row.

Aggregate expressions like ORDER BY COUNT(*) are matched structurally: if Expr::Function{name:"count", args:[]} appears in the SELECT at position 1, it is rewritten to Expr::Column{col_idx:1}.

⚙️
Design Decision — Expr PartialEq matching Rather than maintaining a separate alias/position resolution table, AxiomDB uses structural PartialEq on Expr (which is derived) to identify ORDER BY expressions that match SELECT items. This is simpler than PostgreSQL's SortClause/TargetEntry reference system and correct for the common cases (column references, aggregates, compound expressions).

NULL Ordering Defaults (PostgreSQL-compatible)

| Direction | Default | Override |
|---|---|---|
| ASC | NULLs LAST | NULLS FIRST |
| DESC | NULLs FIRST | NULLS LAST |

compare_sort_values(a, b, direction, nulls_override):
  nulls_first = explicit_nulls_order OR (DESC && no explicit)
  if a = NULL and b = NULL → Equal
  if a = NULL → Less if nulls_first, else Greater
  if b = NULL → Greater if nulls_first, else Less
  otherwise → compare a and b, reverse if DESC

Non-NULL comparison delegates to eval(BinaryOp{Lt}, Literal(a), Literal(b)) via the expression evaluator, reusing all type coercion and promotion logic.

Error Propagation from sort_by

Rust’s sort_by closure cannot return Result. AxiomDB uses the sort_err pattern: errors are captured in Option<DbError> during the sort and returned after it completes.

let mut sort_err: Option<DbError> = None;
rows.sort_by(|a, b| {
    match compare_rows_for_sort(a, b, order_items) {
        Ok(ord) => ord,
        Err(e)  => { sort_err = Some(e); Ordering::Equal }
    }
});
if let Some(e) = sort_err { return Err(e); }

DISTINCT — Deduplication

SELECT DISTINCT is applied after projection and before LIMIT/OFFSET, using a HashSet<Vec<u8>> keyed by value_to_key_bytes.

fn apply_distinct(rows: Vec<Row>) -> Vec<Row>:
    seen = HashSet::new()
    for row in rows:
        key = concat(value_to_key_bytes(v) for v in row)
        if seen.insert(key):   // first occurrence
            keep row

Two rows are identical if every column value serializes to the same bytes. Critically, NULL → [0x00] means two NULLs are considered equal for deduplication — only one row with a NULL in that position is kept. This is the SQL standard behavior for DISTINCT, and is different from equality comparison where NULL = NULL returns UNKNOWN.
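The pseudocode above, made runnable with rows pre-serialized to byte keys for brevity (the real path serializes each Value via value_to_key_bytes before probing the set):

```rust
use std::collections::HashSet;

// Sketch of apply_distinct: rows arrive paired with their serialized key;
// HashSet::insert returns true only for the first occurrence of a key,
// so the first row with each key is kept and later duplicates are dropped.
fn apply_distinct<T>(rows: Vec<(Vec<u8>, T)>) -> Vec<T> {
    let mut seen = HashSet::new();
    rows.into_iter()
        .filter(|(key, _)| seen.insert(key.clone()))
        .map(|(_, row)| row)
        .collect()
}

fn main() {
    // Two NULLs serialize to the same [0x00] key, so only one survives;
    // the Int row has a distinct key and is kept.
    let rows = vec![(vec![0x00], "a"), (vec![0x00], "b"), (vec![0x02, 1], "c")];
    assert_eq!(apply_distinct(rows), vec!["a", "c"]);
}
```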


LIMIT / OFFSET — Row-Count Coercion (Phase 4.10d)

apply_limit_offset runs after ORDER BY and DISTINCT. It calls eval_row_count_as_usize for each row-count expression.

Row-count coercion contract

| Evaluated value | Result |
|---|---|
| Int(n) where n ≥ 0 | n as usize |
| BigInt(n) where n ≥ 0 | usize::try_from(n) — errors on overflow |
| Text(s) where s.trim() parses as an exact base-10 integer ≥ 0 | parsed value as usize |
| negative Int or BigInt | DbError::TypeMismatch |
| non-integral Text ("10.1", "1e3", "abc") | DbError::TypeMismatch |
| NULL, Bool, Real, Decimal, Date, Timestamp | DbError::TypeMismatch |

Text coercion is intentionally narrow: only exact base-10 integers are accepted. Scientific notation, decimal fractions, and time-like strings are all rejected.

Why Text is accepted

The prepared-statement SQL-string substitution path serializes a Value::Text("2") parameter as LIMIT '2' in the generated SQL. Without Text coercion, the fallback path would always fail for string-bound LIMIT parameters — which is the binding type used by some MariaDB clients. Accepting exact integer Text keeps the cached-AST prepared path and the SQL-string fallback path on identical semantics.
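The Text branch of the contract reduces to a digit-only parse after trimming. text_row_count is an illustrative helper, not the real eval_row_count_as_usize, which also handles the Int/BigInt variants and returns DbError::TypeMismatch:

```rust
// Sketch of the narrow Text coercion: only exact base-10 integers ≥ 0.
// No sign, no fraction, no exponent, no surrounding garbage.
fn text_row_count(s: &str) -> Result<usize, String> {
    let t = s.trim();
    if !t.is_empty() && t.bytes().all(|b| b.is_ascii_digit()) {
        // parse can still fail on usize overflow for absurdly long digit runs
        t.parse::<usize>().map_err(|_| "TypeMismatch".to_string())
    } else {
        Err("TypeMismatch".to_string())
    }
}

fn main() {
    assert_eq!(text_row_count(" 2 "), Ok(2));     // string-bound LIMIT '2'
    assert!(text_row_count("10.1").is_err());     // decimal fraction rejected
    assert!(text_row_count("1e3").is_err());      // scientific notation rejected
    assert!(text_row_count("-1").is_err());       // negative rejected
    assert!(text_row_count("abc").is_err());
}
```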

⚙️
Design Decision AxiomDB does not call the general coerce() function here. coerce() uses assignment-coercion semantics and would change the error class to InvalidCoercion, masking the semantic error. eval_row_count_as_usize implements the narrower 4.10d contract directly in the executor, keeping the error class and message family consistent for both prepared paths.

INSERT … SELECT — MVCC Isolation

INSERT INTO target SELECT ... FROM source executes the SELECT phase under the same snapshot as any other read in the transaction — fixed at BEGIN.

This prevents the “Halloween problem”: rows inserted by this INSERT have txn_id_created = current_txn_id. The snapshot was taken before any insert occurred, so snapshot_id ≤ current_txn_id. The MVCC visibility rule (txn_id_created < snapshot_id) causes newly inserted rows to be invisible to the SELECT scan. The result:

  • If source = target (inserting from a table into itself): the SELECT sees exactly the rows that existed at BEGIN. The inserted copies are not re-scanned. No infinite loop.
  • If another transaction inserts rows into source after this transaction’s BEGIN: those rows are also invisible (consistent snapshot).
Before BEGIN:  source = {row1, row2}
After BEGIN:   snapshot_id = 3  (max_committed = 2)

INSERT INTO source SELECT * FROM source:
  SELECT sees:  {row1 (xmin=1), row2 (xmin=2)} — both have xmin < snapshot_id ✅
  Inserts:      row3 (xmin=3), row4 (xmin=3) — xmin = current_txn_id = 3
  SELECT does NOT see row3 or row4 (xmin ≮ snapshot_id) ✅

After COMMIT:  source = {row1, row2, row3, row4}  ← exactly 2 new rows, not infinite
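The visibility arithmetic above reduces to a one-line predicate. This is an illustrative simplification (real MVCC also consults transaction status and `txn_id_deleted`):

```rust
// Illustrative visibility rule: a row is visible only if it was created
// by a transaction that committed before the snapshot was taken.
fn row_visible(txn_id_created: u64, snapshot_id: u64) -> bool {
    txn_id_created < snapshot_id
}

fn main() {
    let snapshot_id = 3;    // fixed at BEGIN, before any insert
    let current_txn_id = 3;
    // Pre-existing rows (xmin 1 and 2) are visible to the SELECT scan.
    assert!(row_visible(1, snapshot_id));
    assert!(row_visible(2, snapshot_id));
    // Rows inserted by this statement carry xmin = current_txn_id = 3
    // and are invisible to the same statement's scan: no Halloween loop.
    assert!(!row_visible(current_txn_id, snapshot_id));
}
```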

Subquery Execution

Subquery execution is integrated into the expression evaluator via the SubqueryRunner trait. This design lets the compiler eliminate all subquery dispatch overhead from the non-subquery path at compile time.

SubqueryRunner Trait

pub trait SubqueryRunner {
    fn eval_scalar(&mut self, subquery: &SelectStmt) -> Result<Value, DbError>;
    fn eval_in(&mut self, subquery: &SelectStmt, needle: &Value) -> Result<Value, DbError>;
    fn eval_exists(&mut self, subquery: &SelectStmt) -> Result<bool, DbError>;
}

All expression evaluation is dispatched through eval_with<R: SubqueryRunner>:

pub fn eval_with<R: SubqueryRunner>(
    expr: &Expr,
    row: &Row,
    runner: &mut R,
) -> Result<Value, DbError>

Two concrete implementations exist:

| Implementation | Purpose |
|---|---|
| NoSubquery | Used for simple expressions with no subqueries. All three SubqueryRunner methods are unreachable!(). Monomorphization guarantees they are dead code. |
| ExecSubqueryRunner<'a> | Used when the query contains at least one subquery. Holds mutable references to storage, the transaction manager, and the outer row for correlated access. |
⚙️
Design Decision — Generic Trait Monomorphization Using SubqueryRunner as a generic trait parameter — rather than a runtime Option<&mut dyn FnMut> or a boolean flag — allows the compiler to generate two separate code paths: eval_with::<NoSubquery> and eval_with::<ExecSubqueryRunner>. In the NoSubquery path, every subquery branch is dead code and is eliminated by LLVM. A runtime option would add a pointer-width check plus a potential indirect call on every expression node evaluation, even for the 99% of expressions that have no subqueries.

Scalar Subquery Evaluation

ExecSubqueryRunner::eval_scalar executes the inner SelectStmt fully using the existing execute_select path, then inspects the result:

eval_scalar(subquery):
  result = execute_select(subquery, storage, txn)
  match result.rows.len():
    0     → Value::Null
    1     → result.rows[0][0]   // single column, single row
    n > 1 → Err(CardinalityViolation { returned: n })

The inner SELECT is always run with a fresh output context. It inherits the outer transaction snapshot so it sees the same consistent view as the outer query.

IN Subquery Evaluation

eval_in materializes the subquery result into a HashSet<Value>, then applies three-valued logic:

eval_in(subquery, needle):
  rows = execute_select(subquery)
  values: HashSet<Value> = rows.map(|r| r[0]).collect()

  if values.contains(needle):
    return Value::Bool(true)
  if values.contains(Value::Null):
    return Value::Null       // unknown — could match
  return Value::Bool(false)

For NOT IN, the calling code wraps the result: TRUE → FALSE, FALSE → TRUE, NULL → NULL (NULL propagates unchanged).
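The three-valued logic is self-contained enough to sketch directly. The `Value` enum below is a simplified stand-in for the engine's value type:

```rust
use std::collections::HashSet;

// Simplified stand-in for the engine's Value type.
#[derive(PartialEq, Eq, Hash, Debug)]
enum Value {
    Null,
    Int(i64),
    Bool(bool),
}

// IN over a materialized subquery result, with SQL three-valued logic.
fn eval_in(values: &HashSet<Value>, needle: &Value) -> Value {
    if values.contains(needle) {
        return Value::Bool(true);
    }
    if values.contains(&Value::Null) {
        return Value::Null; // unknown — the NULL could have matched
    }
    Value::Bool(false)
}

// NOT IN wraps the result: TRUE ↔ FALSE, NULL stays NULL.
fn eval_not_in(values: &HashSet<Value>, needle: &Value) -> Value {
    match eval_in(values, needle) {
        Value::Bool(b) => Value::Bool(!b),
        other => other,
    }
}

fn main() {
    let vals: HashSet<Value> = [Value::Int(1), Value::Null].into_iter().collect();
    assert_eq!(eval_in(&vals, &Value::Int(1)), Value::Bool(true));
    assert_eq!(eval_in(&vals, &Value::Int(9)), Value::Null); // set contains NULL
    assert_eq!(eval_not_in(&vals, &Value::Int(9)), Value::Null); // NULL propagates
}
```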

EXISTS Evaluation

eval_exists executes the subquery and checks whether the result set is non-empty; the row values themselves are never inspected:

eval_exists(subquery):
  rows = execute_select(subquery)
  return !rows.is_empty()   // always bool, never null

Correlated Subqueries — substitute_outer

Before executing a correlated subquery, ExecSubqueryRunner walks the subquery AST and replaces every Expr::OuterColumn { col_idx, depth: 1 } with a concrete Expr::Literal(value) from the current outer row. This operation is called substitute_outer:

substitute_outer(expr_tree, outer_row):
  for each node in expr_tree:
    if node = OuterColumn { col_idx, depth: 1 }:
      replace with Literal(outer_row[col_idx])
    if node = OuterColumn { col_idx, depth: d > 1 }:
      decrement depth by 1  // pass through for deeper nesting

After substitution, the subquery is a fully self-contained statement with no outer references, and it is executed by the standard execute_select path.

Re-execution happens once per outer row: for a correlated EXISTS in a query that produces 10,000 outer rows, the inner query is executed 10,000 times. For large datasets, rewriting as a JOIN is recommended.
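The substitution walk can be sketched over a toy expression type (the real Expr has many more variants; this illustrates only the literal-replacement and depth-decrement rules):

```rust
// Toy expression type for illustrating substitute_outer.
#[derive(PartialEq, Debug)]
enum Expr {
    Literal(i64),
    OuterColumn { col_idx: usize, depth: u32 },
    Eq(Box<Expr>, Box<Expr>),
}

fn substitute_outer(expr: Expr, outer_row: &[i64]) -> Expr {
    match expr {
        // depth 1: bind to the current outer row's value.
        Expr::OuterColumn { col_idx, depth: 1 } => Expr::Literal(outer_row[col_idx]),
        // deeper nesting: pass through with depth reduced by one.
        Expr::OuterColumn { col_idx, depth } => Expr::OuterColumn { col_idx, depth: depth - 1 },
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(substitute_outer(*l, outer_row)),
            Box::new(substitute_outer(*r, outer_row)),
        ),
        lit => lit,
    }
}

fn main() {
    let e = Expr::Eq(
        Box::new(Expr::OuterColumn { col_idx: 0, depth: 1 }),
        Box::new(Expr::OuterColumn { col_idx: 1, depth: 2 }),
    );
    let out = substitute_outer(e, &[42, 7]);
    assert_eq!(
        out,
        Expr::Eq(
            Box::new(Expr::Literal(42)),
            Box::new(Expr::OuterColumn { col_idx: 1, depth: 1 })
        )
    );
}
```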

Derived Table Execution

A derived table (FROM (SELECT ...) AS alias) is materialized once at the start of query execution, before any scan or filter of the outer query begins:

execute_select(stmt):
  for each TableRef::Derived { subquery, alias } in stmt.from:
    materialized[alias] = execute_select(subquery)   // fully materialized in memory
  // outer query scans materialized[alias] as if it were a base table

The materialized result is an in-memory Vec<Row> wrapped in a MaterializedTable. The outer query uses the derived table’s output schema (column names from the inner SELECT list) for column resolution.

Derived tables are not correlated — they cannot reference columns from the outer query. Lateral joins (which allow correlation in FROM) are not yet supported.


Foreign Key Enforcement

FK constraints are validated during DML operations by crates/axiomdb-sql/src/fk_enforcement.rs.

Catalog Storage

Each FK is stored as a FkDef row in the axiom_foreign_keys heap (5th system table, root page at meta offset 84). Fields:

fk_id, child_table_id, child_col_idx, parent_table_id, parent_col_idx,
on_delete: FkAction, on_update: FkAction, fk_index_id: u32, name: String

FkAction encoding: 0=NoAction, 1=Restrict, 2=Cascade, 3=SetNull, 4=SetDefault. fk_index_id != 0 → FK auto-index exists (composite key, Phase 6.9). fk_index_id = 0 → no auto-index; enforcement falls back to full table scan.

FK auto-index — composite key (fk_val | RecordId) (Phase 6.9)

Each FK constraint auto-creates a B-Tree index on the child FK column using a composite key format that makes every entry globally unique:

key = encode_index_key(&[fk_val]) ++ encode_rid(rid)  (10-byte RecordId suffix)

This follows InnoDB’s approach of appending the primary key as a tiebreaker (row0row.cc). Every entry is unique even when many rows share the same FK value.

Range scan for all children with a given parent key:

lo = encode_index_key(&[parent_key]) ++ [0x00; 10]  // smallest RecordId
hi = encode_index_key(&[parent_key]) ++ [0xFF; 10]  // largest RecordId
children = BTree::range_in(fk_index_root, lo, hi)   // O(log n + k)

INSERT / UPDATE child — check_fk_child_insert

For each FK on the child table:
  1. FK column is NULL → skip (MATCH SIMPLE)
  2. Encode FK value as B-Tree key
  3. Find parent's PK or UNIQUE index covering parent_col_idx
  4. Bloom shortcut: if filter says absent → ForeignKeyViolation immediately
  5. BTree::lookup_in(parent_index_root, key) — O(log n)
  6. No match → ForeignKeyViolation (SQLSTATE 23503)

PK indexes are populated on every INSERT since Phase 6.9 (removed !is_primary filter in insert_into_indexes). All index types now use B-Tree + Bloom lookup.

DELETE parent — enforce_fk_on_parent_delete

Called before the parent rows are deleted. For each FK referencing this table:

| Action | Behavior |
|---|---|
| RESTRICT / NO ACTION | BTree::range_in(fk_index) — O(log n); error if any child found |
| CASCADE | Range scan finds all children; recursive delete (depth limit = 10) |
| SET NULL | Range scan finds all children; updates FK column to NULL |

Cascade recursion uses depth parameter — exceeding 10 levels returns ForeignKeyCascadeDepth (SQLSTATE 23503).

🚀
Performance Advantage Phase 6.9 replaced full table scans with B-Tree range scans for FK enforcement. RESTRICT check: O(log n) vs O(n). CASCADE with 1,000 children: O(log n + 1000) vs O(n × 1000). This follows InnoDB's composite secondary index approach (`dict_foreign_t.foreign_index`) rather than PostgreSQL's trigger-based strategy.

Query Planner Cost Gate (Phase 6.10)

Before returning IndexLookup or IndexRange, plan_select applies a cost gate using per-column statistics to decide if the index scan is worth the overhead.

Algorithm

ndv = stats.ndv > 0 ? stats.ndv : DEFAULT_NUM_DISTINCT (= 200)
selectivity = 1.0 / ndv        // equality predicate: 1/ndv rows match
if selectivity > 0.20:
    return Scan                 // too many rows — full scan is cheaper
if stats.row_count < 1,000:
    return Scan                 // tiny table — index overhead not worth it
return IndexLookup / IndexRange // selective enough — use index

Constants derived from PostgreSQL:

  • INDEX_SELECTIVITY_THRESHOLD = 0.20 (PG default: seq/random_page_cost = 0.25; AxiomDB is slightly more conservative for embedded storage)
  • DEFAULT_NUM_DISTINCT = 200 (PG DEFAULT_NUM_DISTINCT in selfuncs.c)
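The gate condenses into a small pure function. Constant and variant names mirror the text; the function shape itself is illustrative:

```rust
const INDEX_SELECTIVITY_THRESHOLD: f64 = 0.20;
const DEFAULT_NUM_DISTINCT: u64 = 200;
const MIN_ROWS_FOR_INDEX: u64 = 1_000;

#[derive(PartialEq, Debug)]
enum Access { Scan, IndexLookup }

// Illustrative cost gate for an equality predicate on an indexed column.
fn cost_gate(ndv: u64, row_count: u64) -> Access {
    let ndv = if ndv > 0 { ndv } else { DEFAULT_NUM_DISTINCT };
    let selectivity = 1.0 / ndv as f64; // equality: 1/ndv rows match
    if selectivity > INDEX_SELECTIVITY_THRESHOLD {
        return Access::Scan; // too many rows — full scan is cheaper
    }
    if row_count < MIN_ROWS_FOR_INDEX {
        return Access::Scan; // tiny table — index overhead not worth it
    }
    Access::IndexLookup // selective enough — use the index
}

fn main() {
    assert_eq!(cost_gate(2, 1_000_000), Access::Scan);        // 50% selectivity
    assert_eq!(cost_gate(10_000, 500), Access::Scan);         // tiny table
    assert_eq!(cost_gate(10_000, 1_000_000), Access::IndexLookup);
    assert_eq!(cost_gate(0, 1_000_000), Access::IndexLookup); // ndv unknown → 200
}
```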

Stats are loaded once per SELECT

In execute_select_ctx, before calling plan_select:

let table_stats = CatalogReader::new(storage, snap)?.list_stats(table_id)?;
let access_method = plan_select(where_clause, indexes, columns, table_id,
                                &table_stats, &mut ctx.stats);

If table_stats is empty (pre-6.10 database or ANALYZE never run), plan_select conservatively uses the index — never wrong, just possibly suboptimal.

Staleness (StaleStatsTracker)

StaleStatsTracker lives in SessionContext and tracks row changes per table:

INSERT / DELETE row  → on_row_changed(table_id)
changes > 20% of baseline  → mark stale
planner loads stats  → set_baseline(table_id, row_count)
ANALYZE TABLE        → mark_fresh(table_id)

When stale, the planner uses ndv = DEFAULT_NUM_DISTINCT = 200 regardless of catalog stats, preventing stale low-NDV estimates from causing full scans on high-selectivity columns.
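A minimal sketch of the 20% rule, assuming the tracker counts changes against the baseline captured at stats-load time (the API surface is simplified; only the threshold arithmetic is the point):

```rust
use std::collections::HashMap;

// Simplified staleness tracker: row changes vs. a per-table baseline.
struct StaleStatsTracker {
    baseline: HashMap<u32, u64>, // table_id → row_count when stats were loaded
    changes: HashMap<u32, u64>,  // table_id → row changes since baseline
}

impl StaleStatsTracker {
    fn new() -> Self {
        Self { baseline: HashMap::new(), changes: HashMap::new() }
    }
    fn set_baseline(&mut self, table_id: u32, row_count: u64) {
        self.baseline.insert(table_id, row_count);
        self.changes.insert(table_id, 0);
    }
    fn on_row_changed(&mut self, table_id: u32) {
        *self.changes.entry(table_id).or_insert(0) += 1;
    }
    fn is_stale(&self, table_id: u32) -> bool {
        let base = *self.baseline.get(&table_id).unwrap_or(&0);
        let changed = *self.changes.get(&table_id).unwrap_or(&0);
        changed * 5 > base // strictly more than 20% of baseline changed
    }
}

fn main() {
    let mut t = StaleStatsTracker::new();
    t.set_baseline(1, 100);
    for _ in 0..20 { t.on_row_changed(1); }
    assert!(!t.is_stale(1)); // exactly 20% — not yet stale
    t.on_row_changed(1);
    assert!(t.is_stale(1));  // 21% — stale
}
```

Whether the real tracker treats exactly 20% as stale is an assumption here; only the direction of the threshold is taken from the text.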


Bloom Filter — Index Lookup Shortcut

The executor holds a BloomRegistry (one per database connection) that maps index_id → Bloom<Vec<u8>>. Before performing any B-Tree lookup for an index equality predicate, the executor consults the filter:

// In execute_select_ctx — IndexLookup path
if !bloom.might_exist(index_def.index_id, &encoded_key) {
    // Definite absence: skip B-Tree entirely.
    return Ok(vec![]);
}
// False positive or true positive: proceed with B-Tree.
BTree::lookup_in(storage, index_def.root_page_id, &encoded_key)?

BloomRegistry API

pub struct BloomRegistry { /* per-index filters */ }

impl BloomRegistry {
    pub fn create(&mut self, index_id: u32, expected_items: usize);
    pub fn add(&mut self, index_id: u32, key: &[u8]);
    pub fn might_exist(&self, index_id: u32, key: &[u8]) -> bool;
    pub fn mark_dirty(&mut self, index_id: u32);
    pub fn remove(&mut self, index_id: u32);
}

might_exist returns true (conservative) for unknown index_ids — correct behavior for indexes that existed before the current server session (no filter was populated for them at startup).

DML Integration

Every DML handler in the execute_with_ctx path updates the registry:

| Handler | Bloom action |
|---|---|
| execute_insert_ctx | bloom.add(index_id, &key) after each B-Tree insert |
| execute_update_ctx | mark_dirty() for delete side (batch); add() for insert side |
| execute_delete_ctx | mark_dirty(index_id) once per index batch (5.19) |
| execute_create_index | create(index_id, n) then add() for every existing key |
| execute_drop_index | remove(index_id) |

Memory Budget

Each filter is sized at max(2 × expected_items, 1000) with a 1% FPR target (~9.6 bits/key, 7 hash functions). For a 1M-row table with one secondary index: 2M × 9.6 bits ≈ 2.4 MB.
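The figures follow from standard Bloom filter sizing math; a quick check (the formulas are textbook, not the registry's actual sizing code):

```rust
use std::f64::consts::LN_2;

// Standard Bloom sizing: bits/key m/n = -ln(p) / ln(2)^2,
// optimal hash count k = (m/n) * ln(2).
fn bloom_params(fpr: f64) -> (f64, u32) {
    let bits_per_key = -fpr.ln() / (LN_2 * LN_2);
    let k = (bits_per_key * LN_2).ceil() as u32;
    (bits_per_key, k)
}

fn main() {
    let (bits_per_key, k) = bloom_params(0.01);
    assert!((bits_per_key - 9.585).abs() < 0.01); // ≈ 9.6 bits/key at 1% FPR
    assert_eq!(k, 7);
    // 1M-row table, filter sized at 2 × expected items:
    let bytes = 2_000_000.0 * bits_per_key / 8.0;
    assert!((bytes / 1_000_000.0 - 2.4).abs() < 0.01); // ≈ 2.4 MB
}
```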

⚙️
Design Decision — Standard Bloom, Not Counting A standard (non-counting) Bloom filter is used instead of a counting variant. Deleted keys cannot be removed — the filter is marked dirty instead. This avoids the 4× memory overhead of counting Bloom filters (used by Apache Cassandra and some RocksDB SST configurations) while maintaining full correctness: dirty filters produce more false positives but never false negatives. Reconstruction is deferred to ANALYZE TABLE (Phase 6.12), mirroring PostgreSQL's lazy statistics-rebuild model.

IndexOnlyScan — Heap-Free Execution

When plan_select returns AccessMethod::IndexOnlyScan, the executor reads all result values directly from the B-Tree key bytes, with only a lightweight MVCC visibility check against the heap slot header.

This section applies to the heap executor path. Since 39.15, clustered tables do not execute this path directly even if the planner initially detects a covering opportunity. Clustered covering plans are normalized back to clustered-aware lookup/range access, because clustered visibility lives in the inline row header and clustered secondary indexes carry PK bookmarks instead of stable heap RecordIds.

⚙️
Design Decision — Heap-Only IndexOnlyScan PostgreSQL can keep one logical access method because every index probe still lands on a heap TID. Clustered storage breaks that assumption. AxiomDB keeps `IndexOnlyScan` as a heap optimization and routes clustered reads through the clustered tree until a real clustered covering-read path exists.

Clustered UPDATE (39.16)

Clustered tables no longer fall back to heap-era UPDATE logic. The executor now routes explicit-PRIMARY KEY tables through clustered candidate discovery and clustered rewrite primitives:

  1. discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
  2. capture the exact old clustered row image (RowHeader + full logical row bytes) before any mutation
  3. choose one of three clustered write paths:
    • same-key in-place rewrite via clustered_tree::update_in_place(...)
    • same-key relocation via clustered_tree::update_with_relocation(...)
    • PK change via delete_mark(old_pk) + insert(new_pk, ...)
  4. rewrite clustered secondary bookmark entries and register both index-insert and index-delete undo records so rollback can restore the old bookmark state
⚙️
Design Decision — InnoDB-Style Branching MariaDB/InnoDB splits UPDATE into clustered in-place vs. delete+insert depending on whether index-ordering columns change. AxiomDB now applies the same decision tree to clustered SQL UPDATE, but stores rollback state as exact row images keyed by primary key instead of heap-era slot addresses.

Clustered DELETE (39.17)

Clustered tables no longer fall back to heap-era DELETE logic either. The executor now routes explicit-PRIMARY KEY tables through clustered candidate discovery and clustered delete-mark primitives:

  1. discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
  2. decode the exact current clustered row image before any mutation
  3. enforce parent-side foreign-key restrictions before the first delete-mark
  4. call clustered_tree::delete_mark(...) for each matched primary key
  5. record EntryType::ClusteredDeleteMark with the exact old and new row images so rollback/savepoints restore the original header and payload bytes
  6. leave clustered secondary bookmark entries in place for deferred cleanup during later clustered VACUUM work
⚙️
Design Decision — Delete Mark Before Purge InnoDB delete-marks clustered rows and purges them later; PostgreSQL also leaves tuple cleanup to VACUUM. AxiomDB now exposes the same split at SQL level: clustered DELETE changes visibility now, while physical removal and secondary cleanup stay in Phase 39.18.

Clustered VACUUM (39.18)

Clustered tables now have their own executor-visible maintenance path too. VACUUM table_name dispatches by table storage layout:

  1. compute oldest_safe_txn = max_committed + 1
  2. descend once to the leftmost clustered leaf
  3. walk the next_leaf chain and remove cells whose txn_id_deleted is safe
  4. free any overflow chain owned by each purged cell
  5. defragment the leaf when freeblock waste exceeds the page-local threshold
  6. scan each clustered secondary index, decode the PK bookmark from the physical secondary key, and keep only entries whose clustered row still exists physically after the leaf purge
  7. persist any secondary root rotation caused by bulk delete back into the catalog
⚙️
Design Decision — Purge Is Not Visibility Using snapshot visibility directly for clustered secondary cleanup is wrong: an uncommitted delete is invisible to the writer snapshot but still owns its bookmark physically. `39.18` therefore cleans secondaries by clustered physical existence after purge, not by `lookup(..., snapshot)`.

Execution Path

IndexOnlyScan { index_def, lo, hi, n_key_cols, needed_key_positions }:

for (rid, key_bytes) in BTree::range_in(storage, index_def.root_page_id, lo, hi):
  page_id = rid.page_id
  slot_id = rid.slot_id

  // MVCC: read only the 24-byte RowHeader — no full row decode.
  visible = HeapChain::is_slot_visible(storage, page_id, slot_id, snap)
  if !visible:
    continue

  // Extract column values from B-Tree key bytes (no heap page needed).
  (decoded_cols, _) = decode_index_key(&key_bytes, n_key_cols)

  // Project only the columns the query requested.
  row = needed_key_positions.iter().map(|&p| decoded_cols[p].clone()).collect()
  emit row

The 24-byte RowHeader contains txn_id_created, txn_id_deleted, and a sequence number — enough for full MVCC visibility evaluation without loading the row payload.

decode_index_key — Self-Delimiting Key Decoder

decode_index_key lives in key_encoding.rs and is the exact inverse of encode_index_key. It uses type tags embedded in the key bytes to self-delimit each value without needing an external schema:

| Tag byte | Type | Encoding |
|---|---|---|
| 0x00 | NULL | tag only, 0 payload bytes |
| 0x01 | Bool | tag + 1 byte (0 = false, 1 = true) |
| 0x02 | Int (positive, 1 B) | tag + 1 LE byte |
| 0x03 | Int (positive, 2 B) | tag + 2 LE bytes |
| 0x04 | Int (positive, 4 B) | tag + 4 LE bytes |
| 0x05 | Int (negative, 4 B) | tag + 4 LE bytes (i32) |
| 0x06 | BigInt (positive, 1 B) | tag + 1 byte |
| 0x07 | BigInt (positive, 4 B) | tag + 4 LE bytes |
| 0x08 | BigInt (positive, 8 B) | tag + 8 LE bytes |
| 0x09 | BigInt (negative, 8 B) | tag + 8 LE bytes (i64) |
| 0x0A | Real | tag + 8 LE bytes (f64 bits) |
| 0x0B | Text | tag + NUL-terminated UTF-8 (NUL = end marker) |
| 0x0C | Bytes | tag + NUL-escaped bytes (0x00 → [0x00, 0xFF], NUL terminator = [0x00, 0x00]) |
// Signature
pub fn decode_index_key(key: &[u8], n_cols: usize) -> (Vec<Value>, usize)
// Returns: (decoded column values, total bytes consumed)

The self-delimiting format means decode_index_key requires no column type metadata — the tag bytes carry all necessary type information. This is the same approach used by SQLite’s record format and RocksDB’s comparator-encoded keys.
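A miniature decoder for a handful of the tags makes the self-delimiting idea concrete. The Value type is simplified, only four tags are handled, and it panics on the rest — an illustration, not the real implementation:

```rust
// Simplified stand-in for the engine's Value type.
#[derive(PartialEq, Debug)]
enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Text(String),
}

// Handles tags 0x00 (NULL), 0x01 (Bool), 0x04 (positive 4-byte Int),
// 0x0B (NUL-terminated Text). Each tag tells us how many bytes follow.
fn decode_index_key(key: &[u8], n_cols: usize) -> (Vec<Value>, usize) {
    let mut vals = Vec::with_capacity(n_cols);
    let mut pos = 0;
    for _ in 0..n_cols {
        let tag = key[pos];
        pos += 1;
        match tag {
            0x00 => vals.push(Value::Null),
            0x01 => {
                vals.push(Value::Bool(key[pos] == 1));
                pos += 1;
            }
            0x04 => {
                let b: [u8; 4] = key[pos..pos + 4].try_into().unwrap();
                vals.push(Value::Int(u32::from_le_bytes(b) as i64));
                pos += 4;
            }
            0x0B => {
                // Text payload runs to the NUL end marker.
                let end = pos + key[pos..].iter().position(|&b| b == 0).unwrap();
                vals.push(Value::Text(String::from_utf8(key[pos..end].to_vec()).unwrap()));
                pos = end + 1; // skip the terminator
            }
            other => panic!("tag {other:#04x} not handled in this sketch"),
        }
    }
    (vals, pos)
}

fn main() {
    // Key bytes for [Int(300), Text("ab"), NULL]:
    let key = [&[0x04][..], &300u32.to_le_bytes(), &[0x0B, b'a', b'b', 0x00, 0x00]].concat();
    let (vals, used) = decode_index_key(&key, 3);
    assert_eq!(vals, vec![Value::Int(300), Value::Text("ab".into()), Value::Null]);
    assert_eq!(used, key.len());
}
```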

Full-Width Row Layout in IndexOnlyScan Output

IndexOnlyScan emits full-width rows — the same width as a heap row — with index key column values placed at their table col_idx positions and NULL everywhere else. This is required because downstream operators (WHERE re-evaluation, projection, expression evaluator) all address columns by their original table column index, not by SELECT output position.

table: (id INT [0], name TEXT [1], age INT [2], dept TEXT [3])
index: ON (age, dept)  ← covers col_idx 2 and 3

IndexOnlyScan emits: [NULL, NULL, <age_val>, <dept_val>]
                      col0  col1    col2         col3

If the executor placed decoded values at positions 0, 1, ... instead, a WHERE age > 25 re-evaluation would read col_idx=2 from a 2-element row and panic with ColumnIndexOutOfBounds. The full-width layout eliminates this class of error entirely.
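The placement rule itself is small enough to sketch directly, with values simplified to `Option<i64>` (None standing in for SQL NULL):

```rust
// Build a full-width row from decoded index key columns: each decoded
// value lands at its original table col_idx; everything else is NULL.
fn full_width_row(
    table_width: usize,
    key_col_idxs: &[usize],    // table col_idx covered by each key position
    decoded: Vec<Option<i64>>, // decoded key values, in key order
) -> Vec<Option<i64>> {
    let mut row = vec![None; table_width];
    for (key_pos, &col_idx) in key_col_idxs.iter().enumerate() {
        row[col_idx] = decoded[key_pos];
    }
    row
}

fn main() {
    // table: (id[0], name[1], age[2], dept[3]); index ON (age, dept)
    let row = full_width_row(4, &[2, 3], vec![Some(30), Some(7)]);
    assert_eq!(row, vec![None, None, Some(30), Some(7)]);
    // WHERE age > 25 re-evaluation reads col_idx 2 without going out of bounds.
    assert!(row[2].unwrap() > 25);
}
```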

execute_with_ctx — Required for IndexOnlyScan Selection

The planner selects IndexOnlyScan only when select_col_idxs (the set of columns touched by the query) is a subset of the index’s key columns. The select_col_idxs argument is supplied by execute_with_ctx; the simpler execute entry-point passes an empty slice, so IndexOnlyScan is never selected through it.

Test coverage for this path lives in crates/axiomdb-sql/tests/integration_index_only.rs — functions prefixed test_ctx_ use execute_with_ctx with real select_col_idxs and are the only tests that exercise the IndexOnlyScan access method end-to-end.


Non-Unique Secondary Index Key Format

Non-unique secondary indexes append a 10-byte RecordId suffix to every B-Tree key to guarantee uniqueness across all entries:

key = encode_index_key(col_vals) || encode_rid(rid)
                                    ^^^^^^^^^^^^^^
                                    page_id (4 B) + slot_id (2 B) + seq (4 B) = 10 bytes

This prevents DuplicateKey errors when two rows share the same indexed value, because the RecordId suffix always makes the full key distinct.

Lookup Bounds for Non-Unique Indexes

To find all rows matching a specific indexed value, the executor performs a range scan using synthetic [lo, hi] bounds that span all possible RecordId suffixes:

lo = encode_index_key(&[val]) + [0x00; 10]   // smallest RecordId
hi = encode_index_key(&[val]) + [0xFF; 10]   // largest RecordId
BTree::range_in(root, lo, hi)                // returns all entries for val
⚙️
Design Decision — InnoDB Secondary Index Approach MySQL InnoDB secondary indexes append the primary key as a tiebreaker in every non-unique B-Tree entry (row0row.cc). AxiomDB uses RecordId (page_id + slot_id + sequence) instead of a separate primary key column, keeping the suffix at a fixed 10 bytes regardless of the table's key type — simpler to encode and guaranteed to be globally unique within the storage engine's address space.

Performance Characteristics

| Operation | Time complexity | Notes |
|---|---|---|
| Table scan | O(n) | HeapChain linear traversal |
| Nested loop JOIN | O(n × m) | Both sides materialized before loop |
| Hash GROUP BY | O(n) | One pass; O(k) memory where k = distinct groups |
| Sorted GROUP BY | O(n) | One pass; O(1) accumulator memory per group |
| Sort ORDER BY | O(n log n) | sort_by (stable, in-memory) |
| DISTINCT | O(n) | One HashSet pass |
| LIMIT/OFFSET | O(1) after sort | skip(offset).take(limit) |

All operations are in-memory for Phase 4. External sort and hash spill for large datasets are planned for Phase 14 (vectorized execution).


AUTO_INCREMENT Execution

Per-Table Sequence State

Each table that has an AUTO_INCREMENT column maintains a sequence counter. The counter is stored as a thread-local HashMap<String, i64> keyed by table name, lazily initialized on the first INSERT:

auto_increment_next(table_name):
  if table_name not in thread_local_map:
    max_existing = MAX(id) from HeapChain scan, or 0 if table is empty
    thread_local_map[table_name] = max_existing + 1
  value = thread_local_map[table_name]
  thread_local_map[table_name] += 1
  return value

The MAX+1 lazy-init strategy means the sequence is always consistent with existing data, even after rows are inserted by a previous session or after a crash recovery.
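A minimal sketch of the thread-local counter, with the heap scan stubbed out as a closure (the function shape is illustrative; the real executor performs the MAX scan internally):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // table name → next AUTO_INCREMENT value
    static AUTO_INC: RefCell<HashMap<String, i64>> = RefCell::new(HashMap::new());
}

// `max_existing` stands in for the MAX(id) HeapChain scan; it runs
// only on first use of the table (lazy init).
fn auto_increment_next(table: &str, max_existing: impl Fn() -> i64) -> i64 {
    AUTO_INC.with(|m| {
        let mut m = m.borrow_mut();
        let next = m.entry(table.to_string()).or_insert_with(|| max_existing() + 1);
        let value = *next;
        *next += 1;
        value
    })
}

fn main() {
    // Table already holds rows with MAX(id) = 41 (simulated scan result).
    assert_eq!(auto_increment_next("t", || 41), 42);
    assert_eq!(auto_increment_next("t", || 41), 43); // scan not repeated
    // TRUNCATE removes the entry; the now-empty table re-inits from MAX = 0.
    AUTO_INC.with(|m| { m.borrow_mut().remove("t"); });
    assert_eq!(auto_increment_next("t", || 0), 1);
}
```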

⚙️
Design Decision — Thread-Local vs Per-Session State The sequence counter is stored in thread-local storage rather than attached to a session object. Phase 4 uses a single-threaded executor, so thread-local and session-local are equivalent. This avoids the complexity of a session handle threading through every call site. When Phase 7 introduces concurrent sessions, the counter will migrate to per-session state. The lazy-init from MAX+1 is compatible with either approach.

Explicit Value Bypass

When the INSERT column list includes the AUTO_INCREMENT column with a non-NULL value, the explicit value is used directly and the sequence counter is not advanced:

for each row to insert:
  if auto_increment_col in provided_columns:
    value = provided value   // bypass — no counter update
  else:
    value = auto_increment_next(table_name)
    session.last_insert_id = value   // update only for generated IDs

LAST_INSERT_ID() is updated only when a value is auto-generated. Inserting an explicit ID does not change the session’s last_insert_id.

Multi-Row INSERT

For INSERT INTO t VALUES (...), (...), ..., the executor calls auto_increment_next once per row. last_insert_id is set to the value generated for the first row before iterating through the rest:

ids = [auto_increment_next(t) for _ in rows]
session.last_insert_id = ids[0]   // MySQL semantics
insert all rows with their respective ids

TRUNCATE — Sequence Reset

TRUNCATE TABLE t deletes all rows by scanning the HeapChain and marking every visible row as deleted (same algorithm as DELETE FROM t without a WHERE clause). After clearing the rows, it resets the sequence:

execute_truncate(table_name):
  for row in HeapChain::scan_visible(table_name, snapshot):
    storage.delete_row(row.record_id, txn_id)
  thread_local_map.remove(table_name)   // next insert re-initializes from MAX+1 = 1
  return QueryResult::Affected { count: 0 }

Removing the entry from the map forces a MAX+1 re-initialization on the next INSERT. Because the table is now empty, MAX = 0, so next = 1.


SHOW TABLES / SHOW COLUMNS

SHOW TABLES

SHOW TABLES [FROM schema] reads the catalog’s table registry and returns one row per table. The output column is named Tables_in_<schema>:

execute_show_tables(schema):
  tables = catalog.list_tables(schema)
  column_name = "Tables_in_" + schema
  return QueryResult::Rows { columns: [column_name], rows: [[t] for t in tables] }

SHOW COLUMNS / DESCRIBE

SHOW COLUMNS FROM t, DESCRIBE t, and DESC t are all dispatched to the same handler. The executor reads the column definitions from the catalog and constructs a fixed six-column result set:

execute_show_columns(table_name):
  cols = catalog.get_table(table_name).columns
  for col in cols:
    Field   = col.name
    Type    = col.data_type.to_sql_string()
    Null    = if col.nullable { "YES" } else { "NO" }
    Key     = if col.is_primary_key { "PRI" } else { "" }
    Default = "NULL"   // stub
    Extra   = if col.auto_increment { "auto_increment" } else { "" }
  return six-column result set

The Key and Default fields are stubs: Key only reflects primary key membership; composite keys, unique constraints, and foreign keys are not yet surfaced. Default always shows "NULL" regardless of the declared default expression. Full metadata exposure is planned for a later catalog enhancement.


ALTER TABLE Execution

ALTER TABLE dispatches to one of five handlers depending on the operation. Three of them (ADD COLUMN, DROP COLUMN, and MODIFY COLUMN) require rewriting every row in the table. The other two (RENAME COLUMN and RENAME TO) touch only the catalog.

Why Row Rewriting Is Needed

AxiomDB rows are stored as positional binary blobs. The null bitmap at the start of each row has exactly ceil(column_count / 8) bytes — one bit per column, in column-index order. Packed values follow immediately, with offsets derived from the column types declared at write time.

Row layout (schema: id BIGINT, name TEXT, age INT):

  null_bitmap (1 byte)   [b0=id_null, b1=name_null, b2=age_null, ...]
  id   (8 bytes, LE i64) [only present if b0=0]
  name (4-byte len + UTF-8 bytes) [only present if b1=0]
  age  (4 bytes, LE i32) [only present if b2=0]

When the column count changes, the null bitmap size changes and all subsequent offsets shift. A row written under the old schema cannot be decoded against the new schema — the null bitmap has the wrong number of bits, and value positions no longer align. Every row must therefore be rewritten to match the new layout.
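A simplified encoder makes the bitmap-size dependency concrete. Values are reduced to fixed 8-byte integers here; the real format packs per-type encodings (length-prefixed text, 4-byte ints, and so on):

```rust
// Simplified positional row encoding: null bitmap, then packed values.
fn encode_row(values: &[Option<i64>]) -> Vec<u8> {
    let bitmap_len = (values.len() + 7) / 8; // ceil(column_count / 8)
    let mut out = vec![0u8; bitmap_len];
    for (i, v) in values.iter().enumerate() {
        if v.is_none() {
            out[i / 8] |= 1 << (i % 8); // set the null bit for column i
        }
    }
    for v in values.iter().flatten() {
        out.extend_from_slice(&v.to_le_bytes()); // present values only
    }
    out
}

fn main() {
    // 3 columns → 1 bitmap byte; (42, NULL, 7) packs two 8-byte values.
    let row = encode_row(&[Some(42), None, Some(7)]);
    assert_eq!(row.len(), 1 + 2 * 8);
    assert_eq!(row[0], 0b0000_0010); // only column 1's null bit is set
    // A 9th column grows the bitmap to 2 bytes and shifts every value
    // offset — which is why ALTER TABLE must rewrite each row.
    assert_eq!(encode_row(&[None; 9]).len(), 2);
}
```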

RENAME COLUMN does not change column positions or types — only the name entry in the catalog changes. RENAME TO changes only the table name in the catalog. Neither operation touches row data.

rewrite_rows Helper

ADD COLUMN, DROP COLUMN, and MODIFY COLUMN all use a shared rewrite_rows dispatch. The implementation branches on storage format:

Heap tables:

rewrite_rows(table_name, old_schema, new_schema, transform_fn):
  snapshot = txn.active_snapshot()
  old_rows = HeapChain::scan_visible(table_name, snapshot)

  for (record_id, old_row) in old_rows:
    new_row = transform_fn(old_row)?   // apply per-operation transformation
    storage.delete_row(record_id, txn_id)
    storage.insert_row(table_name, encode_row(new_row, new_schema), txn_id)

Clustered tables (rewrite_rows_clustered):

Clustered tables cannot use heap delete+reinsert because clustered_tree::insert rejects duplicate primary keys even when the previous row is delete-marked. Instead, each row is rewritten in place using update_with_relocation:

rewrite_rows_clustered(table_id, old_schema, new_schema, transform_fn):
  snapshot = txn.active_snapshot()
  rows = clustered_tree::range(table_id, Unbounded, Unbounded, snapshot)

  for ClusteredRow { key, row_header, row_data } in rows:
    old_row = decode_row(row_data, old_schema)
    new_row = transform_fn(old_row)?
    new_data = encode_row(new_row, new_schema)
    txn.record_clustered_update(table_id, key, row_header+row_data, new_data)
    new_root = clustered_tree::update_with_relocation(key, new_data)
    if let Some(new_root_pid) = new_root {
        catalog.set_root_page(table_id, new_root_pid)
    }

update_with_relocation tries an in-place rewrite of the leaf slot. If the new row is larger and the leaf page is full, it falls back to physical delete + reinsert at the correct leaf position (no duplicate-key issue because the old entry is physically removed before the new one is inserted).

The transform_fn is operation-specific and returns Result<Row, DbError> so coercion failures abort the entire statement:

| Operation | transform_fn |
|---|---|
| ADD COLUMN | Append DEFAULT value (or NULL if no default) to the end of the row |
| DROP COLUMN | Remove the value at col_idx from the row vector |
| MODIFY COLUMN | Replace value at col_idx with coerce(value, new_type, Strict)? |
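The three transforms, sketched over a simplified Row type (closure shapes and the error type are illustrative, not the real signatures):

```rust
// Simplified row: one Option<i64> per column (None = NULL).
type Row = Vec<Option<i64>>;

// ADD COLUMN: append the DEFAULT value, or NULL if there is none.
fn add_column(mut row: Row, default: Option<i64>) -> Row {
    row.push(default);
    row
}

// DROP COLUMN: remove the value at col_idx.
fn drop_column(mut row: Row, col_idx: usize) -> Row {
    row.remove(col_idx);
    row
}

// MODIFY COLUMN: strict coercion — a single failing row aborts the statement.
fn modify_column(
    mut row: Row,
    col_idx: usize,
    coerce: impl Fn(i64) -> Result<i64, String>,
) -> Result<Row, String> {
    if let Some(v) = row[col_idx] {
        row[col_idx] = Some(coerce(v)?);
    }
    Ok(row)
}

fn main() {
    let row = vec![Some(1), Some(-2)];
    assert_eq!(add_column(row.clone(), None), vec![Some(1), Some(-2), None]);
    assert_eq!(drop_column(row.clone(), 0), vec![Some(-2)]);
    // Coercion failure propagates immediately; no rows are changed.
    let strict = |v: i64| if v >= 0 { Ok(v) } else { Err("out of range".to_string()) };
    assert!(modify_column(row, 1, strict).is_err());
}
```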

Ordering Constraint — Catalog Before vs. After Rewrite

The ordering of the catalog update relative to the row rewrite is not arbitrary. It is chosen so that a failure mid-rewrite leaves the database in a recoverable state:

ADD COLUMN — catalog update FIRST, then rewrite rows:

1. catalog.add_column(table_name, new_column_def)
2. rewrite_rows(old_schema → new_schema, append DEFAULT)

If the process crashes after step 1 but before step 2 completes, the catalog already reflects the new schema. The partially-rewritten rows are discarded by crash recovery (their transactions are uncommitted). On restart, the table is consistent: the new column exists in the catalog, and all rows either have been fully rewritten (if the transaction committed) or none have been (if it was rolled back).

DROP COLUMN — rewrite rows FIRST, then update catalog:

1. rewrite_rows(old_schema → new_schema, remove col at col_idx)
2. catalog.remove_column(table_name, col_idx)

If the process crashes after step 1 but before step 2, the rows have already been written in the new (narrower) layout but the catalog still shows the old schema. Recovery rolls back the uncommitted row rewrites and the catalog is never touched — the table is fully consistent under the old schema.

MODIFY COLUMN — rewrite rows FIRST (with strict coercion), then update catalog:

1. Guard: column not in secondary index (type change would break key encoding)
2. Guard: PK column cannot become nullable on clustered table
3. rewrite_rows(old_schema → new_schema, coerce(val, new_type, Strict)?)
4. catalog.delete_column(table_id, col_idx)
5. catalog.create_column(new_ColumnDef)  // same col_idx, new type/nullable

If coercion fails for any row (e.g. TEXT → INT on a non-numeric value), the error is returned immediately and no rows are changed. The statement is atomic: either all rows are coerced successfully or none are.

The invariant is: the catalog always describes rows that can be decoded. Swapping the order for either operation would create a window where the catalog describes a schema that does not match the on-disk rows.

⚙️
Design Decision — Asymmetric Catalog Ordering ADD COLUMN updates the catalog before rewriting rows; DROP COLUMN and MODIFY COLUMN rewrite rows before updating the catalog. The direction is chosen so that a mid-operation crash always leaves the catalog consistent with whatever rows are on disk — partial rewrites are uncommitted transactions invisible to crash recovery. This mirrors the ordering used in PostgreSQL's heap rewrite path for ALTER TABLE.
⚙️
Design Decision — Clustered DDL Uses update_with_relocation Clustered tables cannot use heap-style delete+reinsert during row rewrites because clustered_tree::insert rejects duplicate primary keys even when the previous entry is delete-marked. Instead, rewrite_rows_clustered uses update_with_relocation: it rewrites the leaf slot in place, falling back to physical relocate-and-reinsert only when the new row is larger and the leaf has no room. This avoids the duplicate-key restriction entirely and keeps the PK-keyed B+ tree consistent throughout the rewrite.

Session Cache Invalidation

The session holds a SchemaCache that maps table names to their column definitions at the time the last query was prepared. After any ALTER TABLE operation completes, the cache entry for the affected table is invalidated:

execute_alter_table(stmt):
  // ... perform operation (catalog update + optional row rewrite) ...
  session.schema_cache.invalidate(table_name)

This ensures that the next query against the altered table re-reads the catalog and sees the updated column list, rather than operating on a stale schema that may reference columns that no longer exist or omit newly added ones.
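A minimal sketch of the invalidation contract, assuming a HashMap-backed cache (the real SchemaCache stores full column definitions and more, not just names):

```rust
use std::collections::HashMap;

// Simplified stand-in for the session's SchemaCache.
struct SchemaCache {
    tables: HashMap<String, Vec<String>>, // table name → cached column names
}

impl SchemaCache {
    fn new() -> Self {
        SchemaCache { tables: HashMap::new() }
    }

    // Called after every ALTER TABLE: drop the entry so the next query
    // against this table re-resolves the schema from the catalog.
    fn invalidate(&mut self, table: &str) {
        self.tables.remove(table);
    }

    fn insert(&mut self, table: &str, cols: Vec<String>) {
        self.tables.insert(table.to_string(), cols);
    }

    // A miss means the caller must read the catalog (the source of truth).
    fn get(&self, table: &str) -> Option<&Vec<String>> {
        self.tables.get(table)
    }
}
```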

Index root invalidation on B+tree split

The SchemaCache also stores IndexDef.root_page_id for each index. When an INSERT causes the B+tree root to split, insert_in allocates a new root page and frees the old one. After this, the cached root_page_id points to a freed page. If the cache is not invalidated, the next execute_insert_ctx call reads IndexDef.root_page_id from the cache and passes it to BTree::lookup_in (uniqueness check), causing a stale-pointer read on a freed page.

The fix: call ctx.invalidate_all() whenever any index root changes during INSERT or DELETE index maintenance. This forces re-resolution from the catalog (which always has the current root_page_id) on the next DML statement.

Since 5.19, DELETE and the old-key half of UPDATE no longer mutate indexes in a per-row loop. They collect exact encoded keys per index, sort them, and call delete_many_in(...) once per affected tree. The cache-invalidation rule still matters, but the synchronization point moved:

  1. batch-delete old keys per index
  2. persist the final root once for that index
  3. update the in-memory current_indexes slice
  4. invalidate the session cache once after the statement

For UPDATE there is a second root-sync point: after the batch delete phase, the reinsertion half must start from the post-delete root, not from the stale root captured before the batch. Otherwise reinserting new keys after a root collapse would descend from a freed page.

#![allow(unused)]
fn main() {
// DELETE / UPDATE old-key batch
let updated = delete_many_from_indexes(...)?;
for (index_id, new_root) in updated {
    catalog.update_index_root(index_id, new_root)?;
    // Sync the in-memory slice so later phases descend from the new root.
    if let Some(def) = current_indexes.iter_mut().find(|d| d.index_id == index_id) {
        def.root_page_id = new_root;
    }
}

// UPDATE new-key insert phase
let ins_updated = insert_into_indexes(&current_indexes, ...)?;
}
⚙️
Borrowed Bulk-Delete Principle
5.19 follows the same high-level rule used by PostgreSQL's nbtree VACUUM path and InnoDB bulk helpers: when many exact keys from one index are already known, delete them page-locally in one ordered pass instead of re-entering the point-delete path N times.

Stable-RID UPDATE Fast Path (5.20)

5.19 removed the old-key delete bottleneck, but UPDATE still paid the full heap delete+insert path even when the new row could fit in the existing slot. 5.20 adds a second branch:

for each matched row:
  old_row = ...
  new_row = apply_set_assignments(old_row)

  if encoded(new_row) fits in old slot:
      rewrite tuple in place
      rid stays identical
      only maintain indexes whose logical key/predicate membership changed
  else:
      fallback to delete + insert
      rid changes
      treat affected indexes as before

The heap rewrite path is page-grouped. Rows that share a heap page are batched so the executor reads the page once, rewrites all eligible slots, then writes the page once. WAL records this branch as EntryType::UpdateInPlace, storing the old and new tuple images for the same (page_id, slot_id).

This does not implement PostgreSQL HOT chains or forwarding pointers. The Phase 5 rule is narrower and cheaper to reason about: same-slot rewrite only, otherwise fall back to the existing delete+insert path.
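The branch condition reduces to a length comparison against the existing slot. A sketch with assumed names (the real check also has to account for header bytes and page layout):

```rust
#[derive(Debug, PartialEq)]
enum UpdateStrategy {
    RewriteInPlace, // RID unchanged; only indexes whose key/membership changed are touched
    DeleteInsert,   // RID changes; affected indexes treated as before
}

// Pick the UPDATE path for one matched row: in-place only when the new
// encoded row still fits in the space reserved by the old slot.
fn choose_update_strategy(new_encoded_len: usize, old_slot_len: usize) -> UpdateStrategy {
    if new_encoded_len <= old_slot_len {
        UpdateStrategy::RewriteInPlace
    } else {
        UpdateStrategy::DeleteInsert
    }
}
```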

⚙️
Borrowed HOT-Lite Rule
PostgreSQL HOT and DuckDB's in-place updates both rely on preserving row identity whenever the storage layout allows it. AxiomDB adapts the same idea without adding version chains: if the new encoded row still fits in the old slot, keep the RID and skip untouched indexes safely.

Clustered UPDATE In-Place Zero-Alloc Fast Path (Phase 39.22)

fused_clustered_scan_patch in executor/update_fused_range.rs implements a zero-allocation UPDATE fast path for clustered tables when all SET columns are fixed-size.

Allocation audit

| Allocation | Before 39.22 | After 39.22 |
|---|---|---|
| `cell.row_data.to_vec()` (phase-1 offset scan) | 1× per matched row | ❌ eliminated |
| `patched_data = ...clone()` (phase-2 mutation) | 1× per matched row | ❌ eliminated |
| `encode_cell_image()` in overflow path | 1× per matched row | ✅ overflow-only |
| `FieldDelta.old_bytes: Vec<u8>` | 1× per changed field | ❌ → `[u8;8]` inline |
| `FieldDelta.new_bytes: Vec<u8>` | 1× per changed field | ❌ → `[u8;8]` inline |

For 25K rows with 1 changed column each: ~125K heap allocations → 0.
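The inline-delta shape the audit describes can be sketched like this (field names are assumptions taken from the table rows; the point is that old/new images live in fixed-size arrays, not Vecs):

```rust
// Fixed-size field delta: old/new images are inline [u8; 8] arrays,
// so recording a changed field performs zero heap allocations.
struct FieldDelta {
    field_abs_off: usize, // absolute byte offset of the field in the page
    len: u8,              // encoded field size; ≤ 8 for fixed-size columns
    old_bytes: [u8; 8],
    new_bytes: [u8; 8],
}

impl FieldDelta {
    fn new(field_abs_off: usize, old: &[u8], new: &[u8]) -> Self {
        assert!(old.len() == new.len() && old.len() <= 8);
        let mut d = FieldDelta {
            field_abs_off,
            len: old.len() as u8,
            old_bytes: [0; 8],
            new_bytes: [0; 8],
        };
        // Copy the encoded images into the inline buffers.
        d.old_bytes[..old.len()].copy_from_slice(old);
        d.new_bytes[..new.len()].copy_from_slice(new);
        d
    }
}
```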

Two-phase borrow pattern

The Rust borrow checker requires releasing the immutable page borrow before taking a mutable one. The fast path uses a split-phase approach:

Read phase (immutable borrow on page):
  1. cell_row_data_abs_off(&page, idx) → (row_data_abs_off, key_len)
  2. compute_field_location_runtime(row_slice, bitmap) → FieldLocation
  3. MAYBE_NOP: if page_bytes[field_abs..][..size] == new_encoded[..size] { skip }
  4. Capture old_buf: [u8;8] and new_encoded: [u8;8] on the stack

Write phase (mutable borrow after immutable dropped):
  5. patch_field_in_place(&mut page, field_abs, new_buf[..size])
  6. update_row_header_in_place(&mut page, idx, &new_header)
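The steps above can be reduced to a minimal shape over a plain byte buffer. In the real code the friction comes from page-guard types rather than a bare slice, but the structure is the same: the read phase copies everything it needs onto the stack inside a block, so the immutable borrow ends before the mutable write begins:

```rust
// Patch one fixed-size field, returning the old bytes for the undo log.
fn patch_field_split_phase(page: &mut [u8], field_abs: usize, new_encoded: &[u8; 8]) -> [u8; 8] {
    // Read phase: snapshot the old field image under an immutable borrow.
    let old_buf: [u8; 8] = {
        let view: &[u8] = page; // immutable reborrow of the page
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&view[field_abs..field_abs + 8]);
        buf
    }; // immutable borrow dropped here

    // Write phase: the mutable borrow is now the only borrow.
    page[field_abs..field_abs + 8].copy_from_slice(new_encoded);
    old_buf
}
```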

MAYBE_NOP (byte-identity check)

If the new encoded bytes are byte-identical to the existing page bytes (e.g., SET score = score * 1 after integer multiplication), the field is skipped entirely — no WAL delta, no header bump, no page write for that field. This is an O(size) byte comparison (~4–8 bytes) before any mutation.
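The check itself is a plain slice comparison (helper name assumed for illustration):

```rust
// MAYBE_NOP: true when the new encoded bytes already match the page,
// so the field needs no WAL delta, no header bump, and no page write.
fn is_nop_patch(page_bytes: &[u8], field_abs: usize, new_encoded: &[u8]) -> bool {
    &page_bytes[field_abs..field_abs + new_encoded.len()] == new_encoded
}
```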

Overflow fallback

Cells with overflow_first_page.is_some() are rare (<1% of typical workloads) and fall back to the existing rewrite_cell_same_key_with_overflow path (full cell re-encode). The fast path only applies to inline cells.

🚀
Performance Advantage — 5 Allocations → 0 Per Row
MariaDB's InnoDB in-place UPDATE (btr_cur_upd_rec_in_place) still allocates an undo record per row for ROLLBACK support. AxiomDB's UndoClusteredFieldPatch stores undo data as inline [u8;8] arrays in the undo log entry — zero heap allocation per row even for ROLLBACK support. For a 25K-row UPDATE t SET score = score + 1, this reduces allocator pressure from ~125K allocs to zero.

Strict Mode and Warning 1265

SessionContext.strict_mode is a bool flag (default true) that controls how INSERT and UPDATE column coercion failures are handled.

Coercion paths

INSERT / UPDATE column value assignment:
  if ctx.strict_mode:
    coerce(value, target_type, CoercionMode::Strict)
      → Ok(v)    : use v
      → Err(e)   : return Err immediately (SQLSTATE 22018)
  else:
    coerce(value, target_type, CoercionMode::Strict)
      → Ok(v)    : use v  (no warning — strict succeeded)
      → Err(_)   : try CoercionMode::Permissive
          → Ok(v) : use v, emit ctx.warn(1265, "Data truncated for column '<col>' at row <n>")
          → Err(e): return Err (both paths failed)

CoercionMode::Permissive performs best-effort conversion: '42abc' → 42, 'abc' → 0, and overflowing integers are clamped to the type bounds.
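A sketch of the permissive TEXT → INT behavior described above (illustrative only; the real CoercionMode::Permissive covers other type pairs as well):

```rust
// Permissive best-effort TEXT → INT: take the longest leading numeric
// prefix, default to 0 when there is none, clamp on overflow.
fn permissive_to_i64(s: &str) -> i64 {
    let t = s.trim();
    let mut end = 0;
    for (i, c) in t.char_indices() {
        if c.is_ascii_digit() || (i == 0 && (c == '-' || c == '+')) {
            end = i + c.len_utf8();
        } else {
            break;
        }
    }
    match &t[..end] {
        "" | "-" | "+" => 0, // no numeric prefix: 'abc' → 0
        prefix => prefix.parse::<i64>().unwrap_or(
            // parse only fails on overflow here, so clamp to the bounds
            if prefix.starts_with('-') { i64::MIN } else { i64::MAX },
        ),
    }
}
```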

Row numbering

insert_row_with_ctx and insert_rows_batch_with_ctx accept an explicit row_num: usize (1-based). The VALUES loop in execute_insert_ctx passes row_idx + 1 from enumerate():

#![allow(unused)]
fn main() {
for (row_idx, value_exprs) in rows.into_iter().enumerate() {
    let values = eval_value_exprs(value_exprs, ...)?;
    engine.insert_row_with_ctx(&mut ctx, values, row_idx + 1)?;
}
}

This makes warning 1265 messages meaningful for multi-row inserts: "Data truncated for column 'stock' at row 2".

SET strict_mode / SET sql_mode

The executor intercepts SET strict_mode and SET sql_mode in execute_set_ctx (called from dispatch_ctx). It delegates to helpers from session.rs:

#![allow(unused)]
fn main() {
"strict_mode" => {
    let b = parse_boolish_setting(&raw)?;
    ctx.strict_mode = b;
}
"sql_mode" => {
    let normalized = normalize_sql_mode(&raw);
    ctx.strict_mode = sql_mode_is_strict(&normalized);
}
}

The wire layer (handler.rs) syncs the wire-visible @@sql_mode and @@strict_mode variables with the session bool after every intercepted SET. Both variables are surfaced in SHOW VARIABLES.
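A plausible sketch of the two helpers' behavior (the real implementations live in session.rs; the names match the doc, but the bodies and return types here are assumptions):

```rust
// Normalize a raw sql_mode value: split on commas, trim, uppercase.
fn normalize_sql_mode(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(|f| f.trim().to_ascii_uppercase())
        .filter(|f| !f.is_empty())
        .collect()
}

// MySQL treats either STRICT flag as enabling strict mode.
fn sql_mode_is_strict(flags: &[String]) -> bool {
    flags
        .iter()
        .any(|f| f == "STRICT_TRANS_TABLES" || f == "STRICT_ALL_TABLES")
}
```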

⚙️
Design Decision — Try Strict First, Then Permissive
In permissive mode, AxiomDB always tries strict coercion first. A warning is only emitted when the strict path fails and the permissive path succeeds. This means values that coerce cleanly in strict mode (e.g. '42' → 42) never generate a warning in either mode — matching MySQL 8's behavior where warning 1265 is reserved for actual data loss, not clean widening.