SQL Executor
The executor is the component that interprets an analyzed Stmt (all column
references resolved to col_idx by the semantic analyzer) and drives it to
completion, returning a QueryResult. It is the highest-level component in the
query pipeline.
Since subphase 5.19a, the executor no longer lives in a single source file.
It is organized under crates/axiomdb-sql/src/executor/ with mod.rs as the
stable facade and responsibility-based source files behind it.
Source Layout
| File | Responsibility |
|---|---|
executor/mod.rs | public facade, statement dispatch, thread-local last-insert-id |
executor/shared.rs | helpers shared across multiple statement families |
executor/select.rs | SELECT entrypoints, projection, ORDER BY/LIMIT wiring |
executor/select_core.rs | core SELECT evaluation loop (heap + clustered paths) |
executor/select_ctx.rs | SelectCtx builder and scan context |
executor/select_helpers.rs | projection, column resolution, row materialization |
executor/select_joins_ctx.rs | join context and nested-loop join helpers |
executor/aggregate.rs | GROUP BY, aggregates, DISTINCT/group-key helpers |
executor/insert.rs | INSERT and INSERT … SELECT paths |
executor/insert_heap_ctx.rs | heap INSERT context and batch path |
executor/insert_helpers.rs | AUTO_INCREMENT, constraint checks, default values |
executor/update_ctx.rs | UPDATE context builder |
executor/update_candidates.rs | candidate row collection for heap/clustered UPDATE |
executor/update_entry.rs | single-row UPDATE application |
executor/update_clustered.rs | clustered UPDATE orchestration |
executor/update_clustered_helpers.rs | clustered UPDATE helpers (patch, zero-alloc fast path) |
executor/update_fused_range.rs | fused clustered scan+patch for range UPDATE |
executor/delete.rs | DELETE execution and candidate collection |
executor/bulk_empty.rs | shared bulk-empty helpers for DELETE/TRUNCATE |
executor/staging.rs | transactional INSERT staging flushes and barrier handling |
Integration Test Layout
The executor integration coverage no longer sits in one giant test binary. The
current axiomdb-sql/tests/ layout is responsibility-based, mirroring the
module split in src/executor/.
| Binary | Main responsibility |
|---|---|
integration_executor | CRUD base and simple transaction behavior |
integration_executor_joins | JOINs and aggregate execution |
integration_executor_query | ORDER BY, LIMIT, DISTINCT, CASE, INSERT ... SELECT, AUTO_INCREMENT |
integration_executor_ddl | SHOW, DESCRIBE, TRUNCATE, ALTER TABLE |
integration_executor_ctx | base SessionContext execution and strict_mode |
integration_executor_ctx_group | ctx-path sorted group-by |
integration_executor_ctx_limit | ctx-path LIMIT / OFFSET coercion |
integration_executor_ctx_on_error | ctx-path on_error behavior |
integration_executor_sql | broader SQL semantics outside the ctx path |
integration_delete_apply | bulk and indexed DELETE apply paths |
integration_insert_staging | transactional INSERT staging |
integration_namespacing | database catalog behavior: CREATE/DROP DATABASE, USE, SHOW DATABASES |
integration_namespacing_cross_db | explicit database.schema.table resolution and cross-db DML/DDL |
integration_namespacing_schema | schema namespacing, search_path, and schema-aware SHOW TABLES |
Shared helpers live in crates/axiomdb-sql/tests/common/mod.rs.
The day-to-day workflow is intentionally narrow:
- start with the smallest binary that matches the code path you changed
- add directly related binaries only when the change touches shared helpers or a nearby execution path
- use `cargo test -p axiomdb-sql --tests` as the crate-level confidence gate, not as the default inner-loop command
- if a new behavior belongs to an existing themed binary, add the test there instead of creating a new binary immediately

```sh
cargo test -p axiomdb-sql --test integration_executor_query
cargo test -p axiomdb-sql --test integration_executor_query test_insert_select_aggregation -- --exact
```
UPDATE Apply Fast Path (6.20)
6.17 fixed indexed UPDATE discovery, but the default update_range benchmark
was still paying most of its cost after rows had already been found. 6.20
removes that apply-side overhead in four steps:
1. `IndexLookup`/`IndexRange` candidates are decoded through `TableEngine::read_rows_batch(...)`, which groups `RecordId`s by `page_id` and restores the original RID order after each page is read once.
2. UPDATE evaluates the new row image before touching the heap and drops rows whose `new_values == old_values`.
3. Stable-RID rewrites accumulate `(key, old_tuple_image, new_tuple_image, page_id, slot_id)` and emit their normal `UpdateInPlace` WAL records through one `record_update_in_place_batch(...)` call.
4. If any index really is affected, UPDATE now does one grouped delete pass, one grouped insert pass, and one final root persistence write per index.
The coarse executor bailout is statement-level:
```text
if all physically changed rows keep the same RID
   and no SET column overlaps any index key column
   and no SET column overlaps any partial-index predicate dependency:
    skip index maintenance for the statement
```
This is the common PK-only `UPDATE score ... WHERE id BETWEEN ...` case in
local_bench.py.
`indexColumnIsBeingUpdated()` and MariaDB's clustered-vs-secondary UPDATE split ask the same question: did any index-relevant attribute actually change? 6.20 adapts that rule directly, without adding HOT chains or change buffering.
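As a minimal sketch of that statement-level bailout, assuming hypothetical names (`can_skip_index_maintenance`, `Rid`) that do not appear in the executor:

```rust
// Hypothetical sketch of the 6.20 statement-level bailout.
// `Rid::Stable` marks a physically changed row that kept its RID.
#[derive(Clone, Copy, PartialEq)]
pub enum Rid {
    Stable,
    Moved,
}

/// `set_cols`: column indexes assigned by SET.
/// `index_key_cols` / `partial_pred_cols`: per-index column dependencies.
pub fn can_skip_index_maintenance(
    changed_rows: &[Rid],
    set_cols: &[usize],
    index_key_cols: &[Vec<usize>],
    partial_pred_cols: &[Vec<usize>],
) -> bool {
    let all_stable = changed_rows.iter().all(|r| *r == Rid::Stable);
    // A SET column overlapping any dependency set forces index maintenance.
    let overlaps = |deps: &[Vec<usize>]| {
        deps.iter().any(|cols| cols.iter().any(|c| set_cols.contains(c)))
    };
    all_stable && !overlaps(index_key_cols) && !overlaps(partial_pred_cols)
}
```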
Entry Point
```rust
pub fn execute(
    stmt: Stmt,
    storage: &mut dyn StorageEngine,
    txn: &mut TxnManager,
) -> Result<QueryResult, DbError>
```
When no transaction is active, execute wraps the statement in an implicit
BEGIN / COMMIT (autocommit mode). Transaction control statements (BEGIN,
COMMIT, ROLLBACK) bypass autocommit and operate on TxnManager directly.
All reads use txn.active_snapshot()? — a snapshot fixed at BEGIN — so that
writes made earlier in the same transaction are visible (read-your-own-writes).
Transactional INSERT staging (Phase 5.21)
5.21 adds a statement-boundary staging path for consecutive
INSERT ... VALUES statements inside one explicit transaction.
Data structure
SessionContext now owns:
```rust
PendingInsertBatch {
    table_id: u32,
    table_def: TableDef,
    columns: Vec<ColumnDef>,
    indexes: Vec<IndexDef>,
    compiled_preds: Vec<Option<Expr>>,
    rows: Vec<Vec<Value>>,
    unique_seen: HashMap<u32, HashSet<Vec<u8>>>,
}
```
The buffer exists only while the connection is inside an explicit transaction. Autocommit-wrapped single statements do not use it.
Enqueue path
For every eligible INSERT row, executor/insert.rs does all logical work up
front:
- evaluate expressions
- expand omitted columns
- assign AUTO_INCREMENT if needed
- run CHECK constraints
- run FK child validation
- reject duplicate UNIQUE / PK keys against:
  - committed index state
  - `unique_seen` inside the current batch
- append the fully materialized row to `PendingInsertBatch.rows`
No heap write or WAL append happens yet.
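A minimal sketch of the intra-batch duplicate check, assuming the `unique_seen` shape shown above (`HashMap<u32, HashSet<Vec<u8>>>` keyed by index id; the helper type and method names are hypothetical):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical wrapper around the unique_seen field of PendingInsertBatch.
pub struct UniqueSeen {
    per_index: HashMap<u32, HashSet<Vec<u8>>>,
}

impl UniqueSeen {
    pub fn new() -> Self {
        Self { per_index: HashMap::new() }
    }

    /// Returns false when `key_bytes` was already staged for `index_id`,
    /// i.e. a duplicate UNIQUE/PK key inside the current batch.
    pub fn try_stage(&mut self, index_id: u32, key_bytes: Vec<u8>) -> bool {
        self.per_index.entry(index_id).or_default().insert(key_bytes)
    }
}
```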
Flush barriers
The batch is flushed before:
- `SELECT`
- `UPDATE`
- `DELETE`
- DDL
- `COMMIT`
- a table switch to another `INSERT` target
- any ineligible `INSERT` shape
ROLLBACK discards the batch without heap or WAL writes.
Savepoint ordering invariant
When a transaction uses statement-level savepoints (rollback_statement,
savepoint, ignore), the executor must flush staged rows before taking
the next statement savepoint if the current statement cannot continue the batch.
Without that ordering, a failing statement after a table switch could roll back rows that logically belonged to earlier successful INSERT statements.
Flush algorithm
executor/staging.rs performs:
- `TableEngine::insert_rows_batch_with_ctx(...)`
- `batch_insert_into_indexes(...)`
- one `CatalogWriter::update_index_root(...)` per changed index
- stats update
The current design still inserts index entries row-by-row inside the flush. That cost is explicit and remains the next insert-side optimization candidate if future profiling shows it dominates after staging.
ClusteredInsertBatch (Phase 40.1)
Phase 40.1 extends the PendingInsertBatch pattern to clustered (primary-key
ordered) tables, eliminating the per-row CoW B-tree overhead that made clustered
inserts 2× slower than heap inserts inside explicit transactions.
Root cause of the pre-40.1 gap
Before 40.1, every clustered INSERT inside an explicit transaction called
apply_clustered_insert_rows immediately, which performs:
- `storage.read_page(root)` — 16 KB page read
- `storage.write_page(new_root, page)` — 16 KB CoW page write
- WAL append
- secondary index write
For N = 50 000 rows that is 100 000 storage operations just for the base tree.
Data structures
```rust
// session.rs
pub struct StagedClusteredRow {
    pub values: Vec<Value>,
    pub encoded_row: Vec<u8>,
    pub primary_key_values: Vec<Value>,
    pub primary_key_bytes: Vec<u8>,
}

pub struct ClusteredInsertBatch {
    pub table_id: u32,
    pub table_def: TableDef,
    pub primary_idx: IndexDef,
    pub secondary_indexes: Vec<IndexDef>,
    pub secondary_layouts: Vec<ClusteredSecondaryLayout>,
    pub compiled_preds: Vec<Option<Expr>>,
    pub rows: Vec<StagedClusteredRow>,
    pub staged_pks: HashSet<Vec<u8>>, // O(1) intra-batch PK dedup
}
```
StagedClusteredRow is structurally identical to PreparedClusteredInsertRow
(defined in clustered_table.rs) but lives in session.rs to avoid a circular
dependency: clustered_table.rs imports SessionContext.
Enqueue path (enqueue_clustered_insert_ctx)
For each row in the VALUES list:
1. Evaluate expressions, expand columns, assign AUTO_INCREMENT.
2. Validate CHECK constraints and FK child references.
3. Encode via `prepare_row_with_ctx` (coerce + PK extract + row codec).
4. Check `staged_pks` — return `UniqueViolation` and discard the batch on an intra-batch PK duplicate.
5. Push `StagedClusteredRow` and insert the PK bytes into `staged_pks`.
Committed-data PK duplicates are caught at flush time by lookup_physical inside
apply_clustered_insert_rows (same as the pre-40.1 single-statement path).
Flush path (flush_clustered_insert_batch)
```text
1. Sort staged rows ascending by pk_bytes
   → enables append-biased detection in apply_clustered_insert_rows
2. Convert StagedClusteredRow → PreparedClusteredInsertRow (field move)
3. Call apply_clustered_insert_rows (existing function):
   a. detect append-biased pattern (all PKs increasing)
   b. loop: try_insert_rightmost_leaf_batch → fast O(leaf_capacity) write
      fallback: single-row clustered_tree::insert
   c. WAL record_clustered_insert per row
   d. maintain_clustered_secondary_inserts per row
   e. persist changed roots
4. ctx.stats.on_rows_changed + ctx.invalidate_all
```
The ascending sort is what allows try_insert_rightmost_leaf_batch to fill each leaf page once.
Barrier detection
should_flush_clustered_batch_before_stmt returns false only when the next
statement is a VALUES INSERT into the same clustered table (batch continues).
For all other statements, the batch is flushed before dispatch. This mirrors the
existing should_flush_pending_inserts_before_stmt logic for heap tables.
ROLLBACK discards the batch via discard_clustered_insert_batch() — no storage
writes, no WAL entries, no undo needed.
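The barrier rule can be sketched as follows, over a simplified `Stmt` that models only the fields the rule inspects (an assumption, not the real AST; the function name is shortened):

```rust
// Reduced statement shape for illustration only.
pub enum Stmt {
    Insert { table_id: u32, is_values: bool },
    Select,
    Commit,
}

/// The clustered batch continues only for a VALUES INSERT into the same
/// table; every other statement flushes the batch before dispatch.
pub fn should_flush_before(batch_table_id: u32, next: &Stmt) -> bool {
    !matches!(
        next,
        Stmt::Insert { table_id, is_values: true } if *table_id == batch_table_id
    )
}
```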
CREATE INDEX on clustered tables (Phase 40.1b)
execute_create_index (ddl.rs) now handles both heap and clustered tables with a
single function. The dispatch happens after the B-Tree root page is allocated:
```text
if table_def.is_clustered() {
    primary_idx ← CatalogReader::list_indexes → find(is_primary)
    preview_def ← IndexDef { columns, is_unique, fillfactor, root_page_id, … }
    layout ← ClusteredSecondaryLayout::derive(&preview_def, &primary_idx)
    rows ← scan_clustered_table(storage, &table_def, &col_defs, snap)
    for row in rows:
        if partial predicate → skip non-matching rows
        entry = layout.entry_from_row(row) → physical_key for bloom
        layout.insert_row(storage, &root_pid, row) → uniqueness + B-Tree insert
} else {
    rows ← scan_table(…)
    for row in rows: encode_index_key + BTree::insert_in
}
// step 8: stats bootstrap uses same `rows` Vec — no extra I/O
```
The ClusteredSecondaryLayout encodes the physical key as
secondary_cols ++ suffix_primary_cols — exactly the format used by runtime
INSERT/UPDATE/DELETE, so a clustered secondary index built by CREATE INDEX
is byte-for-byte compatible with those written by the DML executors.
entry_from_row is called once per row to collect the physical key for the bloom
filter, and insert_row calls it again internally for the B-Tree write. This is
acceptable overhead during a DDL operation (O(n) with constant factor ≈2).
NULL handling
ClusteredSecondaryLayout::entry_from_row returns None when any secondary column
is NULL. Both the bloom key collection and the B-Tree insert are skipped in that case,
consistent with the runtime INSERT path and SQL standard NULL semantics for indexes.
Uniqueness enforcement
insert_row delegates uniqueness to ensure_unique_logical_key_absent, the same
function used at runtime. If an existing row already carries that logical key, the
build fails with DbError::UniqueViolation before the catalog entry is written.
Query Pipeline
```text
SQL string
  → tokenize()   logos DFA, ~85 tokens, zero-copy &str
  → parse()      recursive descent, produces Stmt with col_idx = 0
  → analyze()    BindContext resolves every col_idx
  → execute()    dispatches to per-statement handler
       ├── scan_table    HeapChain::scan_visible + decode_row
       ├── filter        eval(WHERE, &row) + is_truthy
       ├── join          nested-loop, apply_join
       ├── aggregate     hash-based GroupState
       ├── sort          apply_order_by, compare_sort_values
       ├── deduplicate   apply_distinct, value_to_key_bytes
       ├── project       project_row / project_grouped_row
       └── paginate      apply_limit_offset
  → QueryResult::Rows / Affected / Empty
```
JOIN — Nested Loop
Phase 4 implements nested-loop joins. All tables are pre-scanned once before any loop begins — scanning inside the inner loop would re-read the same data O(n) times and could see partially-inserted rows.
Algorithm
```text
scanned[0] = scan(FROM table)
scanned[1] = scan(JOIN[0] table)
...
combined_rows = scanned[0]
for each JoinClause in stmt.joins:
    combined_rows = apply_join(combined_rows, scanned[i+1], join_type, ON/USING)
```
apply_join per type
| Join type | Behavior |
|---|---|
INNER / CROSS | Emit combined row for each pair where ON is truthy |
LEFT | Emit all left rows; unmatched left → right side padded with NULL |
RIGHT | Emit all right rows; unmatched right → left side padded with NULL; uses a matched_right: Vec<bool> bitset |
FULL | NotImplemented — Phase 4.8+ |
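A minimal sketch of the nested-loop LEFT branch, with SQL NULL modeled as `Option` (the real `apply_join` evaluates ON expressions through the evaluator and also handles RIGHT and USING):

```rust
// A row is a vector of nullable values in this sketch.
type Row = Vec<Option<i64>>;

/// Nested-loop LEFT join over pre-scanned row sets: every left row is
/// emitted; unmatched left rows get the right side padded with NULLs.
pub fn left_join(
    left: &[Row],
    right: &[Row],
    on: impl Fn(&Row, &Row) -> bool,
    right_width: usize,
) -> Vec<Row> {
    let mut out = Vec::new();
    for l in left {
        let mut matched = false;
        for r in right {
            if on(l, r) {
                let mut combined = l.clone();
                combined.extend(r.iter().cloned());
                out.push(combined);
                matched = true;
            }
        }
        if !matched {
            // Unmatched left row: pad the right side with NULLs.
            let mut combined = l.clone();
            combined.extend(std::iter::repeat(None).take(right_width));
            out.push(combined);
        }
    }
    out
}
```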
USING condition
USING(col_name) is resolved at execution time using left_schema: Vec<(name, col_idx)>,
accumulated across all join stages. The condition combined[left_idx] == combined[right_idx]
uses SQL equality — NULL = NULL returns UNKNOWN (false), so NULLs never match in USING.
GROUP BY — Strategy Selection (Phase 4.9b)
The executor selects between two GROUP BY execution strategies at runtime:
| Strategy | When selected | Behavior |
|---|---|---|
Hash | Default; JOINs; derived tables; plain scans | HashMap per group key; O(k) memory |
Sorted { presorted: true } | Single-table ctx path + compatible B-Tree index | Stream adjacent equal groups; O(1) memory |
```rust
enum GroupByStrategy {
    Hash,
    Sorted { presorted: bool },
}
```
Strategy selection (choose_group_by_strategy_ctx) is only active on the
single-table ctx path (execute_with_ctx). All JOIN, derived-table, and
non-ctx paths use Hash.
Prefix Match Rule
The sorted strategy is selected when all four conditions hold:
- The access method is `IndexLookup`, `IndexRange`, or `IndexOnlyScan`.
- Every `GROUP BY` expression is a plain `Expr::Column` (no function calls, no aliases).
- The column references match the leading key prefix of the chosen index in the same order.
- The prefix length ≤ number of index columns.
Examples (index (region, dept)):
| GROUP BY | Result |
|---|---|
region, dept | ✅ Sorted |
region | ✅ Sorted (prefix) |
dept, region | ❌ Hash (wrong order) |
LOWER(region) | ❌ Hash (computed expression) |
This is correct because BTree::range_in guarantees rows arrive in key order,
and equal leading prefixes are contiguous even with extra suffix columns or RID
suffixes on non-unique indexes.
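The prefix rule reduces to a simple positional check, sketched here over resolved `col_idx` values (the function name is hypothetical):

```rust
/// True when the GROUP BY column list is exactly a leading prefix of the
/// index key columns, in the same order.
pub fn group_by_matches_index_prefix(group_by: &[usize], index_cols: &[usize]) -> bool {
    !group_by.is_empty()
        && group_by.len() <= index_cols.len()
        && group_by.iter().zip(index_cols.iter()).all(|(g, i)| g == i)
}
```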
PostgreSQL models GroupAggregate (sorted) and HashAggregate as separate strategies selected at planning time (pathnodes.h). DuckDB selects the aggregation strategy at physical plan time based on input guarantees. AxiomDB borrows the two-strategy concept but selects at execution time using the already-chosen access method — no separate planner pass needed.
GROUP BY — Hash Aggregation
GROUP BY uses a single-pass hash aggregation strategy: one scan through the filtered rows, accumulating aggregate state per group key.
Specialized Hash Tables (subphase 39.21)
Two hash table types avoid generic dispatch overhead:
- `GroupTablePrimitive` — single-column GROUP BY on integer-like values (`INT`, `BIGINT`, `DOUBLE`, `Bool`). Maps `i64` → `GroupEntry` via `hashbrown::HashMap<i64, usize>`. No key serialization needed; comparison is a single integer equality check.
- `GroupTableGeneric` — multi-column GROUP BY, TEXT columns, mixed types, and the global no-GROUP-BY case. Serializes group keys into a `Vec<u8>` reused across rows (zero allocation when capacity fits), maps `&[u8]` → `GroupEntry` via `hashbrown::HashMap<Box<[u8]>, usize>`.
Both tables store entries in a Vec<GroupEntry> and use the hash maps as index
structures. This keeps entries contiguous in memory and avoids pointer chasing
during the accumulation loop.
hashbrown (the same table backing Rust's std::HashMap) uses SIMD-accelerated quadratic probing (SSE2/NEON). For a 62-group workload over 50K rows, this cuts probe overhead by ~30% vs a naïve open-addressing table. The specialized GroupTablePrimitive path avoids serialization entirely, reducing per-row work to one integer hash + one equality check.
Group Key Serialization
Value contains f64 which does not implement Hash in Rust. AxiomDB uses a
custom self-describing byte serialization instead of the row codec:
```text
value_to_key_bytes(Value::Null)    → [0x00]
value_to_key_bytes(Value::Int(n))  → [0x02, n as 4 LE bytes]
value_to_key_bytes(Value::Text(s)) → [0x06, len as 4 LE bytes, UTF-8 bytes]
...
```
Two NULL values produce identical bytes [0x00] → they form one group.
This matches SQL GROUP BY semantics: NULLs are considered equal for grouping
(unlike NULL = NULL in comparisons, which is UNKNOWN).
The group key for a multi-column GROUP BY is the concatenation of all column
serializations. The key_buf: Vec<u8> is allocated once before the scan loop
and reused (with clear() + extend_from_slice) for every row, so multi-column
GROUP BY does not allocate per row for the probe step.
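A sketch of that encoding with the reusable buffer, covering only the three variants whose tag bytes the section lists (the real `Value` has more variants):

```rust
// Reduced Value with the three variants tagged above (0x00, 0x02, 0x06).
pub enum Value {
    Null,
    Int(i32),
    Text(String),
}

/// Appends the self-describing key bytes for `v` to a reusable buffer,
/// mirroring the clear() + extend pattern described above.
pub fn value_to_key_bytes(v: &Value, out: &mut Vec<u8>) {
    match v {
        Value::Null => out.push(0x00),
        Value::Int(n) => {
            out.push(0x02);
            out.extend_from_slice(&n.to_le_bytes()); // 4 LE bytes
        }
        Value::Text(s) => {
            out.push(0x06);
            out.extend_from_slice(&(s.len() as u32).to_le_bytes());
            out.extend_from_slice(s.as_bytes());
        }
    }
}
```

Because `Null` always encodes to the same single byte, two NULL group keys collide into one group, exactly as the grouping semantics above require.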
GroupEntry
Each unique group key maps to a GroupEntry:
```rust
struct GroupEntry {
    key_values: Vec<Value>,          // GROUP BY expression results (for output)
    non_agg_col_values: Vec<Value>,  // non-aggregate SELECT cols (for HAVING/output)
    accumulators: Vec<AggAccumulator>,
}
```
non_agg_col_values is a sparse slice: only columns referenced by non-aggregate
SELECT items or HAVING expressions are stored. Their indices are pre-computed once
(compute_non_agg_col_indices) before the scan loop and reused for every group.
A simpler design would store representative_row: Row — the full first source row per group — to resolve HAVING column references. That costs one full Vec<Value> clone per group, regardless of how many columns HAVING actually needs. non_agg_col_values instead stores only the columns referenced by non-aggregate SELECT items and HAVING, computed once before the scan loop. For a 6-column table where HAVING references 1 column, this reduces per-group memory by ~83%.
Aggregate Accumulators
| Aggregate | Accumulator | NULL behavior |
|---|---|---|
COUNT(*) | u64 counter | Increments for every row |
COUNT(col) | u64 counter | Skips rows where col is NULL |
SUM(col) | Option<Value> | Skips NULL; None if all rows are NULL |
MIN(col) | Option<Value> | Skips NULL; tracks running minimum |
MAX(col) | Option<Value> | Skips NULL; tracks running maximum |
AVG(col) | (sum: Value, count: u64) | Skips NULL; final = sum / count as Real |
AVG always returns Real (SQL standard), even for integer columns. This
avoids integer truncation (MySQL-style AVG(INT) returns DECIMAL but truncates
in many contexts). AVG of all-NULL rows returns NULL.
Fast-path arithmetic (value_agg_add): For SUM, MIN, MAX, and COUNT,
the accumulator is updated via direct arithmetic on Value variants, bypassing
eval(). This eliminates the expression evaluator overhead for the innermost
loop of the aggregate scan.
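The NULL-skipping contract for SUM and AVG can be sketched like this (i64-only accumulators for brevity; the real `AggAccumulator` works over `Value` with type promotion):

```rust
/// SUM over i64: None until the first non-NULL input; NULLs are skipped.
pub struct SumAcc(pub Option<i64>);

/// AVG over i64: finalizes as Real (f64); all-NULL input yields None.
pub struct AvgAcc {
    pub sum: i64,
    pub count: u64,
}

impl SumAcc {
    pub fn update(&mut self, v: Option<i64>) {
        if let Some(n) = v {
            self.0 = Some(self.0.unwrap_or(0) + n);
        } // None (NULL) input: accumulator unchanged
    }
}

impl AvgAcc {
    pub fn update(&mut self, v: Option<i64>) {
        if let Some(n) = v {
            self.sum += n;
            self.count += 1;
        }
    }

    /// AVG always divides as floating point, avoiding integer truncation.
    pub fn finalize(&self) -> Option<f64> {
        (self.count > 0).then(|| self.sum as f64 / self.count as f64)
    }
}
```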
Ungrouped Aggregates
SELECT COUNT(*) FROM t (no GROUP BY) is handled as a single-group query with
an empty key. Even on an empty table, the executor emits exactly one output
row — (0) for COUNT(*), NULL for SUM/MIN/MAX/AVG. This matches the
SQL standard and every major database.
Column Decode Mask
Before scanning, collect_expr_columns walks all expressions in SELECT items,
WHERE, GROUP BY, HAVING, and ORDER BY to build a Vec<bool> mask indexed by
column position. Only columns with mask[i] == true are decoded from the row
bytes. For a SELECT age, AVG(score) FROM users GROUP BY age query on a
6-column table, this skips decoding name and email (TEXT fields) entirely.
The mask is forwarded to scan_clustered_table_masked as Option<&[bool]> and
passed into decode_row_masked at the codec level, which skips variable-length
fields that are not needed.
GROUP BY — Sorted Streaming Executor (Phase 4.9b)
The sorted executor replaces the hash table with a single linear pass over pre-ordered rows, accumulating state for the current group and emitting it when the key changes.
Algorithm
```text
rows_with_keys = [(row, eval(group_by exprs, row)) for row in combined_rows]
if !presorted:
    stable_sort rows_with_keys by compare_group_key_lists

current_key = rows_with_keys[0].key_values
current_accumulators = AggAccumulator::new() for each aggregate
update accumulators with rows_with_keys[0].row

for next in rows_with_keys[1..]:
    if group_keys_equal(current_key, next.key_values):
        update accumulators with next.row
    else:
        finalize → apply HAVING → emit output row
        reset: current_key = next.key_values, new accumulators, update

finalize last group
```
Key Comparison
```rust
fn compare_group_key_lists(a: &[Value], b: &[Value]) -> Ordering
fn group_keys_equal(a: &[Value], b: &[Value]) -> bool
```
Uses compare_values_null_last so NULL == NULL for grouping (consistent with
the hash path’s serialization). Comparison is left-to-right: returns the first
non-Equal ordering.
Shared Aggregate Machinery
Both hash and sorted executors reuse the same:
- `AggAccumulator` (state, update, finalize)
- `eval_with_aggs` (HAVING evaluation)
- `project_grouped_row` (output projection)
- `build_grouped_column_meta` (column metadata)
- GROUP_CONCAT handling
- post-group DISTINCT / ORDER BY / LIMIT
The payoff is O(1) accumulator memory (one group at a time) instead of O(k), where k = distinct groups. For a high-cardinality column with many distinct values, this eliminates the entire hash table allocation.
ORDER BY — Multi-Column Sort
ORDER BY is applied after scan + filter + aggregation but before projection
for non-GROUP BY queries. For GROUP BY queries, it is applied to the projected
output rows after remap_order_by_for_grouped rewrites column references.
ORDER BY in GROUP BY Context — Expression Remapping
Grouped output rows are indexed by SELECT output position: position 0 = first
SELECT item, position 1 = second, etc. ORDER BY expressions, however, are
analyzed against the source schema where Expr::Column { col_idx } refers to
the original table column.
remap_order_by_for_grouped fixes this mismatch before calling apply_order_by:
```text
remap_order_by_for_grouped(order_by, select_items):
    for each ORDER BY item:
        rewrite expr via remap_expr_for_grouped(expr, select_items)

remap_expr_for_grouped(expr, select_items):
    if expr == select_items[pos].expr (structural PartialEq):
        return Column { col_idx: pos }   // output position
    match expr:
        BinaryOp → recurse into left, right
        UnaryOp  → recurse into operand
        IsNull   → recurse into inner
        Between  → recurse into expr, low, high
        Function → recurse into args
        other    → return unchanged
```
This means ORDER BY dept (where dept is Expr::Column{col_idx:1} in the
source) becomes Expr::Column{col_idx:0} when the SELECT is SELECT dept, COUNT(*),
correctly indexing into the projected output row.
Aggregate expressions like ORDER BY COUNT(*) are matched structurally:
if Expr::Function{name:"count", args:[]} appears in the SELECT at position 1,
it is rewritten to Expr::Column{col_idx:1}.
AxiomDB relies on structural PartialEq on Expr (which is derived) to identify ORDER BY expressions that match SELECT items. This is simpler than PostgreSQL's SortClause/TargetEntry reference system and correct for the common cases (column references, aggregates, compound expressions).
NULL Ordering Defaults (PostgreSQL-compatible)
| Direction | Default | Override |
|---|---|---|
ASC | NULLs LAST | NULLS FIRST |
DESC | NULLs FIRST | NULLS LAST |
```text
compare_sort_values(a, b, direction, nulls_override):
    nulls_first = explicit_nulls_order OR (DESC && no explicit)
    if a = NULL and b = NULL → Equal
    if a = NULL → Less if nulls_first, else Greater
    if b = NULL → Greater if nulls_first, else Less
    otherwise  → compare a and b, reverse if DESC
```
Non-NULL comparison delegates to eval(BinaryOp{Lt}, Literal(a), Literal(b))
via the expression evaluator, reusing all type coercion and promotion logic.
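A sketch of that rule with NULL modeled as `Option` and the non-NULL comparison reduced to `i64::cmp` (the real implementation delegates to the expression evaluator for coercion):

```rust
use std::cmp::Ordering;

/// Defaults match the table above: ASC → NULLs last, DESC → NULLs first;
/// an explicit NULLS FIRST/LAST overrides the default.
pub fn compare_sort_values(
    a: Option<i64>,
    b: Option<i64>,
    desc: bool,
    nulls_first_override: Option<bool>,
) -> Ordering {
    let nulls_first = nulls_first_override.unwrap_or(desc);
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => if nulls_first { Ordering::Less } else { Ordering::Greater },
        (Some(_), None) => if nulls_first { Ordering::Greater } else { Ordering::Less },
        // Non-NULL: natural order, reversed for DESC.
        (Some(x), Some(y)) => if desc { y.cmp(&x) } else { x.cmp(&y) },
    }
}
```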
Error Propagation from sort_by
Rust’s sort_by closure cannot return Result. AxiomDB uses the sort_err
pattern: errors are captured in Option<DbError> during the sort and returned
after it completes.
```rust
let mut sort_err: Option<DbError> = None;
rows.sort_by(|a, b| {
    match compare_rows_for_sort(a, b, order_items) {
        Ok(ord) => ord,
        Err(e) => {
            sort_err = Some(e);
            std::cmp::Ordering::Equal
        }
    }
});
if let Some(e) = sort_err {
    return Err(e);
}
```
DISTINCT — Deduplication
SELECT DISTINCT is applied after projection and before LIMIT/OFFSET, using
a HashSet<Vec<u8>> keyed by value_to_key_bytes.
```text
fn apply_distinct(rows: Vec<Row>) -> Vec<Row>:
    seen = HashSet::new()
    for row in rows:
        key = concat(value_to_key_bytes(v) for v in row)
        if seen.insert(key):   // first occurrence
            keep row
```
Two rows are identical if every column value serializes to the same bytes.
Critically, NULL → [0x00] means two NULLs are considered equal for
deduplication — only one row with a NULL in that position is kept. This is the
SQL standard behavior for DISTINCT, and is different from equality comparison
where NULL = NULL returns UNKNOWN.
LIMIT / OFFSET — Row-Count Coercion (Phase 4.10d)
apply_limit_offset runs after ORDER BY and DISTINCT. It calls
eval_row_count_as_usize for each row-count expression.
Row-count coercion contract
| Evaluated value | Result |
|---|---|
Int(n) where n ≥ 0 | n as usize |
BigInt(n) where n ≥ 0 | usize::try_from(n) — errors on overflow |
Text(s) where s.trim() parses as an exact base-10 integer ≥ 0 | parsed value as usize |
negative Int or BigInt | DbError::TypeMismatch |
non-integral Text ("10.1", "1e3", "abc") | DbError::TypeMismatch |
NULL, Bool, Real, Decimal, Date, Timestamp | DbError::TypeMismatch |
Text coercion is intentionally narrow: only exact base-10 integers are accepted. Scientific notation, decimal fractions, and time-like strings are all rejected.
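A sketch of that contract over a simplified `Value` (error type reduced to `String`, and the `BigInt`/`usize::try_from` overflow handling omitted):

```rust
// Reduced Value for illustration; the real function returns
// DbError::TypeMismatch and also covers BigInt, Bool, Real, etc.
pub enum Value {
    Int(i64),
    Text(String),
    Null,
}

pub fn eval_row_count_as_usize(v: &Value) -> Result<usize, String> {
    match v {
        Value::Int(n) if *n >= 0 => Ok(*n as usize),
        // Text: only exact, trimmed, non-negative base-10 integers.
        Value::Text(s) => s
            .trim()
            .parse::<u64>()
            .map(|n| n as usize)
            .map_err(|_| format!("TypeMismatch: {:?} is not a row count", s)),
        _ => Err("TypeMismatch: LIMIT/OFFSET needs a non-negative integer".into()),
    }
}
```

Rejecting everything that is not an exact integer keeps `LIMIT '10.1'` and `LIMIT '1e3'` failing with the same error class as `LIMIT -1`.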
Why Text is accepted
The prepared-statement SQL-string substitution path serializes a Value::Text("2")
parameter as LIMIT '2' in the generated SQL. Without Text coercion, the fallback
path would always fail for string-bound LIMIT parameters — which is the binding
type used by some MariaDB clients. Accepting exact integer Text keeps the
cached-AST prepared path and the SQL-string fallback path on identical semantics.
Why the existing coerce() function is not reused here:
coerce() uses assignment-coercion semantics and would change the
error class to InvalidCoercion, masking the semantic error.
eval_row_count_as_usize implements the narrower 4.10d contract
directly in the executor, keeping the error class and message family consistent
for both prepared paths.
INSERT … SELECT — MVCC Isolation
INSERT INTO target SELECT ... FROM source executes the SELECT phase under
the same snapshot as any other read in the transaction — fixed at BEGIN.
This prevents the “Halloween problem”: rows inserted by this INSERT have
txn_id_created = current_txn_id. The snapshot was taken before any insert
occurred, so snapshot_id ≤ current_txn_id. The MVCC visibility rule
(txn_id_created < snapshot_id) causes newly inserted rows to be invisible to
the SELECT scan. The result:
- If `source = target` (inserting from a table into itself): the SELECT sees exactly the rows that existed at `BEGIN`. The inserted copies are not re-scanned. No infinite loop.
- If another transaction inserts rows into `source` after this transaction's `BEGIN`: those rows are also invisible (consistent snapshot).
```text
Before BEGIN:  source = {row1, row2}
After BEGIN:   snapshot_id = 3 (max_committed = 2)

INSERT INTO source SELECT * FROM source:
  SELECT sees: {row1 (xmin=1), row2 (xmin=2)} — both have xmin < snapshot_id ✅
  Inserts:     row3 (xmin=3), row4 (xmin=3)   — xmin = current_txn_id = 3
  SELECT does NOT see row3 or row4 (xmin ≮ snapshot_id) ✅

After COMMIT:  source = {row1, row2, row3, row4} ← exactly 2 new rows, not infinite
```
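The trace can be replayed with the visibility predicate alone (ids and payloads as in the trace; the function names here are hypothetical):

```rust
/// The MVCC visibility rule from the text: a version is visible when it
/// was created before the snapshot was taken.
fn visible(xmin: u64, snapshot_id: u64) -> bool {
    xmin < snapshot_id
}

/// Replays INSERT INTO source SELECT * FROM source and returns the final
/// row count (4, not infinite).
pub fn replay_insert_select() -> usize {
    let snapshot_id = 3; // fixed at BEGIN
    let current_txn = 3;
    let mut source: Vec<(u64, &str)> = vec![(1, "row1"), (2, "row2")]; // (xmin, payload)

    // The SELECT phase only sees versions created before the snapshot,
    // so rows inserted by this same statement can never be re-scanned.
    let to_copy: Vec<&str> = source
        .iter()
        .filter(|(xmin, _)| visible(*xmin, snapshot_id))
        .map(|(_, payload)| *payload)
        .collect();
    for payload in to_copy {
        source.push((current_txn, payload));
    }
    source.len()
}
```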
Subquery Execution
Subquery execution is integrated into the expression evaluator via the
SubqueryRunner trait. This design allows the compiler to eliminate all subquery
dispatch overhead in the non-subquery path at zero runtime cost.
SubqueryRunner Trait
```rust
pub trait SubqueryRunner {
    fn eval_scalar(&mut self, subquery: &SelectStmt) -> Result<Value, DbError>;
    fn eval_in(&mut self, subquery: &SelectStmt, needle: &Value) -> Result<Value, DbError>;
    fn eval_exists(&mut self, subquery: &SelectStmt) -> Result<bool, DbError>;
}
```
All expression evaluation is dispatched through eval_with<R: SubqueryRunner>:
```rust
pub fn eval_with<R: SubqueryRunner>(
    expr: &Expr,
    row: &Row,
    runner: &mut R,
) -> Result<Value, DbError>
```
Two concrete implementations exist:
| Implementation | Purpose |
|---|---|
NoSubquery | Used for simple expressions with no subqueries. All three SubqueryRunner methods are unreachable!(). Monomorphization guarantees they are dead code. |
ExecSubqueryRunner<'a> | Used when the query contains at least one subquery. Holds mutable references to storage, the transaction manager, and the outer row for correlated access. |
SubqueryRunner as a generic trait parameter — rather than a runtime Option<&mut dyn FnMut> or a boolean flag — allows the compiler to generate two separate code paths: eval_with::<NoSubquery> and eval_with::<ExecSubqueryRunner>. In the NoSubquery path, every subquery branch is dead code and is eliminated by LLVM. A runtime option would add a pointer-width check plus a potential indirect call on every expression node evaluation, even for the 99% of expressions that have no subqueries.
Scalar Subquery Evaluation
ExecSubqueryRunner::eval_scalar executes the inner SelectStmt fully using
the existing execute_select path, then inspects the result:
```text
eval_scalar(subquery):
    result = execute_select(subquery, storage, txn)
    match result.rows.len():
        0     → Value::Null
        1     → result.rows[0][0]   // single column, single row
        n > 1 → Err(CardinalityViolation { returned: n })
```
The inner SELECT is always run with a fresh output context. It inherits the outer transaction snapshot so it sees the same consistent view as the outer query.
IN Subquery Evaluation
eval_in materializes the subquery result into a HashSet<Value>, then applies
three-valued logic:
```text
eval_in(subquery, needle):
    rows = execute_select(subquery)
    values: HashSet<Value> = rows.map(|r| r[0]).collect()
    if values.contains(needle):
        return Value::Bool(true)
    if values.contains(Value::Null):
        return Value::Null        // unknown — could match
    return Value::Bool(false)
```
For NOT IN, the calling code wraps the result: TRUE → FALSE, FALSE → TRUE,
NULL → NULL (NULL propagates unchanged).
EXISTS Evaluation
eval_exists executes the subquery and checks whether the result set is non-empty.
No rows are materialized beyond the first:
```text
eval_exists(subquery):
    rows = execute_select(subquery)
    return !rows.is_empty()   // always bool, never null
```
Correlated Subqueries — substitute_outer
Before executing a correlated subquery, ExecSubqueryRunner walks the subquery
AST and replaces every Expr::OuterColumn { col_idx, depth: 1 } with a concrete
Expr::Literal(value) from the current outer row. This operation is called
substitute_outer:
substitute_outer(expr_tree, outer_row):
for each node in expr_tree:
if node = OuterColumn { col_idx, depth: 1 }:
replace with Literal(outer_row[col_idx])
if node = OuterColumn { col_idx, depth: d > 1 }:
decrement depth by 1 // pass through for deeper nesting
After substitution, the subquery is a fully self-contained statement with no
outer references, and it is executed by the standard execute_select path.
Re-execution happens once per outer row: for a correlated EXISTS in a query
that produces 10,000 outer rows, the inner query is executed 10,000 times.
For large datasets, rewriting as a JOIN is recommended.
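The substitution walk can be sketched over a toy `Expr` enum; the real AST has many more variants, but the traversal shape is the same:

```rust
/// Minimal expression tree for sketching substitute_outer.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Literal(i64),
    OuterColumn { col_idx: usize, depth: u32 },
    Eq(Box<Expr>, Box<Expr>),
}

/// Depth-1 outer references become literals from the current outer row;
/// deeper references lose one level of depth and pass through.
pub fn substitute_outer(expr: &mut Expr, outer_row: &[i64]) {
    let replacement = match expr {
        Expr::OuterColumn { col_idx, depth: 1 } => {
            Some(Expr::Literal(outer_row[*col_idx]))
        }
        Expr::OuterColumn { depth, .. } => {
            *depth -= 1;
            None
        }
        Expr::Eq(l, r) => {
            substitute_outer(l, outer_row);
            substitute_outer(r, outer_row);
            None
        }
        Expr::Literal(_) => None,
    };
    if let Some(new_expr) = replacement {
        *expr = new_expr;
    }
}
```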
Derived Table Execution
A derived table (FROM (SELECT ...) AS alias) is materialized once at the
start of query execution, before any scan or filter of the outer query begins:
execute_select(stmt):
for each TableRef::Derived { subquery, alias } in stmt.from:
materialized[alias] = execute_select(subquery) // fully materialized in memory
// outer query scans materialized[alias] as if it were a base table
The materialized result is an in-memory Vec<Row> wrapped in a
MaterializedTable. The outer query uses the derived table’s output schema
(column names from the inner SELECT list) for column resolution.
Derived tables are not correlated — they cannot reference columns from the outer
query. Lateral joins (which allow correlation in FROM) are not yet supported.
Foreign Key Enforcement
FK constraints are validated during DML operations by crates/axiomdb-sql/src/fk_enforcement.rs.
Catalog Storage
Each FK is stored as a FkDef row in the axiom_foreign_keys heap (5th system table,
root page at meta offset 84). Fields:
fk_id, child_table_id, child_col_idx, parent_table_id, parent_col_idx,
on_delete: FkAction, on_update: FkAction, fk_index_id: u32, name: String
FkAction encoding: 0=NoAction, 1=Restrict, 2=Cascade, 3=SetNull, 4=SetDefault.
fk_index_id != 0 → FK auto-index exists (composite key, Phase 6.9).
fk_index_id = 0 → no auto-index; enforcement falls back to full table scan.
FK auto-index — composite key (fk_val | RecordId) (Phase 6.9)
Each FK constraint auto-creates a B-Tree index on the child FK column using a composite key format that makes every entry globally unique:
key = encode_index_key(&[fk_val]) ++ encode_rid(rid) (10 bytes RecordId suffix)
This follows InnoDB’s approach of appending the primary key as a tiebreaker
(row0row.cc). Every entry is unique even when many rows share the same FK value.
Range scan for all children with a given parent key:
#![allow(unused)]
fn main() {
lo = encode_index_key(&[parent_key]) ++ [0x00; 10] // smallest RecordId
hi = encode_index_key(&[parent_key]) ++ [0xFF; 10] // largest RecordId
children = BTree::range_in(fk_index_root, lo, hi) // O(log n + k)
}
INSERT / UPDATE child — check_fk_child_insert
For each FK on the child table:
1. FK column is NULL → skip (MATCH SIMPLE)
2. Encode FK value as B-Tree key
3. Find parent's PK or UNIQUE index covering parent_col_idx
4. Bloom shortcut: if filter says absent → ForeignKeyViolation immediately
5. BTree::lookup_in(parent_index_root, key) — O(log n)
6. No match → ForeignKeyViolation (SQLSTATE 23503)
PK indexes are populated on every INSERT since Phase 6.9 (removed !is_primary
filter in insert_into_indexes). All index types now use B-Tree + Bloom lookup.
DELETE parent — enforce_fk_on_parent_delete
Called before the parent rows are deleted. For each FK referencing this table:
| Action | Behavior |
|---|---|
| RESTRICT / NO ACTION | BTree::range_in(fk_index) — O(log n); error if any child found |
| CASCADE | Range scan finds all children; recursive delete (depth limit = 10) |
| SET NULL | Range scan finds all children; updates FK column to NULL |
Cascade recursion uses depth parameter — exceeding 10 levels returns
ForeignKeyCascadeDepth (SQLSTATE 23503).
Query Planner Cost Gate (Phase 6.10)
Before returning IndexLookup or IndexRange, plan_select applies a cost gate
using per-column statistics to decide if the index scan is worth the overhead.
Algorithm
ndv = stats.ndv > 0 ? stats.ndv : DEFAULT_NUM_DISTINCT (= 200)
selectivity = 1.0 / ndv // equality predicate: 1/ndv rows match
if selectivity > 0.20:
return Scan // too many rows — full scan is cheaper
if stats.row_count < 1,000:
return Scan // tiny table — index overhead not worth it
return IndexLookup / IndexRange // selective enough — use index
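The gate is small enough to sketch in full. The two named constants match the prose; `MIN_ROWS_FOR_INDEX` and the stats struct are illustrative names:

```rust
const INDEX_SELECTIVITY_THRESHOLD: f64 = 0.20;
const DEFAULT_NUM_DISTINCT: u64 = 200;
const MIN_ROWS_FOR_INDEX: u64 = 1_000; // illustrative name for the 1,000-row cutoff

pub struct ColumnStats {
    pub ndv: u64,       // number of distinct values; 0 = unknown
    pub row_count: u64,
}

#[derive(Debug, PartialEq)]
pub enum Access {
    Scan,
    Index, // stands in for IndexLookup / IndexRange
}

/// Equality-predicate cost gate: estimate selectivity as 1/ndv, then
/// reject the index when the predicate is unselective or the table tiny.
pub fn cost_gate(stats: &ColumnStats) -> Access {
    let ndv = if stats.ndv > 0 { stats.ndv } else { DEFAULT_NUM_DISTINCT };
    let selectivity = 1.0 / ndv as f64;
    if selectivity > INDEX_SELECTIVITY_THRESHOLD {
        return Access::Scan; // too many matching rows — full scan is cheaper
    }
    if stats.row_count < MIN_ROWS_FOR_INDEX {
        return Access::Scan; // tiny table — index overhead not worth it
    }
    Access::Index
}
```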
Constants derived from PostgreSQL:
- INDEX_SELECTIVITY_THRESHOLD = 0.20 (PG default seq/random_page_cost = 0.25; AxiomDB is slightly more conservative for embedded storage)
- DEFAULT_NUM_DISTINCT = 200 (PG DEFAULT_NUM_DISTINCT in selfuncs.c)
Stats are loaded once per SELECT
In execute_select_ctx, before calling plan_select:
#![allow(unused)]
fn main() {
let table_stats = CatalogReader::new(storage, snap)?.list_stats(table_id)?;
let access_method = plan_select(where_clause, indexes, columns, table_id,
&table_stats, &mut ctx.stats);
}
If table_stats is empty (pre-6.10 database or ANALYZE never run),
plan_select conservatively uses the index — never wrong, just possibly suboptimal.
Staleness (StaleStatsTracker)
StaleStatsTracker lives in SessionContext and tracks row changes per table:
INSERT / DELETE row → on_row_changed(table_id)
changes > 20% of baseline → mark stale
planner loads stats → set_baseline(table_id, row_count)
ANALYZE TABLE → mark_fresh(table_id)
When stale, the planner uses ndv = DEFAULT_NUM_DISTINCT = 200 regardless of
catalog stats, preventing stale low-NDV estimates from causing full scans on
high-selectivity columns.
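A sketch of the tracker under the 20% rule; the method names follow the prose, while the internal maps are illustrative:

```rust
use std::collections::HashMap;

/// Per-table staleness tracking: baseline row count at last stats load,
/// a change counter, and a stale flag tripped at >20% drift.
#[derive(Default)]
pub struct StaleStatsTracker {
    baseline: HashMap<u32, u64>,
    changes: HashMap<u32, u64>,
    stale: HashMap<u32, bool>,
}

impl StaleStatsTracker {
    /// Planner loads stats → record the baseline and reset the counter.
    pub fn set_baseline(&mut self, table_id: u32, row_count: u64) {
        self.baseline.insert(table_id, row_count);
        self.changes.insert(table_id, 0);
    }

    /// INSERT / DELETE row → bump the counter; mark stale past 20%.
    pub fn on_row_changed(&mut self, table_id: u32) {
        let c = self.changes.entry(table_id).or_insert(0);
        *c += 1;
        let base = *self.baseline.get(&table_id).unwrap_or(&0);
        if *c * 5 > base {
            self.stale.insert(table_id, true); // changes > 20% of baseline
        }
    }

    /// ANALYZE TABLE → fresh again.
    pub fn mark_fresh(&mut self, table_id: u32) {
        self.stale.insert(table_id, false);
        self.changes.insert(table_id, 0);
    }

    pub fn is_stale(&self, table_id: u32) -> bool {
        *self.stale.get(&table_id).unwrap_or(&false)
    }
}
```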
Bloom Filter — Index Lookup Shortcut
The executor holds a BloomRegistry (one per database connection) that maps
index_id → Bloom<Vec<u8>>. Before performing any B-Tree lookup for an index
equality predicate, the executor consults the filter:
#![allow(unused)]
fn main() {
// In execute_select_ctx — IndexLookup path
if !bloom.might_exist(index_def.index_id, &encoded_key) {
// Definite absence: skip B-Tree entirely.
return Ok(vec![]);
}
// False positive or true positive: proceed with B-Tree.
BTree::lookup_in(storage, index_def.root_page_id, &encoded_key)?
}
BloomRegistry API
#![allow(unused)]
fn main() {
pub struct BloomRegistry { /* per-index filters */ }
impl BloomRegistry {
pub fn create(&mut self, index_id: u32, expected_items: usize);
pub fn add(&mut self, index_id: u32, key: &[u8]);
pub fn might_exist(&self, index_id: u32, key: &[u8]) -> bool;
pub fn mark_dirty(&mut self, index_id: u32);
pub fn remove(&mut self, index_id: u32);
}
}
might_exist returns true (conservative) for unknown index_ids — correct
behavior for indexes that existed before the current server session (no filter
was populated for them at startup).
DML Integration
Every DML handler in the execute_with_ctx path updates the registry:
| Handler | Bloom action |
|---|---|
execute_insert_ctx | bloom.add(index_id, &key) after each B-Tree insert |
execute_update_ctx | mark_dirty() for delete side (batch); add() for insert side |
execute_delete_ctx | mark_dirty(index_id) once per index batch (5.19) |
execute_create_index | create(index_id, n) then add() for every existing key |
execute_drop_index | remove(index_id) |
Memory Budget
Each filter is sized at max(2 × expected_items, 1000) with a 1% FPR target
(~9.6 bits/key, 7 hash functions). For a 1M-row table with one secondary index:
2M × 9.6 bits ≈ 2.4 MB.
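The quoted numbers can be cross-checked from the standard Bloom filter formulas: for a target false-positive rate p, the optimal bits per key is -ln(p) / ln(2)^2 and the hash count is bits-per-key × ln(2). At p = 1%, that is ~9.6 bits/key and 7 hashes, and 2M keys land at ~2.4 MB:

```rust
/// Standard Bloom filter sizing for a target false-positive rate.
pub fn bloom_params(fpr: f64) -> (f64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let bits_per_key = -fpr.ln() / (ln2 * ln2); // optimal m/n
    let hashes = (bits_per_key * ln2).round() as u32; // optimal k
    (bits_per_key, hashes)
}

/// Filter size in megabytes for `keys` entries at `bits_per_key`.
pub fn filter_mb(keys: u64, bits_per_key: f64) -> f64 {
    keys as f64 * bits_per_key / 8.0 / 1_000_000.0
}
```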
Stale filters are rebuilt by ANALYZE TABLE (Phase 6.12), mirroring
PostgreSQL's lazy statistics-rebuild model.
IndexOnlyScan — Heap-Free Execution
When plan_select returns AccessMethod::IndexOnlyScan, the executor reads
all result values directly from the B-Tree key bytes, with only a lightweight
MVCC visibility check against the heap slot header.
This section applies to the heap executor path. Since 39.15, clustered tables
do not execute this path directly even if the planner initially detects a
covering opportunity. Clustered covering plans are normalized back to
clustered-aware lookup/range access, because clustered visibility lives in the
inline row header and clustered secondary indexes carry PK bookmarks instead of
stable heap RecordIds.
Clustered UPDATE (39.16)
Clustered tables no longer fall back to heap-era UPDATE logic. The executor now
routes explicit-PRIMARY KEY tables through clustered candidate discovery and
clustered rewrite primitives:
- discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
- capture the exact old clustered row image (RowHeader + full logical row bytes) before any mutation
- choose one of three clustered write paths:
  - same-key in-place rewrite via clustered_tree::update_in_place(...)
  - same-key relocation via clustered_tree::update_with_relocation(...)
  - PK change via delete_mark(old_pk) + insert(new_pk, ...)
- rewrite clustered secondary bookmark entries and register both index-insert and index-delete undo records so rollback can restore the old bookmark state
Clustered DELETE (39.17)
Clustered tables no longer fall back to heap-era DELETE logic either. The
executor now routes explicit-PRIMARY KEY tables through clustered candidate
discovery and clustered delete-mark primitives:
- discover candidates through the clustered access planner: PK lookup, PK range, secondary bookmark probe, or full clustered scan
- decode the exact current clustered row image before any mutation
- enforce parent-side foreign-key restrictions before the first delete-mark
- call clustered_tree::delete_mark(...) for each matched primary key
- record EntryType::ClusteredDeleteMark with the exact old and new row images so rollback/savepoints restore the original header and payload bytes
- leave clustered secondary bookmark entries in place for deferred cleanup during later clustered VACUUM work
Clustered VACUUM (39.18)
Clustered tables now have their own executor-visible maintenance path too.
VACUUM table_name dispatches by table storage layout:
- compute oldest_safe_txn = max_committed + 1
- descend once to the leftmost clustered leaf
- walk the next_leaf chain and remove cells whose txn_id_deleted is safe
- free any overflow chain owned by each purged cell
- defragment the leaf when freeblock waste exceeds the page-local threshold
- scan each clustered secondary index, decode the PK bookmark from the physical secondary key, and keep only entries whose clustered row still exists physically after the leaf purge
- persist any secondary root rotation caused by bulk delete back into the catalog
Execution Path
IndexOnlyScan { index_def, lo, hi, n_key_cols, needed_key_positions }:
for (rid, key_bytes) in BTree::range_in(storage, index_def.root_page_id, lo, hi):
page_id = rid.page_id
slot_id = rid.slot_id
// MVCC: read only the 24-byte RowHeader — no full row decode.
visible = HeapChain::is_slot_visible(storage, page_id, slot_id, snap)
if !visible:
continue
// Extract column values from B-Tree key bytes (no heap page needed).
(decoded_cols, _) = decode_index_key(&key_bytes, n_key_cols)
// Project only the columns the query requested.
row = needed_key_positions.iter().map(|&p| decoded_cols[p].clone()).collect()
emit row
The 24-byte RowHeader contains txn_id_created, txn_id_deleted, and a
sequence number — enough for full MVCC visibility evaluation without loading
the row payload.
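A simplified sketch of header-only visibility. The real check also consults the snapshot's set of in-flight transactions; here `snapshot_max` is a hypothetical high-water mark standing in for the full snapshot struct:

```rust
/// Trimmed-down RowHeader: only the MVCC fields used here.
pub struct RowHeader {
    pub txn_id_created: u64,
    pub txn_id_deleted: u64, // 0 = never deleted
}

/// A row is visible when its creator is visible to the snapshot and
/// no deleter visible to the same snapshot has removed it.
/// (Sketch only — real rules also check the active-transaction set.)
pub fn is_visible(h: &RowHeader, snapshot_max: u64) -> bool {
    let created_visible = h.txn_id_created <= snapshot_max;
    let deleted_visible = h.txn_id_deleted != 0 && h.txn_id_deleted <= snapshot_max;
    created_visible && !deleted_visible
}
```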
decode_index_key — Self-Delimiting Key Decoder
decode_index_key lives in key_encoding.rs and is the exact inverse of
encode_index_key. It uses type tags embedded in the key bytes to self-delimit
each value without needing an external schema:
| Tag byte | Type | Encoding |
|---|---|---|
0x00 | NULL | tag only, 0 payload bytes |
0x01 | Bool | tag + 1 byte (0 = false, 1 = true) |
0x02 | Int (positive, 1 B) | tag + 1 LE byte |
0x03 | Int (positive, 2 B) | tag + 2 LE bytes |
0x04 | Int (positive, 4 B) | tag + 4 LE bytes |
0x05 | Int (negative, 4 B) | tag + 4 LE bytes (i32) |
0x06 | BigInt (positive, 1 B) | tag + 1 byte |
0x07 | BigInt (positive, 4 B) | tag + 4 LE bytes |
0x08 | BigInt (positive, 8 B) | tag + 8 LE bytes |
0x09 | BigInt (negative, 8 B) | tag + 8 LE bytes (i64) |
0x0A | Real | tag + 8 LE bytes (f64 bits) |
0x0B | Text | tag + NUL-terminated UTF-8 (NUL = end marker) |
0x0C | Bytes | tag + NUL-escaped bytes (0x00 → [0x00, 0xFF], NUL terminator = [0x00, 0x00]) |
#![allow(unused)]
fn main() {
// Signature
pub fn decode_index_key(key: &[u8], n_cols: usize) -> (Vec<Value>, usize)
// Returns: (decoded column values, total bytes consumed)
}
The self-delimiting format means decode_index_key requires no column type
metadata — the tag bytes carry all necessary type information. This is the
same approach used by SQLite’s record format and RocksDB’s comparator-encoded
keys.
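The tag-0x0C byte escaping round-trips like this; a sketch of the scheme described in the table above, not the production decoder:

```rust
/// Interior 0x00 bytes become [0x00, 0xFF] so the [0x00, 0x00]
/// terminator stays unambiguous.
pub fn escape_bytes(raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(raw.len() + 2);
    for &b in raw {
        if b == 0x00 {
            out.extend_from_slice(&[0x00, 0xFF]);
        } else {
            out.push(b);
        }
    }
    out.extend_from_slice(&[0x00, 0x00]); // terminator
    out
}

/// Inverse: returns (decoded bytes, encoded bytes consumed).
/// Assumes well-formed input ending in the [0x00, 0x00] terminator.
pub fn unescape_bytes(encoded: &[u8]) -> (Vec<u8>, usize) {
    let mut out = Vec::new();
    let mut i = 0;
    loop {
        match (encoded[i], encoded[i + 1]) {
            (0x00, 0x00) => return (out, i + 2), // terminator found
            (0x00, 0xFF) => {
                out.push(0x00); // escaped interior NUL
                i += 2;
            }
            (b, _) => {
                out.push(b);
                i += 1;
            }
        }
    }
}
```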
Full-Width Row Layout in IndexOnlyScan Output
IndexOnlyScan emits full-width rows — the same width as a heap row — with
index key column values placed at their table col_idx positions and NULL
everywhere else. This is required because downstream operators (WHERE
re-evaluation, projection, expression evaluator) all address columns by their
original table column index, not by SELECT output position.
table: (id INT [0], name TEXT [1], age INT [2], dept TEXT [3])
index: ON (age, dept) ← covers col_idx 2 and 3
IndexOnlyScan emits: [NULL, NULL, <age_val>, <dept_val>]
col0 col1 col2 col3
If the executor placed decoded values at positions 0, 1, ... instead, a
WHERE age > 25 re-evaluation would read col_idx=2 from a 2-element row and
panic with ColumnIndexOutOfBounds. The full-width layout eliminates this class
of error entirely.
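The scatter step can be sketched with integer-only values; `index_col_idxs` maps positions within the index key back to table `col_idx` slots, and `None` stands in for NULL:

```rust
/// Materialize a full-width row from decoded index key values:
/// key values land at their table col_idx positions, NULL elsewhere.
pub fn scatter_full_width(
    table_width: usize,
    index_col_idxs: &[usize],       // e.g. [2, 3] for an index ON (age, dept)
    decoded_cols: &[i64],           // values decoded from the B-Tree key
    needed_key_positions: &[usize], // positions within the index key
) -> Vec<Option<i64>> {
    let mut row = vec![None; table_width];
    for &p in needed_key_positions {
        row[index_col_idxs[p]] = Some(decoded_cols[p]);
    }
    row
}
```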
execute_with_ctx — Required for IndexOnlyScan Selection
The planner selects IndexOnlyScan only when select_col_idxs (the set of
columns touched by the query) is a subset of the index’s key columns. The
select_col_idxs argument is supplied by execute_with_ctx; the simpler
execute entry-point passes an empty slice, so IndexOnlyScan is never selected
through it.
Test coverage for this path lives in
crates/axiomdb-sql/tests/integration_index_only.rs — functions prefixed
test_ctx_ use execute_with_ctx with real select_col_idxs and are the
only tests that exercise the IndexOnlyScan access method end-to-end.
Non-Unique Secondary Index Key Format
Non-unique secondary indexes append a 10-byte RecordId suffix to every
B-Tree key to guarantee uniqueness across all entries:
key = encode_index_key(col_vals) || encode_rid(rid)
^^^^^^^^^^^^^^
page_id (4 B) + slot_id (2 B) + seq (4 B) = 10 bytes
This prevents DuplicateKey errors when two rows share the same indexed value,
because the RecordId suffix always makes the full key distinct.
Lookup Bounds for Non-Unique Indexes
To find all rows matching a specific indexed value, the executor performs a
range scan using synthetic [lo, hi] bounds that span all possible RecordId
suffixes:
#![allow(unused)]
fn main() {
lo = encode_index_key(&[val]) + [0x00; 10] // smallest RecordId
hi = encode_index_key(&[val]) + [0xFF; 10] // largest RecordId
BTree::range_in(root, lo, hi) // returns all entries for val
}
This mirrors InnoDB's tiebreaker approach (row0row.cc), but AxiomDB uses the
RecordId (page_id + slot_id + sequence) instead of a separate
primary key column, keeping the suffix at a fixed 10 bytes regardless of the
table's key type — simpler to encode and guaranteed to be globally unique within
the storage engine's address space.
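Because B-Tree keys compare as plain bytes, the synthetic all-0x00 and all-0xFF suffixes bracket every real entry for a value. A quick sketch of the bound construction:

```rust
/// Build the [lo, hi] bounds for a non-unique index lookup:
/// value key bytes plus the smallest / largest 10-byte RecordId suffix.
pub fn bounds_for(value_key: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut lo = value_key.to_vec();
    lo.extend_from_slice(&[0x00; 10]);
    let mut hi = value_key.to_vec();
    hi.extend_from_slice(&[0xFF; 10]);
    (lo, hi)
}
```

Lexicographic byte order on `Vec<u8>` is exactly the B-Tree's comparison, so any real entry (value key + real RecordId suffix) sorts inside the bounds.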
Performance Characteristics
| Operation | Time complexity | Notes |
|---|---|---|
| Table scan | O(n) | HeapChain linear traversal |
| Nested loop JOIN | O(n × m) | Both sides materialized before loop |
| Hash GROUP BY | O(n) | One pass; O(k) memory where k = distinct groups |
| Sorted GROUP BY | O(n) | One pass; O(1) accumulator memory per group |
| Sort ORDER BY | O(n log n) | sort_by (stable, in-memory) |
| DISTINCT | O(n) | One HashSet pass |
| LIMIT/OFFSET | O(1) after sort | skip(offset).take(limit) |
All operations are in-memory for Phase 4. External sort and hash spill for large datasets are planned for Phase 14 (vectorized execution).
AUTO_INCREMENT Execution
Per-Table Sequence State
Each table that has an AUTO_INCREMENT column maintains a sequence counter.
The counter is stored as a thread-local HashMap<String, i64> keyed by table
name, lazily initialized on the first INSERT:
auto_increment_next(table_name):
if table_name not in thread_local_map:
max_existing = MAX(id) from HeapChain scan, or 0 if table is empty
thread_local_map[table_name] = max_existing + 1
value = thread_local_map[table_name]
thread_local_map[table_name] += 1
return value
The MAX+1 lazy-init strategy means the sequence is always consistent with
existing data, even after rows are inserted by a previous session or after
crash recovery; re-deriving from MAX+1 yields a valid next value in either case.
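The counter logic, sketched with an injected MAX(id) scan (the real map is a thread-local keyed by table name, and the scan walks the HeapChain):

```rust
use std::collections::HashMap;

/// Per-table AUTO_INCREMENT sequence with MAX+1 lazy initialization.
pub struct AutoIncrement {
    next: HashMap<String, i64>,
}

impl AutoIncrement {
    pub fn new() -> Self {
        Self { next: HashMap::new() }
    }

    /// max_existing stands in for the MAX(id) HeapChain scan; it only
    /// runs on the first INSERT after startup (or after a reset).
    pub fn next_value(&mut self, table: &str, max_existing: impl Fn() -> i64) -> i64 {
        let slot = self
            .next
            .entry(table.to_string())
            .or_insert_with(|| max_existing() + 1);
        let value = *slot;
        *slot += 1;
        value
    }

    /// TRUNCATE removes the entry so the next INSERT re-derives MAX+1.
    pub fn reset(&mut self, table: &str) {
        self.next.remove(table);
    }
}
```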
Explicit Value Bypass
When the INSERT column list includes the AUTO_INCREMENT column with a non-NULL value, the explicit value is used directly and the sequence counter is not advanced:
for each row to insert:
if auto_increment_col in provided_columns:
value = provided value // bypass — no counter update
else:
value = auto_increment_next(table_name)
session.last_insert_id = value // update only for generated IDs
LAST_INSERT_ID() is updated only when a value is auto-generated. Inserting
an explicit ID does not change the session’s last_insert_id.
Multi-Row INSERT
For INSERT INTO t VALUES (...), (...), ..., the executor calls
auto_increment_next once per row. last_insert_id is set to the value
generated for the first row before iterating through the rest:
ids = [auto_increment_next(t) for _ in rows]
session.last_insert_id = ids[0] // MySQL semantics
insert all rows with their respective ids
TRUNCATE — Sequence Reset
TRUNCATE TABLE t deletes all rows by scanning the HeapChain and marking
every visible row as deleted (same algorithm as DELETE FROM t without a
WHERE clause). After clearing the rows, it resets the sequence:
execute_truncate(table_name):
for row in HeapChain::scan_visible(table_name, snapshot):
storage.delete_row(row.record_id, txn_id)
thread_local_map.remove(table_name) // next insert re-initializes from MAX+1 = 1
return QueryResult::Affected { count: 0 }
Removing the entry from the map forces a MAX+1 re-initialization on the next
INSERT. Because the table is now empty, MAX = 0, so next = 1.
SHOW TABLES / SHOW COLUMNS
SHOW TABLES
SHOW TABLES [FROM schema] reads the catalog’s table registry and returns one
row per table. The output column is named Tables_in_<schema>:
execute_show_tables(schema):
tables = catalog.list_tables(schema)
column_name = "Tables_in_" + schema
return QueryResult::Rows { columns: [column_name], rows: [[t] for t in tables] }
SHOW COLUMNS / DESCRIBE
SHOW COLUMNS FROM t, DESCRIBE t, and DESC t are all dispatched to the
same handler. The executor reads the column definitions from the catalog and
constructs a fixed six-column result set:
execute_show_columns(table_name):
cols = catalog.get_table(table_name).columns
for col in cols:
Field = col.name
Type = col.data_type.to_sql_string()
Null = if col.nullable { "YES" } else { "NO" }
Key = if col.is_primary_key { "PRI" } else { "" }
Default = "NULL" // stub
Extra = if col.auto_increment { "auto_increment" } else { "" }
return six-column result set
The Key and Default fields are stubs: Key only reflects primary key
membership; composite keys, unique constraints, and foreign keys are not yet
surfaced. Default always shows "NULL" regardless of the declared default
expression. Full metadata exposure is planned for a later catalog enhancement.
ALTER TABLE Execution
ALTER TABLE dispatches to one of five handlers depending on the operation. Three of them (ADD COLUMN, DROP COLUMN, and MODIFY COLUMN) require rewriting every row in the table. The other two (RENAME COLUMN and RENAME TO) touch only the catalog.
Why Row Rewriting Is Needed
AxiomDB rows are stored as positional binary blobs. The null bitmap at the
start of each row has exactly ceil(column_count / 8) bytes — one bit per
column, in column-index order. Packed values follow immediately, with offsets
derived from the column types declared at write time.
Row layout (schema: id BIGINT, name TEXT, age INT):
null_bitmap (1 byte) [b0=id_null, b1=name_null, b2=age_null, ...]
id (8 bytes, LE i64) [only present if b0=0]
name (4-byte len + UTF-8 bytes) [only present if b1=0]
age (4 bytes, LE i32) [only present if b2=0]
When the column count changes, the null bitmap size changes and all subsequent offsets shift. A row written under the old schema cannot be decoded against the new schema — the null bitmap has the wrong number of bits, and value positions no longer align. Every row must therefore be rewritten to match the new layout.
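A sketch of the layout math with all-BIGINT columns (real rows mix fixed- and variable-width types); a set bit in the bitmap marks a NULL column, and NULL columns contribute no payload bytes:

```rust
/// ceil(column_count / 8) bytes — one bit per column.
pub fn bitmap_bytes(column_count: usize) -> usize {
    (column_count + 7) / 8
}

/// Encode a row as null bitmap + packed little-endian values.
/// All columns are modeled as i64 for simplicity.
pub fn encode_row(values: &[Option<i64>]) -> Vec<u8> {
    let mut out = vec![0u8; bitmap_bytes(values.len())];
    for (i, v) in values.iter().enumerate() {
        if v.is_none() {
            out[i / 8] |= 1 << (i % 8); // bit set = NULL, column-index order
        }
    }
    for v in values.iter().flatten() {
        out.extend_from_slice(&v.to_le_bytes()); // only non-NULL values packed
    }
    out
}
```

Adding or dropping a column changes `bitmap_bytes` and shifts every payload offset, which is exactly why old-layout rows cannot be decoded against the new schema.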
RENAME COLUMN does not change column positions or types — only the name entry
in the catalog changes. RENAME TO changes only the table name in the catalog.
Neither operation touches row data.
rewrite_rows Helper
ADD COLUMN, DROP COLUMN, and MODIFY COLUMN all use a shared rewrite_rows
dispatch. The implementation branches on storage format:
Heap tables:
rewrite_rows(table_name, old_schema, new_schema, transform_fn):
snapshot = txn.active_snapshot()
old_rows = HeapChain::scan_visible(table_name, snapshot)
for (record_id, old_row) in old_rows:
new_row = transform_fn(old_row)? // apply per-operation transformation
storage.delete_row(record_id, txn_id)
storage.insert_row(table_name, encode_row(new_row, new_schema), txn_id)
Clustered tables (rewrite_rows_clustered):
Clustered tables cannot use heap delete+reinsert because clustered_tree::insert
rejects duplicate primary keys even when the previous row is delete-marked. Instead,
each row is rewritten in place using update_with_relocation:
rewrite_rows_clustered(table_id, old_schema, new_schema, transform_fn):
snapshot = txn.active_snapshot()
rows = clustered_tree::range(table_id, Unbounded, Unbounded, snapshot)
for ClusteredRow { key, row_header, row_data } in rows:
old_row = decode_row(row_data, old_schema)
new_row = transform_fn(old_row)?
new_data = encode_row(new_row, new_schema)
txn.record_clustered_update(table_id, key, row_header+row_data, new_data)
new_root = clustered_tree::update_with_relocation(key, new_data)
if let Some(new_root_pid) = new_root {
catalog.set_root_page(table_id, new_root_pid)
}
update_with_relocation tries an in-place rewrite of the leaf slot. If the new
row is larger and the leaf page is full, it falls back to physical delete + reinsert
at the correct leaf position (no duplicate-key issue because the old entry is
physically removed before the new one is inserted).
The transform_fn is operation-specific and returns Result<Row, DbError> so
coercion failures abort the entire statement:
| Operation | transform_fn |
|---|---|
| ADD COLUMN | Append DEFAULT value (or NULL if no default) to the end of the row |
| DROP COLUMN | Remove the value at col_idx from the row vector |
| MODIFY COLUMN | Replace value at col_idx with coerce(value, new_type, Strict)? |
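The three transforms can be sketched as closures over a simplified all-integer `Row`, with coercion reduced to a fallible function so the abort-on-error behavior is visible:

```rust
type Row = Vec<Option<i64>>;

/// ADD COLUMN: append the DEFAULT value (or NULL) to the end of the row.
pub fn add_column(default: Option<i64>) -> impl Fn(Row) -> Result<Row, String> {
    move |mut row| {
        row.push(default);
        Ok(row)
    }
}

/// DROP COLUMN: remove the value at col_idx.
pub fn drop_column(col_idx: usize) -> impl Fn(Row) -> Result<Row, String> {
    move |mut row| {
        row.remove(col_idx);
        Ok(row)
    }
}

/// MODIFY COLUMN: replace the value with its strict coercion; any
/// failure bubbles up and aborts the whole statement.
pub fn modify_column(
    col_idx: usize,
    coerce: fn(i64) -> Result<i64, String>,
) -> impl Fn(Row) -> Result<Row, String> {
    move |mut row| {
        if let Some(v) = row[col_idx] {
            row[col_idx] = Some(coerce(v)?);
        }
        Ok(row)
    }
}
```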
Ordering Constraint — Catalog Before vs. After Rewrite
The ordering of the catalog update relative to the row rewrite is not arbitrary. It is chosen so that a failure mid-rewrite leaves the database in a recoverable state:
ADD COLUMN — catalog update FIRST, then rewrite rows:
1. catalog.add_column(table_name, new_column_def)
2. rewrite_rows(old_schema → new_schema, append DEFAULT)
If the process crashes after step 1 but before step 2 completes, the catalog already reflects the new schema. The partially-rewritten rows are discarded by crash recovery (their transactions are uncommitted). On restart, the table is consistent: the new column exists in the catalog, and all rows either have been fully rewritten (if the transaction committed) or none have been (if it was rolled back).
DROP COLUMN — rewrite rows FIRST, then update catalog:
1. rewrite_rows(old_schema → new_schema, remove col at col_idx)
2. catalog.remove_column(table_name, col_idx)
If the process crashes after step 1 but before step 2, the rows have already been written in the new (narrower) layout but the catalog still shows the old schema. Recovery rolls back the uncommitted row rewrites and the catalog is never touched — the table is fully consistent under the old schema.
MODIFY COLUMN — rewrite rows FIRST (with strict coercion), then update catalog:
1. Guard: column not in secondary index (type change would break key encoding)
2. Guard: PK column cannot become nullable on clustered table
3. rewrite_rows(old_schema → new_schema, coerce(val, new_type, Strict)?)
4. catalog.delete_column(table_id, col_idx)
5. catalog.create_column(new_ColumnDef) // same col_idx, new type/nullable
If coercion fails for any row (e.g. TEXT → INT on a non-numeric value), the
error is returned immediately and no rows are changed. The statement is atomic:
either all rows are coerced successfully or none are.
The invariant is: the catalog always describes rows that can be decoded. Swapping the order for either operation would create a window where the catalog describes a schema that does not match the on-disk rows.
Session Cache Invalidation
The session holds a SchemaCache that maps table names to their column
definitions at the time the last query was prepared. After any ALTER TABLE
operation completes, the cache entry for the affected table is invalidated:
execute_alter_table(stmt):
// ... perform operation (catalog update + optional row rewrite) ...
session.schema_cache.invalidate(table_name)
This ensures that the next query against the altered table re-reads the catalog and sees the updated column list, rather than operating on a stale schema that may reference columns that no longer exist or omit newly added ones.
Index root invalidation on B+tree split
The SchemaCache also stores IndexDef.root_page_id for each index. When an
INSERT causes the B+tree root to split, insert_in allocates a new root page
and frees the old one. After this, the cached root_page_id points to a freed
page. If the cache is not invalidated, the next execute_insert_ctx call reads
IndexDef.root_page_id from the cache and passes it to BTree::lookup_in
(uniqueness check), causing a stale-pointer read on a freed page.
The fix: call ctx.invalidate_all() whenever any index root changes during
INSERT or DELETE index maintenance. This forces re-resolution from the catalog
(which always has the current root_page_id) on the next DML statement.
Since 5.19, DELETE and the old-key half of UPDATE no longer mutate indexes in
a per-row loop. They collect exact encoded keys per index, sort them, and call
delete_many_in(...) once per affected tree. The cache-invalidation rule still
matters, but the synchronization point moved:
- batch-delete old keys per index
- persist the final root once for that index
- update the in-memory
current_indexesslice - invalidate the session cache once after the statement
For UPDATE there is a second root-sync point: after the batch delete phase, the reinsertion half must start from the post-delete root, not from the stale root captured before the batch. Otherwise reinserting new keys after a root collapse would descend from a freed page.
#![allow(unused)]
fn main() {
// DELETE / UPDATE old-key batch
let updated = delete_many_from_indexes(...)?;
for (index_id, new_root) in updated {
catalog.update_index_root(index_id, new_root)?;
current_indexes[i].root_page_id = new_root;
}
// UPDATE new-key insert phase
let ins_updated = insert_into_indexes(&current_indexes, ...)?;
}
Stable-RID UPDATE Fast Path (5.20)
5.19 removed the old-key delete bottleneck, but UPDATE still paid the full
heap delete+insert path even when the new row could fit in the existing slot.
5.20 adds a second branch:
for each matched row:
old_row = ...
new_row = apply_set_assignments(old_row)
if encoded(new_row) fits in old slot:
rewrite tuple in place
rid stays identical
only maintain indexes whose logical key/predicate membership changed
else:
fallback to delete + insert
rid changes
treat affected indexes as before
The heap rewrite path is page-grouped. Rows that share a heap page are batched so
the executor reads the page once, rewrites all eligible slots, then writes the page
once. WAL records this branch as EntryType::UpdateInPlace, storing the old and new
tuple images for the same (page_id, slot_id).
This does not implement PostgreSQL HOT chains or forwarding pointers. The Phase 5 rule is narrower and cheaper to reason about: same-slot rewrite only, otherwise fall back to the existing delete+insert path.
Clustered UPDATE In-Place Zero-Alloc Fast Path (Phase 39.22)
fused_clustered_scan_patch in executor/update_fused_range.rs implements a zero-allocation
UPDATE fast path for clustered tables when all SET columns are fixed-size.
Allocation audit
| Allocation | Before 39.22 | After 39.22 |
|---|---|---|
cell.row_data.to_vec() (phase-1 offset scan) | 1× per matched row | ❌ eliminated |
patched_data = ...clone() (phase-2 mutation) | 1× per matched row | ❌ eliminated |
encode_cell_image() in overflow path | 1× per matched row | ✅ overflow-only |
FieldDelta.old_bytes: Vec<u8> | 1× per changed field | ❌ → [u8;8] inline |
FieldDelta.new_bytes: Vec<u8> | 1× per changed field | ❌ → [u8;8] inline |
For 25K rows with 1 changed column each: ~125K heap allocations → 0.
Two-phase borrow pattern
The Rust borrow checker requires releasing the immutable page borrow before taking a mutable one. The fast path uses a split-phase approach:
Read phase (immutable borrow on page):
1. cell_row_data_abs_off(&page, idx) → (row_data_abs_off, key_len)
2. compute_field_location_runtime(row_slice, bitmap) → FieldLocation
3. MAYBE_NOP: if page_bytes[field_abs..][..size] == new_encoded[..size] { skip }
4. Capture old_buf: [u8;8] and new_encoded: [u8;8] on the stack
Write phase (mutable borrow after immutable dropped):
5. patch_field_in_place(&mut page, field_abs, new_buf[..size])
6. update_row_header_in_place(&mut page, idx, &new_header)
MAYBE_NOP (byte-identity check)
If the new encoded bytes are byte-identical to the existing page bytes
(e.g., SET score = score * 1 after integer multiplication), the field is
skipped entirely — no WAL delta, no header bump, no page write for that field.
This is an O(size) byte comparison (~4–8 bytes) before any mutation.
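The check reduces to a slice comparison before any mutation; a sketch over a raw page buffer:

```rust
/// MAYBE_NOP sketch: compare the new encoded bytes against what is
/// already on the page. Returns true only when the field was patched.
pub fn patch_if_changed(page: &mut [u8], field_abs: usize, new_encoded: &[u8]) -> bool {
    let cur = &page[field_abs..field_abs + new_encoded.len()];
    if cur == new_encoded {
        return false; // byte-identical: no WAL delta, no page write
    }
    page[field_abs..field_abs + new_encoded.len()].copy_from_slice(new_encoded);
    true
}
```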
Overflow fallback
Cells with overflow_first_page.is_some() are rare (<1% of typical workloads)
and fall back to the existing rewrite_cell_same_key_with_overflow path
(full cell re-encode). The fast path only applies to inline cells.
InnoDB's equivalent in-place path (btr_cur_upd_rec_in_place) still allocates an undo record per row for ROLLBACK support. AxiomDB's UndoClusteredFieldPatch stores undo data as inline [u8;8] arrays in the undo log entry — zero heap allocations per row even with ROLLBACK support. For a 25K-row UPDATE t SET score = score + 1, this reduces allocator pressure from ~125K allocs to zero.
Strict Mode and Warning 1265
SessionContext.strict_mode is a bool flag (default true) that controls
how INSERT and UPDATE column coercion failures are handled.
Coercion paths
INSERT / UPDATE column value assignment:
if ctx.strict_mode:
coerce(value, target_type, CoercionMode::Strict)
→ Ok(v) : use v
→ Err(e) : return Err immediately (SQLSTATE 22018)
else:
coerce(value, target_type, CoercionMode::Strict)
→ Ok(v) : use v (no warning — strict succeeded)
→ Err(_) : try CoercionMode::Permissive
→ Ok(v) : use v, emit ctx.warn(1265, "Data truncated for column '<col>' at row <n>")
→ Err(e): return Err (both paths failed)
CoercionMode::Permissive performs best-effort conversion: '42abc' → 42,
'abc' → 0, overflowing integers clamped to the type bounds.
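A sketch of the permissive TEXT → INT path covering the two examples above; clamping of overflowing integers is omitted for brevity:

```rust
/// Best-effort TEXT → INT: take the longest leading numeric prefix
/// ('42abc' → 42), defaulting to 0 when no digits are present
/// ('abc' → 0). A leading '-' is honored.
pub fn permissive_text_to_int(s: &str) -> i64 {
    let t = s.trim();
    let (sign, digits) = match t.strip_prefix('-') {
        Some(rest) => (-1, rest),
        None => (1, t),
    };
    let prefix: String = digits.chars().take_while(|c| c.is_ascii_digit()).collect();
    sign * prefix.parse::<i64>().unwrap_or(0)
}
```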
Row numbering
insert_row_with_ctx and insert_rows_batch_with_ctx accept an explicit
row_num: usize (1-based). The VALUES loop in execute_insert_ctx passes
row_idx + 1 from enumerate():
#![allow(unused)]
fn main() {
for (row_idx, value_exprs) in rows.into_iter().enumerate() {
let values = eval_value_exprs(value_exprs, ...)?;
engine.insert_row_with_ctx(&mut ctx, values, row_idx + 1)?;
}
}
This makes warning 1265 messages meaningful for multi-row inserts:
"Data truncated for column 'stock' at row 2".
SET strict_mode / SET sql_mode
The executor intercepts SET strict_mode and SET sql_mode in execute_set_ctx
(called from dispatch_ctx). It delegates to helpers from session.rs:
#![allow(unused)]
fn main() {
"strict_mode" => {
let b = parse_boolish_setting(&raw)?;
ctx.strict_mode = b;
}
"sql_mode" => {
let normalized = normalize_sql_mode(&raw);
ctx.strict_mode = sql_mode_is_strict(&normalized);
}
}
The wire layer (handler.rs) syncs the wire-visible @@sql_mode and
@@strict_mode variables with the session bool after every intercepted SET.
Both variables are surfaced in SHOW VARIABLES.
Lossless coercions (e.g. '42' → 42) never generate a warning in either mode,
matching MySQL 8's behavior where warning 1265 is reserved for actual data
loss, not clean widening.