May 2026 · CMHT Technology Research · v1.6
app/models.py), and static-per-target (template substitution with no LLM call, ~10 files). The reclassification reduced LLM probabilistic surface by ~65% and made six previously-recurrent bug classes structurally impossible. A comprehensive hardening pass (v1.10/v1.10.1, Phases 47–48) added PostgreSQL native types via dialect flag, content-pinned all 6 deployment files, hardened the AST parser, unified Python interpreter resolution under pyenv with fail-fast on system python, and added a portability ratchet meta-test. A subsequent formalization (v1.10.2, SPEC v1.1.0, Phase 50) closed the inter-Gate artifact contract gap: the four Gate 1–4 YAML artifacts (adl.core.yaml, adl.schema.yaml, adl.policy.yaml, adl.effects.yaml) plus the Gate 0 context artifact now have authoritative JSON Schemas validated at write time, with closed enums for field types (11 values) and effect types (13 values), and explicit drift-rejection patterns that refuse LLM-smuggled commentary in scalar value fields. Test count grew from 193 (v1.0 forward run) to 1621+ (v1.10.2). The methodological contribution is a four-class hierarchy of work — focused fix, architectural reclassification, audit pass, post-audit self-correction — each with its own discipline, that emerged from the empirical record of the project's own development; Phase 49–50 extended the four-class pattern from code to tooling and contract surfaces.
The standard model of software development requires human developers to translate intent into code — a process that is inherently lossy, difficult to validate, and impossible to formally verify at the intent level. Large language models have partially disrupted this model by enabling code generation from natural language descriptions, but a fundamental gap remains: natural language intent is ambiguous, incomplete, and provides no mechanism for bidirectional validation between what was specified and what was built.
This paper argues that a formal intermediate layer — a specification language positioned between human intent and autonomous execution — can close this gap. The central claim is narrow and precise: a two-primitive formal specification language can bridge the gap between human intent and autonomous software construction, validated empirically in both forward and reverse directions across six technology stacks.
The claim requires validation in both directions. Forward validation demonstrates that the specification language is sufficient to drive correct autonomous construction. Reverse validation demonstrates that the specification language captures real domain structure. The combination establishes that ADL encodes domain knowledge in a form that both drives construction and measures coverage.
Beyond the original validation, this paper additionally reports on the architectural and methodological evolution of ADL itself between v1.0 and v1.10.2 — a development arc in which the methodology produced findings about its own discipline. Section 8 introduces the four-class hierarchy of work that emerged from the empirical record: focused fix, architectural reclassification, audit pass, and post-audit self-correction. Each class has its own cost profile and methodological yield. Section 9 reports the inter-Gate artifact contract formalization (SPEC v1.1.0) that closed a long-standing gap between aspirational prose (the SPEC claimed a validator existed) and pipeline reality (no validator existed for Gate 1–4 outputs); the formalization is itself a worked example of audit-pass discipline applied to specification rather than to code.
ADL models a software domain using exactly two constructs: Entity and Transition.
An Entity is a persisted domain object with identity, discrete states, field definitions, ownership relationships, and optionally a lifecycle pattern. The entity grounding rule (v0.4) requires entities to be persisted first-class domain objects — not computed views, DTOs, or infrastructure concerns.
A Transition is a state change on an Entity with a trigger (the causal event), an actor (who can initiate it), optional guards (preconditions), and optional effects (what else happens when it fires).
    entity: Document [
      states: draft, pending_review, approved, published, archived
      owner: Folder
      evidence: models/document.py:Document
    ]

    transition: Document.draft → Document.pending_review [
      trigger: submit_for_review
      actor: Author
    ]
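The two primitives map naturally onto plain data types. The sketch below is a hypothetical Python rendering, not part of the ADL toolchain: the class names, field names, and the `check_transition` helper are illustrative assumptions, populated with the Document example above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entity:
    """A persisted domain object with identity and discrete states."""
    name: str
    states: Tuple[str, ...]
    owner: Optional[str] = None
    evidence: Optional[str] = None


@dataclass(frozen=True)
class Transition:
    """A state change on one Entity, with its trigger and actor."""
    entity: str
    source: str
    target: str
    trigger: str
    actor: str
    guards: Tuple[str, ...] = ()
    effects: Tuple[str, ...] = ()


def check_transition(entity: Entity, t: Transition) -> List[str]:
    """Return a list of problems; an empty list means the transition is well-formed."""
    problems = []
    if t.entity != entity.name:
        problems.append(f"transition targets {t.entity}, not {entity.name}")
    for state in (t.source, t.target):
        if state not in entity.states:
            problems.append(f"unknown state: {state}")
    return problems


document = Entity(
    name="Document",
    states=("draft", "pending_review", "approved", "published", "archived"),
    owner="Folder",
    evidence="models/document.py:Document",
)
submit = Transition(
    entity="Document", source="draft", target="pending_review",
    trigger="submit_for_review", actor="Author",
)
```

Here `check_transition(document, submit)` returns an empty list, while a transition into a state the entity does not declare is reported as a problem.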
ADL describes intent, not implementation. The specification defines what transitions exist; the pipeline gates decide how they are implemented per target stack. This delegation makes ADL portable across technology stacks.
v0.3: Core entity/transition primitives, 6-gate pipeline, basic field types.
v0.4: Entity grounding rule added. Reverse accuracy against SPENDCITY: ~9% manual.
v0.5: Six constructs from reverse validation gaps: extends, pattern, unique_together, async_job extensions, scheduled_job, system_effects.
v0.6: Eight constructs from TOPOS and LIITOS-AI mobile validation findings: adl.analytics.yaml, cache_ttl split, stub: true marker, extraction_effect, config_sources, derived_entity, initialization_effect, async: true on api_call.
v1.0: Gate 5.5 (Scaffold Validator) introduced. Working-app contract revised to certify validated formal spec plus consistent scaffold. LINETBOOT-USB unseen-domain forward validation established the v1.0 standard.
v1.4: Gate 0 dynamic elicitation — replacing hardcoded questions with LLMAPI-generated domain-specific questions. Sixth technology stack (Next.js/Prisma/TypeScript) validated.
v1.5: Architectural reclassification of Gate 5 file generation (Phase 46/v1.9), the most consequential structural change in the project's history, followed by comprehensive hardening (Phase 47/v1.10) and post-audit self-correction (Phase 48/v1.10.1). Six bug classes structurally eliminated. Test count grew from 678 to 1621. PostgreSQL native types added via dialect flag. Section 5 details the reclassification; Section 8 details the methodological findings.
v1.6 (this paper): Inter-Gate artifact contract formalization (Phase 50/v1.10.2). Five JSON Schemas at cli/schemas/ (one per Gate 0–4 artifact) become authoritative for the inter-gate handoff; SPEC v1.1.0 prose mirrors them and CI tests enforce non-divergence. Closed enums for entities[].fields[].type (11 values) and effects[].type (13 values), empirically derived from 30+ pipeline runs. A cleanString pattern rejects six classes of LLM-smuggled commentary in scalar value fields (backticks, inline # comments, [] placeholders, (or ...) qualifiers, parenthesized italics, leading #). The runtime validator at cli/src/spec-validator/ runs at write time after every gate; structurally invalid artifacts halt the run and are preserved at ~/.adl/runs/<run-id>/invalid/ for forensic review. Free-text by design: transition.condition (natural language preconditions) and transition.actor (open role names). Section 9 details this formalization. The codebase-mapper authoritative-context fix (Phase 49) is folded into Section 8.4.
UML and BPMN represent prior approaches to formal specification. Neither was designed for autonomous construction: they specify structure for human interpretation, not for machine-driven code generation with validation. ADL differs in that every construct has a defined consumer in the gate architecture, and the two-primitive constraint forces completeness sufficient for autonomous elaboration.
OpenAPI and AsyncAPI operate at the interface level — they specify endpoints and message formats, not domain models or behavioral contracts. ADL operates upstream of API specification, encoding the domain semantics that OpenAPI cannot express.
GitHub Copilot, Cursor, and Devin accept natural language intent and produce code but provide no formal intermediate representation for validation. ADL addresses this gap: the specification serves as a reference point for both forward construction and reverse coverage measurement.
The ADL pipeline is an 8-stage sequential chain. No natural language passes between gates after Gate 0.
Seed sentence
│
▼
Gate 0 — Elicitation [LLMAPI consensus] ──▶ intent.md + adl.context.yaml
│
▼
Gate 1 — Designer [consensus: thorough] ──▶ adl.core.yaml
│
▼
Gate 2 — Schema [consensus: balanced] ──▶ adl.schema.yaml
│
▼
Gate 3 — Policy [consensus: thorough] ──▶ adl.policy.yaml
│
▼
Gate 4 — Effects [consensus: balanced] ──▶ adl.effects.yaml
│
▼
Gate 5 — Target [consensus: fast] ──▶ scaffolding/ (entity-derived files only, post-v1.9)
│
▼
Deterministic injection [no LLM] ──▶ scaffolding/ (+ 11 files)
│
▼
Gate 5.5 — Validator [deterministic] ──▶ validated scaffolding/
│
▼
Gate 6 — Delta [consensus: balanced] ──▶ divergence report
Gate 0 — Elicitation (v1.3): User provides a one-sentence seed description. Five LLMAPI models generate 8–15 domain-specific questions calibrated to complexity signals (state machines, external APIs, IoT, mobile, multi-tenant, deployment topology). Hardcoded questions serve as offline fallback.
Gate 1 — Designer: Core domain modeling. Produces adl.core.yaml. Most consequential gate — entity identification errors propagate through all subsequent gates.
Gate 5.5 — Scaffold Validator: Post-generation validation. Runs Python ast.parse against all generated files (via the unified resolveAdlPython() helper, which fails fast if no pyenv interpreter is found); replaces syntax-error files with valid stubs; verifies entity coverage from adl.core.yaml; applies bidirectional model/migration parity check (Phase 45) — every mapped_column() in app/models.py must correspond to a sa.Column() in the migration's op.create_table(), and vice versa. Deterministic — no LLM calls.
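The syntax and coverage steps of the validator can be illustrated with Python's standard `ast` module. This is a minimal sketch under stated assumptions, not the Gate 5.5 implementation: the function names, the stub text, and the idea of checking coverage by matching class names are all hypothetical.

```python
import ast
from typing import Dict, List, Tuple

# Hypothetical stub body; the real validator's stub content is not shown in this paper.
STUB = '"""Stub: the generated file failed to parse and was replaced."""\n'


def validate_scaffold(files: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Parse every .py file and replace any that fail to parse with a valid stub."""
    validated, replaced = {}, []
    for path, source in files.items():
        if path.endswith(".py"):
            try:
                ast.parse(source, filename=path)
            except SyntaxError:
                source = STUB
                replaced.append(path)
        validated[path] = source
    return validated, replaced


def missing_entities(files: Dict[str, str], entities: List[str]) -> List[str]:
    """List entities with no class of the same name in any parseable .py file."""
    defined = set()
    for path, source in files.items():
        if not path.endswith(".py"):
            continue
        try:
            tree = ast.parse(source, filename=path)
        except SyntaxError:
            continue  # unparseable files contribute no classes
        defined |= {n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)}
    return [name for name in entities if name not in defined]
```

Both functions are deterministic: the same input files always yield the same replacements and the same coverage gaps, which is the property the gate relies on.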
Each LLM-backed gate queries five independent LLMs through a unified LLMAPI subprocess transport. Outputs are compared using Jaccard similarity with a 3-of-5 threshold. Below-threshold disagreement surfaces as an open question: LLM disagreement is information.
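How the pipeline combines the similarity score with the vote is not detailed here, so the following is only a sketch under one plausible reading: each model's output is reduced to a set of items (for example, entity names), an item is accepted when at least three of the five models propose it, and everything else becomes an open question. The function names and the sample outputs are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(outputs: List[Set[str]]) -> float:
    """Average Jaccard similarity over all pairs of model outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consensus(outputs: List[Set[str]], quorum: int = 3) -> Tuple[Set[str], Set[str]]:
    """Split proposed items into accepted (>= quorum votes) and open questions."""
    votes = Counter(item for out in outputs for item in out)
    accepted = {item for item, n in votes.items() if n >= quorum}
    open_questions = {item for item, n in votes.items() if n < quorum}
    return accepted, open_questions


# Invented outputs from five models for a document-management seed.
outputs = [
    {"Document", "Folder", "Author"},
    {"Document", "Folder"},
    {"Document", "Folder", "Reviewer"},
    {"Document", "Author"},
    {"Document", "Folder", "Author"},
]
accepted, open_questions = consensus(outputs)
```

With these inputs, Document, Folder, and Author reach the quorum, and Reviewer, proposed by a single model, is surfaced as an open question rather than silently dropped.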
The v1.0 Gate 5 architecture decomposed scaffolding generation into 4 sequential LLM passes with context chaining. This resolved the LLM output length limit that prevented single-pass generation of full-stack scaffolds. Result: 100% file coverage where single-pass had achieved 38%. The multi-pass architecture remained the dominant pattern through v1.8.
The Phase 46 reclassification was the most consequential single architectural change in the project's history. By v1.8 the pipeline had accumulated a recurring bug pattern: model/migration column drift (Bug 15), deployment-file underemission (Bugs 13, 16), requirements drift (Bug 9), hardcoded import errors (Bug 10). Each bug had been closed individually via a focused fix: a new validator rule, tightened template guidance, a pinning test. The same shape kept reappearing in new forms.
A static analysis pass over cli/src/gate5/pass-groups/pass-groups.js, cli/src/runner.js, and the fastapi_generic.md template's own annotations revealed the structural cause. Of the ~20 files Gate 5 generated, only ~7 were entity-derived. The rest split into two categories the system was treating identically:
- Static-per-target files (Dockerfile, docker-compose.yml, docker-entrypoint.sh, .env.example, .dockerignore, QUICKSTART.md, requirements.txt, app/database.py, alembic.ini, alembic/env.py): zero entity-derived content. The template documented these explicitly as "static per-target with placeholders filled from ADL context" but they passed through the LLM probabilistic path anyway.
- The initial migration (alembic/versions/001_initial.py): a closed-form AST transform of app/models.py. Generating it through a separate LLM pass meant the model and the migration could disagree, and they did, repeatedly. Bug 15 (column drift) was the structural inevitability of this misclassification.

Phase 46 implemented the reclassification:

- The deployment pass was removed from FASTAPI_PASS_GROUPS. A new injectDeploymentFiles(allFiles, target, context) function emits all 6 deployment files deterministically post-Gate-5.
- buildAlembicContent was rewritten to AST-walk app/models.py via the shared parseModelsForParity helper. The Alembic migration is now a deterministic transform of the LLM-emitted model.
- injectStaticFiles was extended to emit requirements.txt (synced verbatim with the template), app/database.py, alembic.ini, and alembic/env.py. Each was removed from the pass groups.

The result: Gate 5 LLM scope dropped from ~20 files to ~7 entity-derived files. The probabilistic surface shrank by ~65%. The Phase 45 model/migration parity validator was retained as a ratchet for the residual entity-derived class.
Six bugs (9, 12, 13, 15, 16, plus the migration leg of 17) became structurally impossible to recur — not fixed, not validated against, but removed from the path that produced them. This is the methodologically significant difference between focused fix and architectural reclassification: the former catches a bug instance, the latter eliminates the bug class.
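The project's buildAlembicContent and parseModelsForParity helpers are not reproduced in this paper. As a rough illustration of why deriving the migration from the model removes column drift, the sketch below uses Python's `ast` module to read column names out of a model file and to compare them against a migration's column list. The function names `parse_models` and `parity_gaps` and the sample model are hypothetical, and only name parity is covered.

```python
import ast
from typing import Dict, List, Set, Tuple

# Invented sample in SQLAlchemy 2.0 style; it is parsed, never executed.
MODELS = '''
class Document(Base):
    __tablename__ = "documents"
    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str] = mapped_column(String(200))
    status: Mapped[str] = mapped_column(String(32))
'''


def parse_models(source: str) -> Dict[str, List[str]]:
    """Map each table name to its mapped_column() attribute names, in order."""
    tables = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        table, columns = None, []
        for stmt in node.body:
            if (isinstance(stmt, ast.Assign)
                    and isinstance(stmt.targets[0], ast.Name)
                    and stmt.targets[0].id == "__tablename__"
                    and isinstance(stmt.value, ast.Constant)):
                table = stmt.value.value
            elif (isinstance(stmt, ast.AnnAssign)
                    and isinstance(stmt.target, ast.Name)
                    and isinstance(stmt.value, ast.Call)
                    and getattr(stmt.value.func, "id", None) == "mapped_column"):
                columns.append(stmt.target.id)
        if table:
            tables[table] = columns
    return tables


def parity_gaps(model_cols: List[str], migration_cols: List[str]) -> Tuple[Set[str], Set[str]]:
    """Bidirectional name parity: (missing from migration, missing from model)."""
    model, migration = set(model_cols), set(migration_cols)
    return model - migration, migration - model
```

When the migration's column list is produced from `parse_models` output, both sets returned by `parity_gaps` are empty by construction; a separately generated migration offers no such guarantee, which is why the parity check remains useful only as a ratchet.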
Four enterprise applications were built from ADL models by autonomous agents during the v1.0 forward validation campaign. Target platform: Express + Prisma + TypeScript + SQLite.
| Application | Entities | Domain |
|---|---|---|
| DocFlow | 9 | Document lifecycle management |
| VisitPass | 8 | Visitor and access management |
| AssetVault | 8 | Physical/digital asset tracking |
| OnboardKit | 15 | Employee onboarding orchestration |
Results: 193 passing tests, ~7,000 lines of code, 12 UI variant files, 4 Swagger API docs — entirely from ADL specifications with zero human coding after intent authoring. Multi-pass Gate 5 file coverage improved from 38% (single-pass) to 100% (multi-pass). Subsequent forward validations against vehiclemaint and uatdemo (FastAPI/PostgreSQL, Phase 36 deploy contract) added empirical deploy-path validation: adl package --deploy ships scaffolds to a configured VPS subdomain, and the v1.10.1 codebase is empirically portable to a fresh clone (verified by a portability ratchet meta-test that walks every .js file under cli/src/ and asserts no hardcoded user paths outside three documented exemption categories).
SPENDCITY is a production Python backend (FastAPI + PostgreSQL + SQLAlchemy + Celery + Gemini OCR) with 77 domain entities across 118 SQLAlchemy model classes. The ADL pipeline operated read-only and intent-only.
Results: 19.5% manual entity recovery accuracy (v0.5). The 62 missed entities were never formally specified — they accumulated as implementation decisions. This is the correct pipeline behavior: the reverse pipeline measures specification coverage, not capability ceiling.
The accuracy ceiling insight: reverse accuracy is bounded by intent coverage, not spec expressiveness. The 19.5% is a diagnostic about SPENDCITY's development process.
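The 19.5% figure follows directly from the counts above: 77 ground-truth entities with 62 missed leaves 15 recovered. A minimal sketch of the metric, assuming exact name matching and using placeholder entity names (the function name is hypothetical):

```python
from typing import Set


def entity_recovery(ground_truth: Set[str], recovered: Set[str]) -> float:
    """Share of ground-truth entities recovered, by exact name match."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return len(ground_truth & recovered) / len(ground_truth)


# Placeholder names; only the counts come from the SPENDCITY result.
truth = {f"Entity{i}" for i in range(77)}
found = {f"Entity{i}" for i in range(77 - 62)}  # 15 recovered

print(f"{entity_recovery(truth, found):.1%}")  # 19.5%
```

Because the denominator is the set of entities in the codebase and the numerator can only contain entities the intent documents mention, the score cannot exceed the intent coverage, whatever the pipeline does.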
TaskFlow (Python/FastAPI). ADL reads back code it generated. Entity recovery: 4/4 = 100%. Spec fidelity: 97%. The 3-point gap reflects Gate 2 naming conventions (name→title, text→content), which are systematic and stack-independent.
TaskFlow (Flutter/Dart). Entity recovery: 4/4 = 100%. Spec fidelity: 93%. The 7-point gap reflects Dart DTO suffix patterns and type-erased status fields. The Dart source parser extension followed the same pattern as Python: regex-based extraction, no gate logic changes.
LIITOS-AI (Python/FastAPI). A production AI/ML frontier tracker (FastAPI + PostgreSQL + Gemini API, 10 domain entities). Intent-only: 0% (planning meta-docs dominated Gate 1). With-source: 100% (10/10 entities). The run confirmed that CLASS_RE generalizes to Base/DeclarativeBase SQLAlchemy patterns.
TOPOS (React Native/TypeScript). The first mobile-first validation. Intent-only: 0% (meta-doc co-location). With-source: 50% (4/8 entities). The 50% result is a measurement artifact: Gate 1 correctly classified domain entities; the ground truth denominator was inflated by TypeScript utility types (Coords, SortOrder, LocationResult, ExtractionResult) that are not ADL-style domain entities.
Finding: the entity grounding rule is language-paradigm dependent. ORM inheritance provides a clean structural boundary; TypeScript's type system does not.
LINETBOOT-USB. A fully unseen enterprise domain (6 entities, 29 transitions, USB-based OS deployment management). All domain knowledge came from 12 adl init questions. It took eight attempts to reach 7/7 contract criteria.
Attempts 1–5 scored 0/7 on the original bootable-app contract, revealing that stateless LLM generation cannot guarantee cross-file semantic consistency at this domain complexity. The contract was revised to reflect what ADL v1.0 reliably delivers: a validated formal spec plus consistent scaffold. Gate 5.5 was introduced. Attempt 8: 7/7.
| Criterion | Result |
|---|---|
| Spec artifacts valid (4 YAML files) | PASS |
| Scaffold compiles (0 syntax errors) | PASS |
| Entity models and schemas present | PASS |
| Auth pattern present (JWT) | PASS |
| Test scaffolding covers all entities | PASS |
| Bootstrap dependencies installable | PASS |
| Gate 6 reports ACCURATE | PASS |
LIITOS-AI (Next.js/Prisma). The first validation using Gate 0 dynamic elicitation on a real production domain. Seed: "An AI-powered news tracker that aggregates RSS feeds, synthesizes briefings with AI, and delivers personalized content to users." Gate 0 generated 14 domain-specific questions including integrations, runtime_config, notifications, data_visibility, and deployment_topology.
Four attempts to 7/7. Each attempt surfaced a targeted fix: (1) static package.json injection, (2) Jest tests for TypeScript targets, (3) target-aware Gate 5.5 retries, (4) post-Gate 5.5 language filter removing Python contamination files.
Final scaffold: 30 TypeScript files + 1 Prisma schema + 10 Jest test stubs + package.json. NextAuth.js + JWT + bcrypt auth pattern. 7/7 contract criteria.
| Codebase | Stack | Type | Result |
|---|---|---|---|
| SPENDCITY | Python/FastAPI | Reverse uncontrolled | 19.5% intent-only, 100% with-source |
| LIITOS-AI | Python/FastAPI | Reverse uncontrolled | 0% intent-only, 100% with-source |
| TaskFlow | Python/FastAPI | Forward controlled | 97% spec fidelity |
| TaskFlow | Flutter/Dart | Forward controlled | 93% spec fidelity |
| TOPOS | React Native/TS | Reverse uncontrolled | 50% with-source (measurement artifact) |
| LINETBOOT-USB | Python/FastAPI | Forward unseen domain | 7/7 contract (8 attempts) |
| LIITOS-AI | Next.js/Prisma | Forward new stack | 7/7 contract (4 attempts) |
| vehiclemaint | FastAPI/PostgreSQL | Forward + deploy | v1.10.1 contract + content-pinned deployment |
The development arc from v1.0 to v1.10.1 produced findings about the methodology's own discipline. The empirical record reveals four distinct classes of work, each with its own cost profile, leverage, and discipline:
Focused fix. One bug, one closure, one validator rule, one pinning test. Empirically driven by a specific failure surface. Cheapest, narrowest leverage. Phases 42-08 through 45 were focused fixes: each closed a Bug-N instance via a tightened template, a new validator rule, and a pinning test. Across many focused fixes, a recurring bug pattern becomes visible only after enough instances accumulate.
Architectural reclassification. One pattern across multiple bugs, one structural change that removes the class from existence. Empirically driven by recurring failure shape, surfaced through static codebase analysis rather than through new failures. Phase 46 (Section 5) is the canonical example: the file-generation reclassification eliminated six bug classes simultaneously by classifying scaffold output by determinism rather than by pass number. Medium cost, highest leverage. The methodologically distinguishing property: the bug class is removed from the path, not caught after the path produces it.
Audit pass. Full concern inventory, scope decision per item with explicit defer reasons, hardening calibrated to a known transition rather than to a known bug. Empirically driven by anticipated load (a new consumer about to start using the system) rather than by an existing failure. Phase 47 (v1.10) is the canonical example: 17 documented concerns from a static codebase analysis, of which 9 were closed (each with empirical justification) and 9 deferred (each with documented reason: inherent, won't trigger under anticipated load, workaround adequate, manual cadence is correct). The methodologically distinguishing property: the deferral list is as important as the closure list. A mature audit pass distinguishes empirical pressure from speculative hardening.
Post-audit self-correction. Triggered when an audit pass is empirically falsified by a discovered miss. Phase 48 (v1.10.1) is the canonical example: the v1.10 audit was framed as standby-ready but was empirically falsified hours later when four real bugs were found in the audited codebase, each knowable from static analysis the audit had not done. The class produces three deliverables: (a) immediate fixes for the missed bugs, (b) ratchet tests preventing the missed class from recurring (Phase 48's portability.test.js meta-test that walks every .js file under cli/src/ and self-checks both directions of its classifier), and (c) a discipline upgrade addressing the meta-failure mode that allowed the audit to miss the class.
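The shape of such a ratchet test is easy to show in miniature. The sketch below is not portability.test.js: it is a hypothetical Python analogue in which the path pattern and function names are assumptions, and the project's three documented exemption categories are omitted because this paper does not enumerate them.

```python
import re
from pathlib import Path
from typing import List

# Assumed pattern: absolute home directories on Linux or macOS.
USER_PATH = re.compile(r"(/home/|/Users/)[A-Za-z0-9._-]+/")


def has_hardcoded_user_path(text: str) -> bool:
    """Classifier: does this source text embed a user-specific absolute path?"""
    return USER_PATH.search(text) is not None


def find_violations(root: Path) -> List[str]:
    """Return every .js file under root that contains a hardcoded user path."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.js")
        if has_hardcoded_user_path(p.read_text(encoding="utf-8"))
    )


def self_check() -> None:
    """Exercise the classifier in both directions before trusting a clean scan."""
    assert has_hardcoded_user_path('const p = "/home/alice/.adl/runs";')
    assert not has_hardcoded_user_path('const p = path.join(os.homedir(), ".adl");')
```

The self-check matters because a classifier that never fires would also report zero violations; testing a known-bad and a known-clean input guards against a ratchet that passes vacuously.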
Phase 48's discipline upgrade: audit-pass discipline must include standing-rule verification. The Phase 47 audit had optimized for empirical concerns documented in the project's CONCERNS.md but had not walked the rule inventory in the standing CLAUDE.md file. Bug 21 (a system-python fallback against the v1.4.0 Python Environment Discipline rule) was the canonical violation: the resolver code was technically correct, every test passed, every concern document had been scoped, and a standing rule was being violated anyway. Future audit passes must walk the rule inventory and check the codebase against each rule explicitly.
Phase 49 (v1.10.1) is a second instance of the same class, applied to tooling rather than code. A run of the project's codebase-mapper agent produced a CONCERNS.md anchored to v1.6/v1.7 state — claiming Phase 46 reclassification not done, deployment pass still in FASTAPI_PASS_GROUPS, system-python fallback still present — despite the actual code being at v1.10.1. Root cause: the mapper agent's prompt instructed it to read package manifests, source-file imports, and TODO/FIXME comments, but not the project's authoritative status documents under .planning/. ADL's source contained 122+ historical references to closed phases (Phase 39–45); the agent treated them as current state. The fix mirrored Phase 48's pattern: (a) revert the bad map and re-anchor, (b) add a structural ratchet (a new STATUS.md as machine-readable single source of truth, plus an authority-ordering rule in the agent prompt establishing that .planning/ content outranks source-code comments for status claims, plus <required_reading> blocks in the workflow that force the mapper to read STATUS.md before running), (c) the discipline upgrade: tools that read code as authoritative for project state will drift if project state lives elsewhere; the fix is making the canonical state explicit, not making the tool smarter. Phase 50's contract formalization (Section 9) follows the same pattern at a third level: where Phase 48 corrected code-side discipline and Phase 49 corrected tooling-side discipline, Phase 50 corrected the specification's discipline against itself — closing the gap between SPEC.md's aspirational prose and the runtime's actual contract enforcement.
The four classes form a hierarchy of cost and methodological maturity. Focused fix is cheapest and narrowest. Architectural reclassification is medium cost and highest leverage. Audit pass is medium cost and surveys the full inventory. Post-audit self-correction is rarest but has the highest methodological yield, because each instance produces both fixes and a discipline upgrade that prevents that meta-failure class from recurring.
Conflating the classes produces predictable failure modes. Treating every recurring bug as a focused fix produces a local-optimization treadmill — validator-rule-per-bug forever, never naming the architectural cause. Treating every audit pass as comprehensive without standing-rule verification produces hidden discipline violations. Treating every post-audit miss as just-another-bug produces no discipline upgrade and the same audit-class miss recurs in the next cycle.
Through v1.10.1 the SPEC.md document contained an aspirational claim: "Each gate emits a YAML artifact with a fixed schema. The validator checks conformance before passing to the next gate." A peer review surfaced the empirical reality — no such validator existed for the four Gate 1–4 YAML artifacts. The pipeline had shipped six releases, four forward applications, two reverse validations, and an HTML whitepaper without anything failing at write time when a gate produced a structurally invalid artifact. Drift between the prompt-described output shape and the SPEC.md-described format was invisible until something downstream broke.
The audit-pass discipline (Section 8.3) explicitly forbids closing speculative concerns. But this was not speculative: a static walk of ~/.adl/runs/ across recent runs surfaced six concrete drift patterns in Gate 1–4 output, all caused by LLMs smuggling commentary into scalar value fields. Field-type values appeared as string # JSON field for breakdown details (inline comment), string (or int) (parenthesized qualifier), `bool` (markdown backticks), [] (empty placeholder for "no value"). Effect-type values appeared as [] # No side effects when ticket is closed. None of these passed any sane parser; downstream tolerances absorbed them silently.
Phase 50 closed the gap by formalizing the inter-Gate contract as five JSON Schemas:
- cli/schemas/adl-context.schema.json (Gate 0: tech profile)
- cli/schemas/adl-core.schema.json (Gate 1: entities + transitions)
- cli/schemas/adl-schema.schema.json (Gate 2: detailed schema)
- cli/schemas/adl-policy.schema.json (Gate 3: authorization rules)
- cli/schemas/adl-effects.schema.json (Gate 4: side effects)
Closed enums. Empirically derived from 30+ pipeline runs across DocFlow, VisitPass, AssetVault, OnboardKit, vehiclemaint, uatdemo, LIITOS-AI, LINETBOOT-USB, and TaskFlow:
| Field | Closed values |
|---|---|
| entities[].fields[].type | 11 values: string, int, bool, datetime, decimal, enum, ref, belongs_to, has_one, has_many, many_to_many |
| effects[].type | 13 values: async_job, email, external_call, internal_call, webhook, scheduled_job, audit_log, notification, push_notification, api_call, db_write, extraction_effect, initialization_effect |
| policy_engine | opa, casbin, custom |
| effect.method | GET, POST, PUT, PATCH, DELETE |
| rule.effect, test.expected | allow, deny |
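The closed sets in the table translate directly into membership checks. The sketch below is illustrative only: the real contract is expressed as JSON Schema enums, and the artifact layout assumed here (a list of entities, each with a `name` and a list of `fields` carrying `name` and `type`) and the function name are assumptions.

```python
from typing import List

FIELD_TYPES = frozenset({
    "string", "int", "bool", "datetime", "decimal", "enum", "ref",
    "belongs_to", "has_one", "has_many", "many_to_many",
})
EFFECT_TYPES = frozenset({
    "async_job", "email", "external_call", "internal_call", "webhook",
    "scheduled_job", "audit_log", "notification", "push_notification",
    "api_call", "db_write", "extraction_effect", "initialization_effect",
})


def field_type_errors(core: dict) -> List[str]:
    """Report every entities[].fields[].type value outside the closed set."""
    errors = []
    for entity in core.get("entities", []):
        for fld in entity.get("fields", []):
            if fld.get("type") not in FIELD_TYPES:
                errors.append(
                    f"{entity.get('name')}.{fld.get('name')}: {fld.get('type')!r}"
                )
    return errors
```

A closed set rejects a value such as `string (or int)` outright instead of leaving each downstream consumer to decide how to interpret it.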
Drift rejection (the cleanString pattern). Every string field in every schema references a shared $def that rejects six empirically-observed LLM smuggling patterns: backticks anywhere in the value, inline # comments preceded by whitespace, leading # comments, [] placeholders, (or ...) qualifiers, and parenthesized italic markers like *(or `int`)*. These patterns were chosen by reading the actual broken outputs in ~/.adl/runs/, not by guessing what an LLM might emit. Rejecting them forces Gate prompts to emit clean values; the alternative (accepting drift) silently codifies LLM noise as the contract.
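The six rejection classes can be approximated with ordinary regular expressions. The expressions below are assumptions written for illustration, not the pattern stored in the shared $def, and the function name is hypothetical; the sample values in the usage note are the broken outputs quoted in this section.

```python
import re
from typing import List

# One assumed regex per drift class named in the text.
DRIFT_PATTERNS = {
    "backtick": re.compile(r"`"),
    "inline_comment": re.compile(r"\s#"),
    "leading_comment": re.compile(r"^#"),
    "empty_placeholder": re.compile(r"\[\]"),
    "or_qualifier": re.compile(r"\(or\s"),
    "italic_parenthetical": re.compile(r"\*\(.*?\)\*"),
}


def drift_reasons(value: str) -> List[str]:
    """Return the names of every drift class the value matches (empty if clean)."""
    return [name for name, pattern in DRIFT_PATTERNS.items() if pattern.search(value)]
```

Under these assumed patterns, `drift_reasons("string")` is empty, `drift_reasons("string (or int)")` reports the qualifier, and `drift_reasons("[] # No side effects when ticket is closed")` reports both the inline comment and the empty placeholder.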
Free-text by design. Two fields remain free-text and are documented as such in both the schemas and SPEC.md §Inter-Gate Artifact Schemas:
- transition.condition: free-form natural language describing the precondition. Empirical examples include "Within 24 hours of completion", "actor is manager", and "BootConfig not assigned to any DeploymentJob in (pending, in_progress)". A formal expression language (CEL, Rego, custom DSL) is deferred until empirical evidence shows the free-text form fails downstream consumers.
- transition.actor: bare role name. The empirical set (Admin, User, System, Author, Reviewer, Manager, Member, Technician) is observed but not closed; new role names are accepted on first use. Parenthesized modifiers like User (Admin) are rejected as drift; if a modifier is needed, a separate actor_modifier field is the correct path forward (Phase 51+ candidate).

Schema authority direction. The JSON Schemas are authoritative. SPEC.md prose mirrors them and must stay synchronized; if they diverge, the schema wins and SPEC.md is the bug. CI enforces this via a schema-vs-spec-prose drift test. This direction matters: when the contract is in machine-readable form first and human-readable form second, the contract cannot be ambiguous about its closed sets.
Runtime integration. The validator at cli/src/spec-validator/ uses ajv 8.x (with ajv-formats) and runs at write time after every Gate 1–4 emission. Structurally invalid artifacts halt the run and are preserved at ~/.adl/runs/<run-id>/invalid/<gate>/<artifact>.yaml for forensic review — not silently retried, not silently overwritten, not silently propagated. The forensic preservation pattern matches the discipline established in Phase 47.4 for portability ratchets: when something fails the contract, capture the evidence rather than discard it.
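The halt-and-preserve behaviour can be sketched independently of the schema engine. In the Python sketch below, `validate` stands in for the ajv-based validator and is any callable returning a list of error strings; the exception name, the function name, and the location of valid artifacts inside the run directory are assumptions, while the invalid/<gate>/<artifact> layout follows the path given above.

```python
from pathlib import Path
from typing import Callable, List


class InvalidArtifact(Exception):
    """Raised to halt the run when a gate emits a structurally invalid artifact."""


def write_artifact(
    run_dir: Path,
    gate: str,
    name: str,
    text: str,
    validate: Callable[[str], List[str]],
) -> Path:
    """Write a gate artifact, or preserve it under invalid/ and halt the run."""
    errors = validate(text)
    if errors:
        # Keep the evidence: no retry, no overwrite, no propagation downstream.
        quarantine = run_dir / "invalid" / gate / name
        quarantine.parent.mkdir(parents=True, exist_ok=True)
        quarantine.write_text(text, encoding="utf-8")
        raise InvalidArtifact(f"{gate}/{name}: {'; '.join(errors)}")
    target = run_dir / name
    target.write_text(text, encoding="utf-8")
    return target
```

Raising after the quarantine write ensures that the next gate never sees the invalid artifact while the exact bytes the gate produced remain available for review.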
Stop-trigger relationship. The ADL project's CLAUDE.md has a stop trigger: "Gate interface changes (any change to gate input/output contracts)." Phase 50 deliberately satisfies this stop trigger rather than violating it: the JSON Schemas are the contract, and SPEC v1.1.0 documents what the contract enforces. The change is not from "old contract" to "new contract" — it is from "undocumented implicit contract that drifted under LLM noise" to "explicit contract enforced at write time." The stop trigger remains hot for any future change to the schemas themselves.
Methodological pattern. Phase 50 is an audit-pass discipline applied to specification rather than code. Section 8 names four work classes (focused fix, architectural reclassification, audit pass, post-audit self-correction). Phase 50 demonstrates that the audit-pass class generalizes: the inventory was the SPEC's claims about itself, the comparison was against pipeline reality, and the closure was a structural change (machine-readable contract) rather than a series of point fixes. The same inversion applied to Phase 49 (codebase-mapper authoritative-context fix — Section 8.4): the inventory was the mapper's tooling-prompt, the comparison was against the project's actual state in STATUS.md, the closure was an authority-ordering rule that prevents the tool from drifting from reality.
Intent-coverage ceiling. Reverse accuracy is bounded by intent document completeness. The 19.5% SPENDCITY result reflects intent coverage. Codebases with no specification documents yield near-zero accuracy.
Measurement validity. Accuracy figures from ORM-backed stacks and TypeScript stacks are not directly comparable. ORM inheritance provides unambiguous ground truth; TypeScript type systems include utility types that inflate the denominator.
Forward validation limited to scaffolding contract. The v1.0 contract certifies spec artifacts, scaffold syntax, entity coverage, auth pattern, test scaffolding, and dependency installability. It does not certify a zero-intervention bootable application. The v1.5 deploy contract additionally certifies VPS deployment via adl package --deploy and content-pinned deployment artifacts.
Test generation coverage. Gate 5 generates minimum viable tests — happy path per entity. Branch coverage, edge cases, and error paths are the developer's responsibility.
Agent-written test circularity. Tests were written by the same class of agents that generated the code. Wave separation and spec-model test context mitigate circularity but tests have not been independently validated by human developers.
Frontend out of scope. All current targets generate API-only scaffolds. Frontend code (React, Vue, mobile UI) is the consumer's responsibility.
Semantic accuracy metric. Replacing exact name matching with embedding-based or LLM-judged semantic matching would produce more accurate automated accuracy figures at the cost of determinism.
TypeScript ground truth classification. Entity classification heuristics for TypeScript (field count, field type richness, identifier field presence) would distinguish domain entities from utility types, producing accuracy figures comparable to ORM-backed stacks.
Architectural reclassification of remaining targets. Phase 46 reclassified fastapi_generic only. The same discipline applied to nextjs_prisma, flutter_dart, and other targets is expected to produce comparable LLM-surface reductions and bug-class eliminations.
Type/nullability/index parity. The v1.10 model/migration parity validator enforces name parity only: column names must match across model and migration. Type parity (for example, String vs. Integer), nullability parity, and index parity are out of scope for v1.10 and queued for future versions.
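The difference between the enforced check and the queued one can be sketched as follows. This is an illustrative sketch, not the v1.10 validator: the function names and the input shapes (table name mapped to column names, or to column name and type text) are assumptions.

```python
def name_parity_errors(
    model_columns: dict[str, set[str]],
    migration_columns: dict[str, set[str]],
) -> list[str]:
    """Name parity: every column must appear in both model and migration."""
    errors = []
    for table in sorted(model_columns.keys() | migration_columns.keys()):
        model = model_columns.get(table, set())
        migration = migration_columns.get(table, set())
        for col in sorted(model - migration):
            errors.append(f"{table}.{col}: in model, missing from migration")
        for col in sorted(migration - model):
            errors.append(f"{table}.{col}: in migration, missing from model")
    return errors


def type_parity_errors(
    model_types: dict[str, dict[str, str]],
    migration_types: dict[str, dict[str, str]],
) -> list[str]:
    """Hypothetical queued check: shared columns must declare the same type."""
    errors = []
    for table in sorted(model_types.keys() & migration_types.keys()):
        shared = model_types[table].keys() & migration_types[table].keys()
        for col in sorted(shared):
            m, g = model_types[table][col], migration_types[table][col]
            if m != g:
                errors.append(f"{table}.{col}: model declares {m}, migration declares {g}")
    return errors
```

A column declared Integer in the model and String in the migration passes the first function and fails only the second, which is the gap this item describes.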
Multi-migration composition. The v1.10 deterministic migration emission produces a single 001_initial.py. Composing multiple migrations across schema evolution is queued.
Adapter ecosystem. Formalizing the dispatch adapter contract would enable third-party execution platforms to integrate ADL as a specification layer.
This paper presented ADL, a two-primitive formal specification language, evaluated in both forward and reverse directions across six technology stacks and seven validation targets, with a substantial architectural and methodological evolution from v1.0 to v1.10.2.
The forward evaluation produced enterprise applications with 40+ entities, 1621+ tests in the current codebase, and full autonomous construction from formal specifications. The multi-pass Gate 5 architecture (v1.0) resolved LLM output length constraints; the architectural reclassification (v1.9) reduced LLM probabilistic surface by ~65% and made six previously recurrent bug classes structurally impossible. Gate 0 replaced hardcoded elicitation with dynamic domain-calibrated question generation. The inter-Gate artifact contract formalization (v1.10.2, SPEC v1.1.0) closed a long-standing gap between aspirational prose and runtime reality: five JSON Schemas are now authoritative for the inter-Gate handoff, with closed enums for field types and effect types, and explicit rejection of LLM-smuggled commentary in scalar value fields.
The reverse evaluation established the structural asymmetry: forward accuracy is bounded by specification completeness, while reverse accuracy is bounded by intent coverage. The 19.5% SPENDCITY result is a diagnostic about specification coverage — not a pipeline capability ceiling. The TOPOS 50% result exposed a measurement validity finding: ground truth classifiers for TypeScript type systems require additional entity classification heuristics. The LINETBOOT-USB eight-attempt journey established the v1.0 standard: honest contract definition, architectural response to failures, and verified final result.
Beyond the language and pipeline contributions, the v1.0 → v1.10.2 development arc produced a methodological finding: there are at least four distinct classes of work — focused fix, architectural reclassification, audit pass, and post-audit self-correction — each with its own discipline. Conflating them produces predictable failure modes (local-optimization treadmills, hidden discipline violations, missed audit classes recurring in the next cycle). Distinguishing them is itself a methodology that emerges from the empirical record of a project's own development. Phase 49–50 demonstrated that the four-class pattern generalizes beyond code: the same audit-pass discipline applies to tooling (Phase 49 closed a codebase-mapper drift via authoritative-context injection) and to specification (Phase 50 closed the SPEC-vs-runtime contract gap via machine-readable schemas). The discipline is not specific to code maintenance; it applies anywhere a project has a contract that can drift from its enforcement.
The contribution is narrow but precise, and it has three parts. First, a minimal formal specification language can drive autonomous software construction to produce validated enterprise scaffolding from a one-sentence description. Second, that same language provides a quantitative measure of specification coverage when applied to existing systems. Both results were validated across six paradigmatically distinct technology stacks. Third, the methodology of building such a system has its own empirical structure, discoverable by reading the project's own commit history. As autonomous agent capabilities grow, the specification problem and the methodology problem both become more important, not less. ADL v1.10.2 is one answer to both.
ADL Project, May 2026.