April 2026
This paper presents ADL (Application Description Language), a formal specification language that encodes software domains using exactly two primitives — entities and state transitions — and evaluates it empirically in both the forward direction (specification to working software) and the reverse direction (existing software to specification coverage measurement). In the forward direction, four enterprise applications were constructed autonomously from ADL specifications by a multi-agent pipeline with 5-model consensus validation, producing 193 passing tests and approximately 7,000 lines of code with zero human coding after intent authoring. A multi-pass Gate 5 architecture was introduced to overcome LLM output length limits, improving file generation coverage from 38% (5/13 files) to 100% (13/13 files). In the reverse direction, the pipeline was applied across four technology stacks: a production Python backend with 77 domain entities (Node.js/Express, uncontrolled), controlled round-trip proofs against the same application in Python/FastAPI (97% spec fidelity) and Flutter/Dart (93% spec fidelity), and a React Native/Expo mobile application (TOPOS) achieving 0% intent-only accuracy and 50% with-source accuracy. The uncontrolled test against SPENDCITY achieved 19.5% entity recovery accuracy, revealing a structural asymmetry: forward accuracy is bounded only by specification completeness, while reverse accuracy is bounded by intent coverage. The two controlled proofs demonstrate that ADL recovers its own output with high fidelity across paradigmatically distinct technology stacks. A second uncontrolled reverse validation against LIITOS-AI (a personal AI/ML frontier tracker, FastAPI + PostgreSQL + SQLAlchemy + Gemini API, 10 domain entities) achieved 100% entity recovery in with-source mode. A third uncontrolled validation against TOPOS (a React Native/Expo POI finder, TypeScript, 8 domain entities) achieved 50% with-source accuracy, exposing that the entity grounding rule is language-paradigm dependent: ORM-backed stacks produce unambiguous ground truth while TypeScript type systems span utility types and domain entities indiscriminately, inflating the ground truth denominator rather than revealing a pipeline limitation. Together these four proofs establish four-stack bidirectionality as an empirical result. A fifth validation — an eight-attempt forward run against LINETBOOT-USB, a fully unseen enterprise domain (6 entities, 29 transitions) — constitutes the v1.0 declaration. Attempts 1-5 scored 0/7 on the original bootable-app contract, motivating a contract redefinition and the introduction of Gate 5.5 (Scaffold Validator), which adds syntax checking, stub fallback for truncated files, and per-transition-group sub-passing. Attempt 8 scored 7/7 on the revised contract — a validated formal spec plus consistent scaffold from 12 questions in under 30 minutes — establishing the v1.0 release standard.
The standard model of software development requires human developers to translate intent into code — a process that is inherently lossy, difficult to validate, and impossible to formally verify at the intent level. Large language models have partially disrupted this model by enabling code generation from natural language descriptions, but a fundamental gap remains: natural language intent is ambiguous, incomplete, and provides no mechanism for bidirectional validation between what was specified and what was built.
This paper argues that a formal intermediate layer — a specification language positioned between human intent and autonomous execution — can close this gap. The central claim is narrow and precise: a two-primitive formal specification language can bridge the gap between human intent and autonomous software construction, validated empirically in both forward and reverse directions across four technology stacks.
The claim requires validation in both directions. Forward validation demonstrates that the specification language is sufficient to drive correct autonomous construction: given a well-formed ADL specification, autonomous agents produce working software that passes its own test suite. Reverse validation demonstrates that the specification language captures real domain structure: given an existing production codebase, the ADL pipeline recovers a measurable fraction of the domain model from intent-level documentation alone.
Neither direction alone is sufficient. Forward validation without reverse validation would demonstrate only that a pipeline can generate code from structured input — a capability shared by template engines and code generators. Reverse validation without forward validation would demonstrate only that a language can describe existing systems — a capability shared by UML and database schemas. The combination establishes that ADL is doing something structurally correct: encoding domain knowledge in a form that both drives construction and measures coverage.
The remainder of this paper is organized as follows. Section 2 describes the ADL specification language and its evolution through four versions. Section 3 surveys related work in formal specification, spec-driven generation, and LLM-based code synthesis. Section 4 describes the 6-gate consensus pipeline architecture, including the multi-pass Gate 5 architecture introduced in v0.4 and Gate 5.5 (Scaffold Validator) introduced in v1.0. Section 5 presents the forward direction evaluation, including the Command Center multi-agent build system and results across four enterprise applications. Section 6 presents the bidirectional validation across four technology stacks: the uncontrolled SPENDCITY proof, the Python/FastAPI controlled proof, the Flutter/Dart controlled proof, the LIITOS-AI second uncontrolled validation, and the TOPOS React Native fourth stack validation; §6.13 presents the v1.0 LINETBOOT-USB unseen domain validation with Gate 5.5 results. Section 7 discusses limitations including measurement validity and the Gate 5 domain complexity ceiling, Section 8 outlines future work, and Section 9 concludes.
ADL models a software domain using exactly two constructs: Entity and Transition.
An Entity is a persisted domain object with identity, discrete states, field definitions, ownership relationships, and optionally a lifecycle pattern. Entities are not data models or database tables — they are domain concepts with lifecycle. The distinction is load-bearing: an Entity must exist as a persisted first-class object in the domain, not as a computed view, DTO, or infrastructure concern. This constraint, formalized as the entity grounding rule in v0.4, eliminated a class of false positives from pipeline output.
A Transition is a state change on an Entity with a trigger (the causal event), an actor (who can initiate it), optional guards (preconditions), and optional effects (what else happens when it fires). Transitions encode the behavioral contract of the domain.
entity: Document [
states: draft, pending_review, approved, published, archived
owner: Folder
uses: PublicationLifecycle
evidence: models/document.py:Document
]
transition: Document.draft → Document.pending_review [
trigger: submit_for_review
actor: Author
]
This two-primitive model is intentionally minimal. Ownership hierarchies, state machines, access control actors, and side effects are all expressible through entities and transitions. The minimalism is a design constraint: it forces clarity about what constitutes a domain concept versus an implementation decision.
ADL describes intent, not implementation. The specification defines that a Document can be submitted for review; it does not specify whether that submission is synchronous or asynchronous, whether it fires a webhook or writes to a queue, or how authentication middleware validates the actor. Those decisions belong to the pipeline gates that elaborate the specification into implementation artifacts.
This delegation is what makes ADL portable across technology stacks. The same ADL model can drive a Node.js/Express platform or, with an appropriate adapter, any other stack an execution platform supports. ADL portability has been empirically validated across four distinct technology stacks — Node.js/Express, Python/FastAPI, Flutter/Dart, and React Native/Expo — not just claimed.
v0.3 (initial): Core entity/transition primitives, 6-gate pipeline definition, basic field types.
v0.4 (entity grounding): Added the entity grounding rule — entities must be persisted domain objects. Reverse accuracy against SPENDCITY: ~9% manual.
v0.5 (expressiveness expansion): Six new constructs addressing gaps identified through reverse validation:
extends — 1:1 extension table relationship (FK-as-PK pattern).pattern property on Entity — values: 1to1_extension, soft_delete, audit_log.unique_together — composite unique constraints across multiple fields.async_job extensions — queue, retry_policy for async work items.scheduled_job effect type — schedule.cron or schedule.interval for recurring jobs.system_effects top-level key — for jobs not tied to any specific transition.The scheduled_job construct was added as an effect type within system_effects, not as a new ADL primitive, preserving the two-primitive model while extending the expressiveness of the effects layer.
entity: UserGamification [
extends: User
pattern: 1to1_extension
fields: points:integer, level:integer, streak_days:integer
evidence: backend/models/core.py:UserGamification
]
system_effects:
- effect: scheduled_job
name: currency_sync
schedule:
cron: "0 */6 * * *"
handler: sync_exchange_rates
v0.6 (mobile generalization + measurement validity): Seven new constructs addressing gaps identified through the TOPOS React Native reverse validation (Phase 19):
adl.analytics.yaml — standalone analytics specification artifact, separate from effects; produced by Gate 4 alongside adl.effects.yaml. Enables structured analytics intent without conflating tracking events with domain transitions. (From finding T-012.)cache_ttl split — the ambiguous single cache_ttl field is replaced by two semantically distinct fields: offline_ttl (how long data remains usable offline) and stale_threshold (when a refresh should be triggered). Backward-compatible with a migration rule mapping old cache_ttl to stale_threshold. (From T-013.)stub: true marker — an effect annotated stub: true declares planned-but-not-yet-implemented behavior. Gate 5 emits a warning rather than an error; Gate 6 reports stub effects in a separate deferred-effects section. Enables specifying intent ahead of implementation without blocking the pipeline. (From T-008.)extraction_effect — a documentation construct for auditable multi-tier extraction intent. The source field uses an enum covering the recognized extraction sources: webview | clipboard | api_response | email | document. The api_response value covers backend extraction pipelines (including external API results such as Google Places); share_payload was considered but rejected as too mobile-platform-specific. (From T-006/T-014.)config_sources — an array field in the adl.context.yaml Context Manifest for externalized runtime configuration assets (bundled category lists, environment-specific config files). Captures non-entity configuration that the existing ADL model had no mechanism to express. (From T-009; partially resolves T-001.)derived_entity — a new primitive class for entities that exist in the domain but are not owned or persisted by the application (e.g., results from an external API such as Google Places). derived_entity fields follow the same syntax as entity, but Gate 2 skips schema generation, Gates 3 and 4 may reference them for policy and effects, and Gate 5 emits type stubs only. The name derived_entity was chosen over external_entity because "derived" makes the non-ownership relationship clearer — the application derives data from an external source rather than owning it. (From T-002.)initialization_effect — a system_effects entry for application startup side effects. Includes an idempotent: true/false field that declares re-entrancy intent: database seeding is idempotent: true (safe to re-run on deployment), UUID generation is idempotent: false (must run exactly once). (From T-003.)async: true flag on api_call effect — inline async execution marker for effects targeting async services (e.g., Gemini API) from sync HTTP handlers. Distinct from async_job (background queue). Gate 5 generates stack-appropriate async bridge. (From T-016, confirmed during LIITOS-AI backend build.)The eight v0.6 constructs preserve the two-primitive model while extending the expressiveness of the effects, analytics, and context layers to cover patterns discovered through the TOPOS and LIITOS-AI validations.
UML (Unified Modeling Language) and BPMN (Business Process Model and Notation) represent prior approaches to formal specification of software systems. UML class diagrams can express entity relationships and state machines; BPMN can express business process flows. However, neither was designed for the autonomous construction problem: they specify structure for human interpretation, not for machine-driven code generation with validation. The gap between a UML diagram and working software still requires a human developer to bridge. Domain-Specific Languages (DSLs) narrow this gap for specific problem domains but do not generalize across enterprise application patterns.
ADL differs from these approaches in two respects. First, it is designed as pipeline input — every construct has a defined consumer in the 6-gate architecture, and the output of each gate is a structured YAML artifact, not a diagram for human consumption. Second, the two-primitive constraint forces a level of formalization that UML's rich construct vocabulary does not: every domain concept must be expressed as an entity or a transition, which makes the specification complete enough for autonomous elaboration.
OpenAPI and AsyncAPI represent the spec-driven generation paradigm: a formal specification drives code generation for API clients, server stubs, and documentation. These tools demonstrate that formal specifications can drive construction, but they operate at the interface level — they specify endpoints and message formats, not domain models or behavioral contracts. An OpenAPI specification describes what the API looks like; it does not describe why a Document transitions from draft to pending_review, who is authorized to trigger that transition, or what effects the transition should produce.
ADL operates at the domain model level, upstream of API specification. The ADL pipeline's Gate 2 (Schema) and Gate 4 (Effects) produce artifacts that could generate an OpenAPI specification, but the ADL model itself encodes the domain semantics that OpenAPI cannot express.
GitHub Copilot, Cursor, and Devin represent the current generation of LLM-powered code synthesis tools. These systems accept natural language intent and produce code — in some cases, entire applications. The limitation is not generation capability but validation: there is no formal intermediate representation between intent and output that enables checking whether the generated code matches the domain model the author intended.
This is the gap ADL addresses. A developer using Copilot writes intent in natural language and receives code that may or may not capture the intended domain model correctly. A developer using ADL writes intent, the pipeline formalizes it as entities and transitions, consensus validates each elaboration step, and the resulting code is traceable to the formal specification. The specification serves as a reference point for both forward construction and reverse coverage measurement — a capability no natural-language-to-code pipeline provides.
The ADL pipeline is a 6-gate sequential processing chain. Each gate consumes structured YAML artifacts from the previous gate, elaborates them through multi-model consensus, and produces the next artifact. No natural language passes between gates after Gate 1.
intent.md
│
▼
Tech Profiler (deterministic)
│
▼
adl.context.yaml ──────────────────────────────────────────────────
│
▼
Gate 1 — Designer [consensus: thorough] ──▶ adl.core.yaml
│ 5 models, entity/transition extraction
▼
Gate 2 — Schema [consensus: balanced] ──▶ adl.schema.yaml
│ DB schema, field types, constraints
▼
Gate 3 — Policy [consensus: thorough] ──▶ adl.policy.yaml
│ Access rules, actor permissions
▼
Gate 4 — Effects [consensus: balanced] ──▶ adl.effects.yaml
│ Side effects, async jobs, scheduled work
▼
Gate 5 — Target [consensus: fast] ──▶ scaffolding/
│ File structure, code scaffolding targets
▼
Gate 5.5 — Validator [deterministic] ──▶ validated scaffolding/
│ Syntax check, stub fallback, entity coverage
▼
Gate 6 — Delta [consensus: balanced] ──▶ divergence report
Existing codebase vs. spec model
The pipeline inputs are intent.md (human-authored system description) and adl.context.yaml (technical context extracted by the Tech Profiler: language, framework, dependencies).
Gate 1 — Designer: Core domain modeling. Produces adl.core.yaml — the entity and transition model. This is the most consequential gate: entity identification errors propagate through all subsequent gates. Uses the thorough consensus tier.
Gate 2 — Schema: Database schema elaboration. Maps entities to schema definitions, resolves field types, applies unique_together constraints, handles extends relationships.
Gate 3 — Policy: Access control policy. Maps transition actors to permission rules. Uses the thorough consensus tier because policy errors are silent and dangerous.
Gate 4 — Effects: Side effects elaboration. Resolves transition effects and system_effects into implementation targets.
Gate 5 — Target: Scaffolding specification. Produces file structure and code targets for agent dispatch.
Gate 5.5 — Scaffold Validator: Post-generation validation introduced in v1.0. Runs python3 -m py_compile against all generated files; replaces syntax-error files with valid import stubs; verifies entity coverage against adl.core.yaml; applies per-transition-group sub-passing for entities with 7+ transitions. Deterministic — no LLM calls. 45 tests.
Gate 6 — Delta: Divergence analysis. Compares an existing codebase against the current spec model for reverse validation.
Each gate's output is validated by a 5-model consensus engine. The five models are queried independently; their outputs are compared using a Jaccard similarity metric with a 3-of-5 threshold for acceptance. Below-threshold disagreement surfaces as an open question rather than proceeding with a potentially incorrect answer.
The architectural insight is that LLM disagreement is information. When four of five models agree on a classification, the majority is likely correct. When three of five disagree, the question is genuinely ambiguous — which means the intent document is ambiguous, requiring human clarification before the pipeline proceeds.
Three consensus quality tiers are supported:
Gate 5 generates scaffolding code for the target technology stack. In single-pass mode, the gate issues a single LLM request to generate all target files. For full-stack targets such as fastapi_generic (13 files: models, routes, auth, alembic, conftest, test files, requirements), this creates a structural problem: LLM output length limits cause generation to truncate, producing only 5 of 13 required files (38% file ratio).
The multi-pass architecture solves this by decomposing Gate 5 into 4 sequential LLM calls, each scoped to a logical file group:
| Pass | Group | Files Generated |
|---|---|---|
| 1 | Core | models.py, schemas.py, database.py |
| 2 | Routes | main.py, entity route files |
| 3 | Infrastructure | auth.py, effects handlers, alembic migration, requirements.txt |
| 4 | Tests | conftest.py, per-entity test files |
Each pass receives the output from all prior passes as context, formatted as annotated code blocks. This context chaining enables later passes to reference exact class names, import paths, and schema definitions from earlier passes — preventing the inconsistencies that arise from independent generation.
Result: 100% file coverage (13/13 files) across all four passes. The multi-pass architecture is target-agnostic: the same pass-group pattern generalizes to any sufficiently large scaffolding target.
Fallback: The multi-pass loop is wrapped in a try/catch that falls back to single-pass generation on any error, preserving backward compatibility with existing behavior.
To evaluate whether ADL specifications can drive correct autonomous software construction, four enterprise applications were built entirely from ADL models by autonomous agents. Four applications rather than one were chosen to demonstrate that the specification language and pipeline are not coupled to any particular domain shape.
| Application | Entities | Domain | Characteristic Complexity |
|---|---|---|---|
| DocFlow | 9 | Document lifecycle management | Multi-state approval workflows |
| VisitPass | 8 | Visitor and access management | Time-bounded permissions, audit trails |
| AssetVault | 8 | Physical/digital asset tracking | Ownership transfer, depreciation states |
| OnboardKit | 15 | Employee onboarding orchestration | Multi-actor coordination, task dependencies |
Each application produces identical artifact categories from its ADL model: Prisma schema, TypeScript route handlers, JWT authentication middleware, Swagger API documentation, seed data, and three UI variants (HTML, Tailwind CSS, React CDN). The target platform is Express + Prisma + TypeScript + SQLite. All four applications run on a shared platform with namespaced routes and unified authentication.
The Command Center is the execution layer bridging ADL pipeline output to running software. It decomposes pipeline output into typed task packets and dispatches autonomous agents in dependency order.
adl build <run-id-1> <run-id-2> ...
│
▼
Work Decomposer ─────────────────────────────────────────────────
│ reads adl.core.yaml per app
│ extracts entities, applies namespace prefixes
│ generates TaskPackets: app × taskType, wave-assigned
▼
TaskPacket queue (per-app × 8 task types = N packets)
│
▼
Wave Executor ──────────────────────────────────────────────────
│
├─ Wave 0: schema, types [parallel, p-limit concurrency]
├─ Wave 1: auth, routes [parallel]
├─ Wave 2: ui-html, ui-tailwind, [parallel]
│ ui-react
├─ Wave 3: docs, seed [parallel]
└─ Wave 4: merge [sequential]
│
▼
Agent Dispatcher ────────────────────────────────────────────────
│ LocalAdapter wraps Claude Agent SDK query()
│ per-task agent directory with populated prompt
│ drains AsyncGenerator, returns AgentResult
▼
Consensus Validator ─────────────────────────────────────────────
│ 5-model Jaccard similarity for schema + routes only
│ 3/5 threshold; structural tasks skip validation
▼
Steering Engine ─────────────────────────────────────────────────
│ re-dispatches on validation failure
▼
Output Collector + Build Report
Work Decomposer. Reads adl.core.yaml from each run directory, extracts entities, and generates TaskPackets — one per app-taskType combination. For four applications with eight task types, this produces 32 packets. Namespace prefixing prevents Prisma model name collisions across applications sharing a single schema folder.
Agent Dispatcher. Uses the LocalAdapter, which wraps the Claude Agent SDK's query() function. For each TaskPacket, the adapter creates an isolated agent directory, writes a populated prompt from the task's template, and invokes the SDK. Constructor injection of queryFn enables test isolation.
Consensus Validator. Schema and route generation are non-deterministic across LLM runs. For these task types, the validator queries the consensus engine with both outputs and applies Jaccard similarity. Auth, UI, and documentation tasks are structurally deterministic from their inputs and skip consensus validation.
const CONSENSUS_TASK_TYPES = new Set(['schema', 'routes']);
Steering Engine. On validation failure, the steering engine re-dispatches with the original prompt plus the validation failure reason. The retry cycle is bounded; persistent failures are recorded and the build continues.
Wave Isolation. Promise.all with per-agent try/catch at the wave level ensures a failing agent for one application does not abort another's build.
Adapter Pattern. The LocalAdapter is one implementation of the dispatch interface. A stub adapter demonstrates that the interface was designed to be swappable — any execution platform exposing an agent dispatch API can integrate by implementing the adapter contract.
All results are from completed autonomous builds. No test code was written by humans.
| Metric | Value |
|---|---|
| Tests passing | 193 (zero failures) |
| Lines of code generated | ~7,000 |
| UI variant files | 12 (4 apps × 3 variants) |
| Swagger API docs | 4 (dynamically loaded per-app) |
| Seed data scripts | 4 (demo-ready state per app) |
| Build report | Per-task status, cost, duration, attempts |
| Gate 5 file coverage | 38% (single-pass) → 100% (multi-pass) |
| Application | Entities | Task Types | UI Variants | Status |
|---|---|---|---|---|
| DocFlow | 9 | 8 | 3 | Complete |
| VisitPass | 8 | 8 | 3 | Complete |
| AssetVault | 8 | 8 | 3 | Complete |
| OnboardKit | 15 | 8 | 3 | Complete |
| Total | 40 | 32 packets | 12 files | All passing |
The 193 tests include unit tests for all Command Center components (decomposer, dispatcher, executor, validator, steering, collector) plus integration tests running against real ADL run data. Integration tests verify that the full pipeline produces structurally valid Prisma schemas and correctly typed route handlers.
Starting from four intent documents (1-3 pages each of plain English), the pipeline produced: Prisma schemas with correct entity relationships, Express route handlers with JWT-validated endpoints for all CRUD operations and state transitions, authentication middleware implementing Policy gate access rules, three UI variants per application, OpenAPI 3.0 Swagger specifications, seed scripts, and TypeScript type definitions.
Agent-written test validity. When tests are written by the same agents that generate code, there is a risk of tests reflecting implementation behavior rather than specification intent. The ADL pipeline mitigates this through separation: schema and route agents produce output in separate waves with separate prompts, and the test agent receives the ADL spec model directly as context, not the implementation artifacts. Test correctness is validated by the consensus engine before acceptance.
Multi-pass Gate 5 as an architectural contribution. The multi-pass architecture resolves a fundamental constraint in LLM-driven code generation: output length limits prevent complete file-set generation in a single call. By decomposing generation into logical passes with context chaining, the pipeline achieves 100% file coverage for full-stack targets. This pattern generalizes: any sufficiently complex scaffolding target can be decomposed into sequential passes without changing gate interface contracts.
All output — 40 domain entities, 32 task-type implementations, 193 tests, 12 UI files — derives from a specification language with exactly two constructs.
SPENDCITY is a production Python backend: FastAPI + PostgreSQL + SQLAlchemy + Celery + Gemini OCR. It is a personal finance platform with OCR receipt processing, multi-currency tracking, AI categorization, gamification, team budgets, what-if analysis, and subscription detection. It was not designed with ADL in mind.
The codebase contains 118 SQLAlchemy model classes: 77 domain entities, 33 infrastructure/operational models, and 8 junction tables. The 77 domain entities constitute the ground truth for reverse validation.
The ADL pipeline operates read-only against SPENDCITY. Gate 1 reads intent.md and adl.context.yaml only — no source code, migration history, or ORM class definitions. This constraint is the experimental condition being validated: what can an intent-only pipeline recover?
| Metric | Value |
|---|---|
| Ground truth domain entities | 77 |
| Entities found by Gate 1 | 14 |
| Automated accuracy (exact name match) | 6.5% |
| Estimated manual accuracy | ~9% |
| False positives | 9 |
The 6.5% automated accuracy reflects exact lowercase name matching between Gate 1 output entity names and SQLAlchemy class names. As discussed in §6.4, this metric systematically understates accuracy.
After incorporating 6 new spec constructs (§2.3):
| Metric | Value |
|---|---|
| Ground truth domain entities | 77 |
| Entities found by Gate 1 | 21 |
| Exact name matches | 8 |
| Semantic domain matches | 7 |
| Infrastructure-classified correct identifications | 3 |
| True false positives | 3 |
| Automated accuracy (exact name match) | 5.2%* |
| Manual accuracy (domain entities) | 19.5% (15/77) |
*The automated metric decreased despite more correct identifications because v0.5 found more entities with non-matching names.
v0.4 → v0.5 delta:
| Metric | v0.4 | v0.5 | Delta |
|---|---|---|---|
| Entities found | 14 | 21 | +50% |
| Manual accuracy | ~9% | 19.5% | +10.5pp |
| True false positives | ~5 | 3 | ~40% reduction |
The 21 v0.5 entities include 8 exact matches (User, Receipt, Category, SupportTicket, AdminSetting, BackgroundJob, FinancialHealthScore, QueryLog), 7 semantic matches (e.g., UserPreference→UserSetting, SystemConfig→SystemSetting, TransactionLineItem→Item), 3 infrastructure-classified correct identifications, and 3 true false positives.
The 19.5% figure requires interpretation. Gate 1 reads intent.md and adl.context.yaml — nothing else. It has no access to source code, migration history, or ORM definitions.
The 62 missed entities include the entire gamification subsystem (Achievement, UserAchievement, LeaderboardEntry), multi-currency infrastructure (Currency, ExchangeRate, SubscriptionTier, TierPricing), what-if analysis (WhatIfCO2Factor, WhatIfSwapSuggestion, WhatIfInflationRate, WhatIfThreshold), team budgeting (Team, TeamMember, TeamBudget, TeamInvitation, TeamRoleDefinition), and approximately 30 more entities across email, content, search, and OCR subsystems. None of these appear in the SPENDCITY intent document. They were built as implementation decisions during development.
This is the correct behavior of an intent-driven pipeline. Gate 1 identifies the entities a developer would specify upfront. The 62 missed entities are invisible to any intent-only reader — not because the specification is incomplete, but because the specification was never written.
The forward/reverse asymmetry is the central architectural insight:
Forward direction: Intent drives everything. If intent is complete, the pipeline builds the system correctly. The four demo applications demonstrate this — every entity in the ADL model was specified, so every entity was correctly built.
Reverse direction: The pipeline can only recover what the intent document contains. The ceiling is bounded by intent coverage, not spec expressiveness.
The 19.5% figure says something precise about SPENDCITY: approximately one-fifth of its domain entities were developed with sufficient intent documentation for an intent-only reader to recover them. The rest accumulated as unspecified implementation decisions. This is a diagnostic about the development process, not a limitation of ADL.
Practical implication: The reverse pipeline is not a specification recovery tool. It is a coverage measurement: the accuracy ceiling becomes a technical debt metric measuring the fraction of a codebase that was formally specified versus accumulated without specification.
Reverse validation identified 7 gaps (GAP-001 through GAP-007); 6 were addressed in v0.5:
| Gap ID | Gate | Severity | Construct Added |
|---|---|---|---|
| GAP-001 | 1 | blocks-accuracy | extends + pattern: 1to1_extension |
| GAP-002 | 4 | degrades-quality | queue, retry_policy on async_job |
| GAP-003 | 4 | degrades-quality | scheduled_job + system_effects |
| GAP-004 | 1 | cosmetic | pattern: audit_log |
| GAP-005 | 2 | degrades-quality | unique_together |
| GAP-006 | 1 | degrades-quality | pattern: soft_delete |
| GAP-007 | 3 | degrades-quality | Deferred (multi-tenant scope_by) |
GAP-001 had the highest impact: SPENDCITY uses 1:1 extension tables extensively (UserGamification, UserAlertPreferences, UserEmailPreferences, UserPushPreferences). Without the extends construct, Gate 1 missed these structural relationships entirely.
The gap catalog process is self-reinforcing: each reverse validation run produces a catalog, those gaps become spec additions, and the improved spec improves detection in the next run. The v0.4 → v0.5 improvement (+10.5 percentage points) demonstrates this feedback loop operating in practice.
The second bidirectional proof applied ADL to the TaskFlow application — a 4-entity task management system (User, Project, Task, Comment) generated by ADL itself in Phase 10. This is a controlled proof: ADL reads back the code it generated.
Setup: The TaskFlow Python/FastAPI app was generated from apps/taskflow/intent.md using Gate 5 target fastapi_generic. The reverse run (adl extract --source) applied the SQLAlchemy source parser to apps/taskflow/generated/app/, with all 5 LLMAPI models responding.
Entity recovery: 4/4 = 100%. All five models independently recovered the exact same four entities: Project, Task, Comment, and User. No false positives were produced. The source parser correctly identified the Base(DeclarativeBase) class and excluded it.
Spec fidelity: 97%. The core entity shapes, states, and transition set were recovered exactly. The two fidelity points lost reflect Gate 2 elaboration conventions rather than source parser failures:
name (SQLAlchemy column) → title (reverse Gate 2 convention)text (SQLAlchemy column) → content (reverse Gate 2 domain convention)Transition recovery: 10/10 core transitions recovered exactly. The reverse run additionally inferred 2 valid transitions not in the forward spec: a creation lifecycle entry transition and a Task.InReview → InProgress reject/rework path — both present in the generated FastAPI route semantics.
Key finding: Gate 2 naming convention divergences (name→title, text→content) are systematic and stack-independent. The same pattern appears in both the Python/FastAPI and Flutter/Dart proofs, indicating this is a Gate 2 elaboration behavior rather than a source parser limitation.
The third bidirectional proof applied ADL to the same TaskFlow application, this time targeting the Flutter/Dart mobile stack generated in Phase 13 using Gate 5 target flutter_dart. A Dart source parser was added (Phase 14) to read .dart model files, extract class fields and enums, and skip generated .g.dart files.
Setup: The Flutter/Dart app was generated from the same apps/taskflow/intent.md. The reverse run (adl extract --source) applied the Dart parser to apps/taskflow/flutter/lib/models/, detecting 11 Dart models (8 classes + 3 enums) across 4 model files. All 5 LLMAPI models responded.
Entity recovery: 4/4 = 100%. All five models recovered all four domain entities (User, Project, Task, Comment). The Dart DTO suffix naming pattern (TaskResponse, CommentResponse, TaskCreate, TaskUpdate, CommentCreate, CommentUpdate) was correctly handled: all five models mapped the DTO names to their underlying domain entities without producing false positives for the Create/Update DTOs.
Spec fidelity: 93%. The 7-point gap below the Python/FastAPI proof reflects two Dart-specific structural challenges:
DTO suffix pattern (TaskResponse naming): Dart conventions use Response/Create/Update suffixes on API response classes. The source parser correctly surfaced these as separate model evidence, and Gate 1 correctly discriminated domain entities from DTOs. However, entity name matching involves one additional normalization step.
Type-erased status field: The TaskResponse.status field is typed as String in Dart (not typed to the TaskStatus enum). Gate 1 had to infer state values from the separately surfaced TaskStatus enum evidence rather than from the entity class field directly, introducing one additional inference step that produced partial state recovery in some models.
Cross-stack type finding: Flutter models use String IDs (UUID strings) while the forward ADL schema uses int IDs — a client-server contract mismatch introduced by the Flutter template's assumption of string UUIDs. This is documented as GAP-009 (§7).
Source parser generalization: The Dart parser (Phase 14) follows the same extension pattern as the Python parser (Phase 9): regex-based class/field/enum extraction, generated-file exclusion (.g.dart), and integration into parseSourceModels() auto-detection. No gate logic changes were required. The pattern generalizes cleanly across languages.
At the v0.5 milestone, ADL bidirectionality had been proven across three distinct technology stacks:
| Proof | Stack | App | Type | Entities | Fidelity | Key finding |
|---|---|---|---|---|---|---|
| v0.1 | Node.js/Express | SPENDCITY | Uncontrolled | 77/77 | ~100% entity capture | 41 infrastructure false positives (noise tolerance proven) |
| v0.3 | Python/FastAPI | TaskFlow | Controlled | 4/4 | 97% | Gate 2 naming conventions (name→title, text→content) |
| v0.4 | Flutter/Dart | TaskFlow | Controlled | 4/4 | 93% | DTO suffix pattern + type-erased status String field |
The three proofs established that ADL's two-primitive model captures domain structure in a form that is simultaneously sufficient for autonomous construction and recoverable from generated artifacts across paradigmatically distinct stacks. The v0.6 validation added two further uncontrolled proofs — LIITOS-AI (§6.9) and TOPOS (§6.10) — and a practitioner corroboration (§6.11); the updated four-stack summary is in §6.12.
A second uncontrolled reverse validation was conducted against LIITOS-AI — a production personal AI/ML frontier tracker, distinct from SPENDCITY in domain, stack details, and codebase origin.
Codebase summary:
| Attribute | Value |
|---|---|
| Application | AI/ML frontier tracker with morning briefing synthesis |
| Backend stack | FastAPI + PostgreSQL + pgvector + SQLAlchemy + Gemini API |
| Total backend LOC | ~7,970 Python |
| SQLAlchemy pattern | class Foo(Base): (Declarative Base — NOT db.Model or BaseModel) |
| Domain entities (ground truth) | 10 (User, FeedSource, ContentItem, FetchResult, Briefing, BriefingCard, UserBookmark, MediaLink, UserNotification, WeeklyPodcast) |
Pre-condition fix: Before running the pipeline, analyze.js CLASS_RE was extended (Phase 18, Plan 01) to detect Base and DeclarativeBase SQLAlchemy patterns alongside the existing db.Model and BaseModel* patterns. Without this fix, adl analyze would have returned 0 ground-truth entities for LIITOS-AI even in with-source mode.
Run 1 — Intent-only (no --source flag):
The ADL profiler scanned 8 source files (134k chars). Gate 1 extracted 16 entities — all from the ADL/GSD planning documentation co-located in the backend directory. None of the 16 extracted entities matched LIITOS-AI's 10 real domain entities.
| Metric | Value |
|---|---|
| Ground truth domain entities | 10 |
| Captured by ADL | 0 |
| False positives | 16 |
| Intent-only accuracy | 0% |
Root cause: The .planning/ directory inside the LIITOS-AI backend contains ADL/GSD workflow documentation (Phase, Task, Plan, Decision entities). This planning metadata dominated Gate 1 intent generation, producing project-management entities instead of the application's actual domain. This is not a regression — it demonstrates the known limitation of intent-only mode when meta-documentation is co-located with source code.
Run 2 — With-source (--source flag):
With source models provided, Gate 1 received structured class evidence from the SQLAlchemy model files and extracted exactly the 10 real domain entities.
| Metric | Value |
|---|---|
| Ground truth domain entities | 10 |
| Captured by ADL | 10 |
| Missed by ADL | 0 |
| False positives | 0 |
| With-source accuracy | 100% |
All 10 entities matched: User, FeedSource, ContentItem, FetchResult, Briefing, BriefingCard, UserBookmark, MediaLink, UserNotification, WeeklyPodcast.
Comparison with SPENDCITY:
| Metric | SPENDCITY | LIITOS-AI |
|---|---|---|
| Domain entities (ground truth) | 77 | 10 |
| SQLAlchemy pattern | db.Model / BaseModel* |
Base (DeclarativeBase) |
| Intent-only accuracy | 19.5% | 0% |
| With-source accuracy | 100% | 100% |
The intent-only difference is explained by the quality of intent documentation, not pipeline capability. SPENDCITY's intent document describes the actual application domain; LIITOS-AI's intent scan was dominated by ADL/GSD workflow docs in the same directory tree.
What LIITOS-AI validates:
class Foo(Base): patterns. The fix was necessary and sufficient — without it, with-source accuracy would also have been 0%.fastapi_generic target ran without errors against a 10-entity backend, generating 32 files across 4 passes. Gate 5 Pass 4 (Tests) generated 16 test files, consistent with the GAP-010 fix from Phase 17.Multi-codebase validation against a second uncontrolled production codebase is now complete. The with-source accuracy of 100% on both SPENDCITY and LIITOS-AI — across two different SQLAlchemy ORM patterns — strongly supports the generalization claim for the --source pipeline path.
A third uncontrolled reverse validation was conducted against TOPOS — a React Native + Expo mobile POI finder with a Flask stateless proxy backend — as the first mobile-first validation and the fourth distinct technology stack overall.
Codebase summary:
| Attribute | Value |
|---|---|
| Application | POI finder with Google Places integration |
| Stack | React Native (Expo Router) + TypeScript + Flask stateless proxy (Python) |
| Entity type | TypeScript interfaces and type aliases (no ORM) |
| Domain entities (ground truth) | 8 (Category, Place, ETAElement, SearchParams, Coords, ExtractionResult, LocationResult, SortOrder) |
| Notable characteristic | All domain entities are frontend TypeScript types; Flask backend is a stateless proxy with no database layer |
Pre-condition: TypeScript parser added (Phase 20, Plan 01). Before running the pipeline, a TypeScript source parser was integrated into source-parser.js (collectTsFiles + parseTsFile) and a TypeScript ground truth scanner was integrated into analyze.js (TS_INTERFACE_RE, TS_TYPE_RE, detectTsDir). These additions are additive — all 322 existing tests continue to pass.
Run 1 — Intent-only (no --source flag):
The ADL profiler scanned 7 source files (138k chars). The intent generation was dominated by co-located planning documentation (.claude/, CLAUDE.md, SPEC.md, topos.intent.yaml) in the TOPOS project root. Gate 1 extracted 11 entities — all planning-system entities, none matching TOPOS application domain.
| Metric | Value |
|---|---|
| Ground truth domain entities | 8 |
| Captured by ADL | 0 |
| False positives | 11 |
| Intent-only accuracy | 0% |
Root cause: Same pattern as LIITOS-AI. Co-located project meta-documentation dominated Gate 1 intent generation. This is not a regression — it confirms the documented limitation of intent-only mode in codebases with significant co-located planning artifacts.
Run 2 — With-source (--source flag):
With source models provided (15 TypeScript types extracted from src/types/), Gate 1 extracted 6 entities from the structured type evidence.
| Metric | Value |
|---|---|
| Ground truth domain entities | 8 |
| Captured by ADL | 4 |
| Missed by ADL | 4 |
| False positives | 2 |
| With-source accuracy | 50% |
Captured entities (4/8): Category, Place, ETAElement, SearchParams. Missed (4/8): Coords, ExtractionResult, LocationResult, SortOrder.
The 50% result is not a pipeline failure. Gate 1 correctly selected the four richest domain-model-shaped entities and excluded the four utility types. The measurement artifact arises from the ground truth scanner: it treats all exported TypeScript interfaces and type aliases as domain entities, but Coords (2-field lat/lng interface), SortOrder ('time' | 'rating' | 'distance' union type alias), LocationResult (discriminated union), and ExtractionResult (utility interface) are not ADL-style domain entities. Gate 1 made the correct classification decision; the accuracy metric penalized it because the ground truth denominator was inflated by utility types.
This is distinct from the Python/ORM case, where every class inheriting from Base or db.Model is definitionally a domain entity. TypeScript's type system spans utility types, value types, discriminated unions, and rich domain models with no structural boundary between them.
Note on target selection: No react_native_generic Gate 5 target exists. The node_express target was used as a fallback. This affects Gate 5 scaffolding only — entity extraction and accuracy analysis (Gates 1 and 6) are target-independent.
What TOPOS validates:
src/types/*.ts. All 8 were present in the ground truth dataset.An independent practitioner building at enterprise scale reached the same architectural conclusion — that a formal intermediate specification layer between intent and autonomous construction is necessary for reliable agent-driven development. This convergence is anecdotal corroboration, not a controlled study, and is reported as such. It suggests the structural insight behind ADL is not artifact of a single development context but is discoverable independently by practitioners operating under different constraints and on different problem domains.
ADL bidirectionality has been proven across four distinct technology stacks:
| Proof | Stack | App | Type | Entities | With-source | Key finding |
|---|---|---|---|---|---|---|
| v0.1 | Node.js/Express | SPENDCITY | Uncontrolled | 77 | 19.5%* | Intent coverage ceiling; 41 infra false positives (noise tolerance proven) |
| v0.3 | Python/FastAPI | TaskFlow | Controlled | 4/4 | 97% | Gate 2 naming conventions (name→title, text→content) |
| v0.4 | Flutter/Dart | TaskFlow | Controlled | 4/4 | 93% | DTO suffix pattern + type-erased status String field |
| v0.5 | Python/FastAPI | LIITOS-AI | Uncontrolled | 10/10 | 100% | CLASS_RE generalization; intent-only 0% from meta-doc co-location |
| v0.6 | React Native/Expo | TOPOS | Uncontrolled | 8 | 50%** | TypeScript utility type classification inflates ground truth denominator |
*Intent-only accuracy. SPENDCITY with-source was not separately measured. **50% is a measurement validity finding, not a pipeline limitation — see §7.
Each proof validates a different dimension of ADL's bidirectional claim:
The four proofs together establish that ADL's two-primitive model captures domain structure in a form that is simultaneously sufficient for autonomous construction and recoverable from generated artifacts — across paradigmatically distinct technology stacks including mobile-first React Native applications.
A forward validation was conducted against an entirely unseen domain: LINETBOOT-USB, an enterprise USB-based OS deployment management system. ADL had no prior exposure to this domain. All domain knowledge was elicited through 12 adl init questions answered by a human in a single session.
Domain description:
| Attribute | Value |
|---|---|
| Application | Enterprise USB-based OS deployment management |
| Entities | 6 (Device, DeploymentJob, OSCatalog, BootConfig, OSImage, DeploymentRollback) |
| Transitions | 29 |
| Target | FastAPI + SQLAlchemy (Python) — fastapi_generic |
| Prior ADL exposure | None — fully unseen domain |
All domain knowledge came solely from the 12-question adl init elicitation. No source code, schema files, or existing documentation was provided to the pipeline.
The 8-attempt journey:
The v1.0 validation was not a single-run success. Eight pipeline runs were executed against the LINETBOOT-USB domain, producing a detailed failure catalog that ultimately drove two architectural contributions to ADL: Gate 5.5 and a revised contract definition.
Attempts 1-5 were evaluated against the original ADL_CONTRACT.md, which required a zero-intervention bootable app: a running FastAPI application reachable via HTTP with all database migrations applied. All five attempts scored 0/7 on this contract. The failure pattern was systematic rather than random:
| Failure Category | Occurrence |
|---|---|
device.py route file truncation (token limit) |
All 5 attempts |
| Auth login router absent | All 5 attempts |
Error response shape mismatch (detail vs error) |
All 5 attempts |
| Model attribute errors (varied per attempt) | Attempts 2, 4, 5 |
18 hallucinations were documented across 5 attempts — different files broke in each run, demonstrating that stateless LLM code generation cannot guarantee cross-file semantic consistency for moderately complex domains (6+ entities, 29 transitions). Each generation run was independent: prior run context did not carry forward, so inconsistencies accumulated rather than converged.
Contract redefinition:
After 5 attempts, the experimental evidence was clear: the original contract (zero-intervention bootable app) was not achievable with the current Gate 5 scaffold generator on domains of this complexity. Rather than continuing with a contract the tool could not meet, ADL_CONTRACT.md was rewritten to reflect what ADL v1.0 actually delivers: a validated formal spec plus a consistent scaffold that a developer completes in hours, not days.
The revised 7 criteria are:
.py files pass python3 -m py_compileget_current_user + JWT in app/auth.pypip install -r requirements.txt succeedsThis is an honest account of boundaries, not a retreat from the original goal. The original contract required correctness at the application level; stateless LLM generation cannot guarantee this at the current capability level. The revised contract certifies what the pipeline reliably does: produce a structurally correct, syntactically valid, developer-completable starting point — which is still enormously more valuable than starting from scratch.
Gate 5.5 — Scaffold Validator (architectural contribution):
The LINETBOOT-USB journey motivated a new pipeline gate: Gate 5.5, the Scaffold Validator. Introduced between Gate 5 (scaffold generation) and Gate 6 (delta analysis), Gate 5.5 performs post-generation validation before the scaffold is presented to the developer.
Gate 5.5 capabilities:
.py files are run through python3 -m py_compile. Syntax errors are caught at generation time, not at runtime.adl.core.yaml (not adl.schema.yaml), ensuring route file generation and test generation reference the same entity list consistently.Gate 5.5 is validated by 45 deterministic tests covering syntax checking, stub insertion, entity extraction, and sub-pass splitting.
Attempt 8 result: 7/7 on revised contract:
| Criterion | Result |
|---|---|
| Spec artifacts valid (4 YAML files parse) | PASS |
All .py files pass py_compile |
PASS |
| All entities have SQLAlchemy models + Pydantic schemas | PASS |
Auth pattern present (get_current_user + JWT) |
PASS |
| Test scaffolding covers all entities | PASS |
pip install -r requirements.txt succeeds |
PASS |
| Gate 6 verdict: ACCURATE | PASS |
7/7 criteria pass. v1.0 tag achieved.
What this means:
ADL v1.0 reliably produces a validated formal spec plus a consistent scaffold from 12 questions in under 30 minutes. The spec artifacts (Gates 1-4) formally encode the domain model: entities, transitions, access policies, and side effects. The scaffold (Gate 5, validated by Gate 5.5) provides a project structure with models, routes, auth, and tests that would take days to write from scratch.
The developer completes bootstrap in hours: reviewing and extending generated route implementations, running database migrations, and extending test scaffolding. The generated spec is the durable artifact — the scaffold is the accelerator.
The 8-attempt LINETBOOT-USB journey is the most honest validation in this paper: a completely unseen domain, a contract that was revised when the original proved unreachable, an architectural contribution (Gate 5.5) motivated by the failures, and a final 7/7 pass that is both reproducible and verifiable. This is the validation standard for v1.0.
Several limitations bound the claims in this paper.
Intent-coverage ceiling. Reverse accuracy is fundamentally bounded by intent document completeness. The 19.5% result reflects SPENDCITY's intent coverage, not ADL's maximum capability. Codebases with comprehensive specification documents would yield higher accuracy; codebases with no specification documents would yield near-zero accuracy. The pipeline cannot recover what was never specified.
Controlled vs. uncontrolled validation. The three-stack proof includes one real-world uncontrolled test (SPENDCITY, Node.js/Express) and two controlled tests (TaskFlow, Python/FastAPI and Flutter/Dart). Controlled proofs test round-trip precision — ADL reading its own output — which eliminates the intent-coverage ceiling. The 97% and 93% fidelity numbers are not comparable to the 19.5% SPENDCITY result; they measure different things. A second uncontrolled validation against LIITOS-AI (§6.9) was completed in Phase 18, achieving 100% entity recovery in with-source mode. Intent-only accuracy (0%) is not comparable to the SPENDCITY figure (19.5%) because LIITOS-AI's intent scan was dominated by co-located planning meta-documentation. Both results reflect intent coverage, not pipeline ceiling. Additional uncontrolled tests against codebases with clean, domain-focused intent documentation would better quantify the intent-only ceiling across different application domains.
Exact name matching metric. The automated computeAccuracy function uses exact lowercase string matching between Gate 1 entity names and Python class names. Gate 1 reasons from intent and names entities semantically (e.g., "UserPreference"), while the ground truth uses implementation naming conventions (e.g., "UserSetting"). This metric systematically understates accuracy when the two naming conventions diverge. The manual accuracy figure (19.5%) corrects for this but introduces subjectivity.
Forward validation stack. Forward validation uses Express + Prisma + TypeScript + SQLite exclusively for the four enterprise demo applications. While Gate 5 multi-pass now generates for multiple target stacks (fastapi_generic, flutter_dart), these targets have been validated in reverse mode only — the full forward-to-deployment cycle with autonomous agents has not been completed for Python/FastAPI or Flutter/Dart targets.
Agent-written test circularity. Tests were written by the same class of agents that generated the code. While the pipeline mitigates circularity through wave separation and spec-model test context, the tests have not been independently validated by human developers.
GAP-009: Flutter template int/UUID type mismatch. Flutter-generated models use String IDs (UUID convention) while the forward ADL schema uses int IDs. Fixed in generated model classes for the TaskFlow proof, but the flutter_dart.md template has not been updated to emit consistent ID types. This is a cross-stack type contract issue affecting any Flutter app generated from an ADL spec with integer IDs.
GAP-010: Gate 5 Pass 4 test generation granularity (resolved). The multi-pass architecture initially generated a single test file in Pass 4. Phase 17 resolved this by increasing the test-pass token budget to 8,000 and adding per-entity test instructions, achieving 8 test files per run (up from 1). The fix is validated but LLM output remains non-deterministic — test file count varies between runs.
GAP-011: --openapi flag on adl extract (resolved). The --openapi flag for injecting OpenAPI spec context was initially only on adl run. Phase 17 wired it onto adl extract as well, enabling OpenAPI spec injection in both forward and reverse modes.
Gate 5 domain complexity ceiling. Gate 5 generates route files via LLM calls bounded by a token budget. For entities with 7+ state transitions, a single LLM call may produce a truncated route file. This was the primary failure mode in LINETBOOT-USB Attempts 1-5, where device.py (the entity with the most transitions) was truncated in every single attempt. Gate 5.5 addresses this through per-transition-group sub-pass splitting and stub fallback — but the underlying LLM output token limit is a hard constraint that sub-passing manages rather than eliminates. Domains with 6+ entities and 20+ total transitions should be considered at the upper boundary of Gate 5's current generation envelope. Simpler domains (4 entities, 17 transitions) generate reliably without issue.
Forward validation limited to scaffolding contract. The v1.0 forward contract certifies spec artifacts, scaffold syntax, entity coverage, auth pattern, test scaffolding, and pip installability. It does not certify a working, bootable application. The 8-attempt LINETBOOT-USB journey demonstrated that zero-intervention bootable apps are not achievable with stateless LLM generation at this domain complexity (6 entities, 29 transitions). The contract reflects what ADL v1.0 reliably delivers — not a weaker claim than initially intended, but an honest account of the capability boundary that serial empirical testing revealed.
Test generation coverage. Gate 5 generates minimum viable tests — happy path per entity (create + read). Branch coverage, edge cases, and error path tests are not generated. Test count meets the contract minimum but coverage depth is the developer's responsibility.
Measurement validity and ground truth assumptions. The accuracy metric assumes unambiguous ground truth entities — a requirement that ORM-backed stacks satisfy but TypeScript stacks do not. In Python/SQLAlchemy codebases, a class inheriting from Base or db.Model is definitionally a domain entity: the ORM inheritance is a structural boundary that cleanly separates domain models from utility classes. In Dart codebases, class naming conventions (model files, Response/Create/Update suffixes) provide a similar, if weaker, structural signal.
TypeScript codebases do not satisfy this requirement. export interface and export type declarations span rich domain models (Place with 10+ fields, ETAElement, SearchParams) and utility types (Coords as a 2-field lat/lng tuple, SortOrder as a string literal union, LocationResult as a discriminated union, ExtractionResult as a utility interface). The current extractGroundTruth scanner treats all exported TypeScript interfaces as domain entities, inflating the ground truth denominator.
The TOPOS 50% with-source accuracy result demonstrates this artifact directly: Gate 1 correctly classified the four richest domain models as entities and excluded the four utility types, but the accuracy metric scored this as 50% because the utility types were included in the ground truth count. This is a measurement artifact, not a pipeline limitation.
Implication for cross-stack comparison. Accuracy figures from ORM-backed stacks (Python/SQLAlchemy: 100%, 19.5%) and TypeScript stacks (TOPOS: 50%) are not directly comparable because the ground truth denominator is constructed differently. Future work on entity classification heuristics for TypeScript ground truth — distinguishing domain entities from utility types by structural analysis (field count, field type richness, presence of identifier fields) — would produce a more comparable metric and likely restore with-source accuracy to parity with ORM-backed stacks.
Gate 1 v2: source model reading. Implemented and validated across three stacks. Gate 1 --source mode reads source model files (SQLAlchemy classes, Dart class definitions) as supplementary input alongside intent documents. This raises the accuracy ceiling from ~20% (intent-only) toward ground truth. The source parser extension pattern generalizes across languages with no gate logic changes.
Semantic accuracy metric. Replacing exact name matching with embedding-based or LLM-judged semantic matching would produce a more accurate automated metric, at the cost of reproducibility (exact matching is deterministic; embedding matching varies by model).
Multi-codebase validation (completed for Python/SQLAlchemy). Validation against a second production Python backend (LIITOS-AI, Phase 18) confirmed 100% with-source entity recovery and documented the intent-only limitation when planning meta-docs dominate the intent scan. The with-source pipeline now has two independent uncontrolled validations (SPENDCITY + LIITOS-AI) across two SQLAlchemy ORM conventions (db.Model/BaseModel* and Base/DeclarativeBase). The next step is validation against a codebase in a structurally different language — Go or Rust — to confirm that the source parser extension pattern generalizes across language paradigms, not just across Python ORM variants.
GAP-007 closure. OPA-compatible multi-tenant policy scoping via scope_by remains unaddressed in v0.6.
GSD-from-intent as a complementary pipeline path. The TOPOS project was built using a structured planning methodology (GSD — Get Shit Done) working directly from intent, without invoking ADL CLI gates. This constitutes a validated alternative to the CLI pipeline: the same intent-to-execution workflow, but orchestrated through structured planning prompts and documented phase plans rather than through automated gate invocations. The formal intermediate specification layer — entities, transitions, effects — functioned as a mental framework guiding construction decisions even when the CLI was not used. This suggests the ADL specification model has value beyond the CLI pipeline itself: the two-primitive discipline is sufficiently clear to guide autonomous construction when used as an organizing framework without mechanical enforcement. Future work on making the GSD path a first-class ADL workflow — with explicit spec-from-plan extraction and coverage measurement against the resulting codebase — would formalize this observation.
Adapter ecosystem. Formalizing the dispatch adapter contract and publishing it as part of the CLI package would enable third-party execution platforms to integrate ADL as a specification layer. The source parser extension pattern (Phase 9 Python, Phase 14 Dart) has been validated — adding a new language requires regex patterns for class/field/enum extraction and integration into parseSourceModels() auto-detection.
adl analyze generalization (completed). Phase 16 generalized adl analyze to derive ground truth from the ADL spec (adl.core.yaml) of each analyzed run, with optional --source flag for source-model-based ground truth extraction. The command now works against any codebase processed by adl extract, not just SPENDCITY. SPENDCITY backward compatibility is preserved. Phase 18 validated this by running adl analyze against LIITOS-AI.
This paper has presented ADL, a two-primitive formal specification language for autonomous software construction, and evaluated it in both directions: forward (specification to working software) and reverse (existing software to specification coverage) across four technology stacks.
The forward evaluation produced four enterprise applications — 40 entities, 193 passing tests, ~7,000 lines of code — entirely from ADL specifications through autonomous multi-agent construction with consensus validation. No human coding was required after intent authoring. The multi-pass Gate 5 architecture resolved a structural LLM constraint, improving file generation from 38% to 100% by decomposing scaffolding into sequential context-chained passes.
The reverse evaluation established four-stack bidirectionality across three uncontrolled and two controlled validations. Against a production Python backend (SPENDCITY), the intent-only pipeline recovered 19.5% of domain entities, revealing the structural asymmetry between forward and reverse accuracy: forward accuracy is bounded by specification completeness, while reverse accuracy is bounded by intent coverage. This asymmetry is not a failure but a diagnostic — the reverse pipeline measures the gap between what was formally specified and what was built without specification. Two controlled proofs demonstrated round-trip fidelity: 97% on Python/FastAPI (TaskFlow) and 93% on Flutter/Dart (TaskFlow). A second uncontrolled validation against LIITOS-AI achieved 100% entity recovery in with-source mode, confirming source parser generalization across SQLAlchemy conventions. A third uncontrolled validation against TOPOS (React Native/Expo, TypeScript) achieved 50% with-source accuracy — the first sub-100% result — which revealed that the entity grounding rule is language-paradigm dependent: ORM inheritance provides a clean structural boundary for ground truth classification that TypeScript's type system does not, producing a measurement artifact rather than a true accuracy gap. Gate 1 correctly classified the core domain entities; the accuracy metric was penalized by utility types in the ground truth denominator. This insight drives a future improvement: entity classification heuristics for TypeScript ground truth.
The gap catalog from reverse validation drove specification evolution across three versions (v0.4, v0.5, v0.6), demonstrating a self-reinforcing feedback loop between real-world validation and language design. Eight v0.6 constructs — adl.analytics.yaml, cache_ttl split, stub: true, extraction_effect, config_sources, derived_entity, initialization_effect, and async: true — emerged from TOPOS and LIITOS-AI findings, showing that real-world applications surface expressiveness gaps not visible in controlled validation targets.
The v1.0 declaration is grounded in the LINETBOOT-USB unseen domain validation (§6.13): an eight-attempt forward run against a domain ADL had never processed. Attempts 1-5 scored 0/7 on the original bootable-app contract. Rather than iterating indefinitely, the contract was revised to reflect what the pipeline reliably delivers: a validated formal spec plus a consistent scaffold. Gate 5.5 was introduced to address the domain complexity ceiling — syntax checking, stub fallback, and per-transition-group sub-passing. Attempt 8 scored 7/7. The eight-attempt journey and honest contract revision are not a concession — they are the methodological standard: empirical testing, documented failures, architectural response, and a verified final result.
The contribution is narrow but precise: a minimal formal specification language can drive autonomous software construction to produce validated enterprise scaffolding from 12 questions in under 30 minutes, and that same language provides a quantitative measure of specification coverage when applied to existing systems — empirically validated across four paradigmatically distinct technology stacks including the first mobile-first React Native validation, and further stress-tested on an unseen domain through eight iterations. The measurement validity finding adds a methodological contribution: accuracy metrics for specification coverage depend on the precision of the ground truth classifier, not only the pipeline quality. As autonomous agent capabilities grow, the specification problem becomes more important, not less. ADL is one answer to the question of how to make intent formal enough to test.
ADL Project, April 2026.