An AI-Scaled Method for Turning Spec Ambiguity into Confirmed Vulnerabilities
Cross-Implementation Divergence in RFC-Ambiguous Specifications: An Empirical Study
May 2026 · HUNTER Research · v0.13
Everyone audits code. Almost nobody audits the rulebook the code is built from. The internet runs on written standards — RFCs (Request for Comments — the public documents that define how internet protocols actually work; HTTP/2, TLS 1.3, DNS-over-TLS, OAuth, JWT and roughly nine thousand others are all RFCs, published by the IETF, the open standards body for the internet) — that are supposed to make independent systems agree. They don't always: the language is ambiguous, and different engineers read the same clause differently. Those gaps sit in plain sight, in documents everyone has read and nobody re-reads — until one becomes a security hole.
HUNTER is a research method (a pipeline, not a product) for going into that layer systematically, with AI doing the work no human team could do by hand. The AI does the scale-work — reading thousands of standards, spotting where the language is vague enough to cause trouble, and then quietly testing whether real-world software actually disagrees on the same input. Human discipline does the judgment — deciding what counts as a real finding, catching the cases where the AI overreached, and choosing what gets filed with vendors and what gets sent back to the standards body. The pairing is the innovation; neither half alone produces results worth trusting.
Across the nineteen different protocols HUNTER has put through this method, sixteen turned up exactly this kind of cross-implementation disagreement on identical input; three produced uniform agreement on a reading the spec leaves vague, and shipped as honest nulls (DANE/RFC 7671, TLS 1.3/RFC 8446, and DPoP/RFC 9449 — a proof-of-possession binding where five implementations got the security-critical behaviour uniformly correct). Twenty-five findings are now drafted in the project's disclosure pipeline. The first to run the full external-validation gauntlet was fixed by the maintainers of PyJWT (a library used by millions of services to check login tokens) and assigned CVE-2026-48522 by GitHub in May 2026. The other twenty-four are at various stages of coordinated disclosure. The point is not that one CVE was found — the point is that wherever an RFC leaves genuine room to interpret AND the behaviour is security-relevant, mature implementations disagree more often than not. In this work, that has happened in sixteen of the nineteen protocols tested.
Here is one of those, end to end, in concrete form. RFC 7515 §4.1.2, part of the JSON Web Signature standard that underlies JSON Web Tokens (the format most modern web services use for login sessions), says a token can include a URL pointing at the server's verification key, and that the URL "MUST resolve to a resource containing the public key." It says nothing about which URL schemes a verifier should honour. Given that one ambiguous clause, HUNTER's AI generates a crafted test input (a token whose key-URL field contains file:///etc/passwd instead of an https:// address) and feeds it to five mature JWT libraries from five different programming ecosystems (Python, Node, Go, .NET, Java). Four refuse to fetch a non-web URL by virtue of the HTTP helper they each chose; PyJWT alone reads the local file and tries to treat its contents as a cryptographic key. The disagreement is a server-side request forgery: an attacker who can present such a token can make a PyJWT-backed server hand over arbitrary local files. The vendor confirmed and fixed it as CVE-2026-48522 in May 2026. That is what a HUNTER run actually looks like in one finding: spec clause in, crafted input generated from it by the AI, real implementations probed in parallel, disagreement caught, vendor fix shipped.
Empirical record so far — 16 of 19 tested protocols showed cross-implementation disagreement; 3 honest nulls (DANE, TLS 1.3, and DPoP) · 25 findings drafted across JWT, SAML, DNSSEC, URL parsing, DoT, OpenPGP, SIP/STIR, DANE, ACME, TLS 1.2, TLS 1.3, DPoP, HTTP Structured Fields · CVE-2026-48522 the first to clear external validation, 24 more queued.
Quick vocabulary, if you don't live in this world
An RFC ("Request for Comments") is a public internet specification published by the IETF, the document that says how protocols like email transport, DNS, certificate validation, or single-sign-on are supposed to work. RFCs use three keywords with precise weight: MUST (required for conformance), SHOULD (recommended; deviations need justification), and MAY (optional). When a spec is silent on a behaviour or marks it undefined, implementations still have to choose, and different products can make different, all-technically-conformant choices. That gap is the surface this paper measures. Acronyms and protocol-specific terms are gathered in Appendix A (Glossary with external references).
Internet specifications are not executable. A line of normative text — "a verifier MUST reject", "an implementation MAY choose to omit", "this behaviour is undefined" — leaves a decision to the implementation. Where the spec uses MUST, implementations that disagree are usually wrong on one side and conformant on the other; where the spec uses SHOULD, MAY, or simply omits the behaviour, both sides can claim conformance and the disagreement becomes a security-relevant divergence between deployed systems. Figure 3 below shows how those three keywords map to predictable divergence behaviour in practice.
Figure 2 (in the opening) showed the concrete shape: the same clause reaches several engineering teams, each builds it slightly differently, and three of four mature implementations might safely reject a malicious input while the fourth quietly accepts it. The gap between codebases is the surface — and Figure 3 explains why that gap exists predictably wherever a MAY-clause or a spec-silence sits in security-sensitive code.
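The keyword-to-divergence mapping described above can be sketched in a few lines. This is a crude, case-sensitive keyword match with hypothetical helper names, offered only to make the distinction concrete; it is not the LLM-driven scanner the paper describes, and it ignores context that a real analysis would need.

```python
import re
from typing import Optional

# RFC 2119 requirement keywords, strongest obligation first.
KEYWORDS = ("MUST", "SHOULD", "MAY")


def strongest_keyword(clause: str) -> Optional[str]:
    """Return the strongest upper-case keyword in the clause, or None if it has none."""
    for keyword in KEYWORDS:
        if re.search(rf"\b{keyword}\b", clause):
            return keyword
    return None


def leaves_room_to_diverge(clause: str) -> bool:
    """True when implementations could differ and still each claim conformance."""
    # SHOULD, MAY, and silence (None) all leave the decision to the implementer.
    return strongest_keyword(clause) != "MUST"


for text in ("a verifier MUST reject",
             "an implementation MAY choose to omit",
             "this behaviour is undefined"):
    print(strongest_keyword(text), leaves_room_to_diverge(text))
```

The three sample clauses are the ones quoted in the text; a clause that mixes keywords is classified by its strongest one, which is a simplifying assumption.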
That surface is not new. What is new is treating it as enumerable. An LLM-driven scanner can read the entire IETF corpus — roughly nine thousand documents — and rank every normative section by an ambiguity-density score in minutes. A second, deeper analysis can then gate each candidate by asking whether the ambiguity will actually produce observable cross-implementation difference (not every ambiguity does). What used to be a depth-of-one specialist activity — a particular researcher reading a particular RFC very carefully — becomes a queryable inventory across the whole rulebook. The AI does the corpus-scale reading; human judgment decides which candidates are worth committing a milestone to. Nineteen milestones in, sixteen produced cross-product disagreement on identical input; three converged on uniform interpretations the standards body could now ratify (DANE/RFC 7671, TLS 1.3/RFC 8446, and DPoP/RFC 9449).
This is a discipline rather than a one-off finding because the same mechanic produces results repeatedly across unrelated protocols. The next two sections demonstrate it: §2 walks the one case that ran the full external-validation gauntlet (filed, accepted, fixed, CVE-assigned), then §3 names the three audiences the method serves with concrete leverage. The body of the paper then shows the pattern recurring in four more case studies (§4), the discipline holding when implementations agree (§5), and the discipline catching its own mistakes when the AI half overreaches (§6). Layer 2 (§7 onward) opens the methodology, the master table, and the limitations to the reader who wants to drill in.
In May 2026 HUNTER produced its first finding to run the full external-validation gauntlet — filed with the maintainer, accepted as a real bug, fixed in a new release, and assigned a CVE number by GitHub. The library in question is PyJWT, a Python library used by millions of services to issue and check the login tokens that hold web sessions together. The bug was small in shape and large in reach: in some configurations, an attacker-supplied login token could trick PyJWT into reading a local file off the server and treating its contents as a cryptographic key. That is the classic shape of a server-side request forgery — and it had been sitting unnoticed in a widely-used library.
HUNTER found it the same way it finds everything else: by holding PyJWT next to four other libraries that do the same job in four other programming ecosystems, giving all five the same input, and watching for who agrees with whom. The input was a token whose "where to fetch the verification key" field pointed at file:///etc/passwd instead of an https URL. Four of the five libraries refused to fetch it; PyJWT alone fetched the file and tried to use the contents as a key. The matrix below is the visual record.
[Figure: comparator matrix.] Four of the five libraries reject the file:// scheme by virtue of the HTTP-client library they chose. PyJWT 2.12.1 alone reads the local file because Python's default urllib opener carries a FileHandler in addition to the http(s) handlers. The maintainer confirmed and fixed the bug in PyJWT 2.13.0; GitHub assigned the CVE on 2026-05-22.
The mechanism is simple. Python's default HTTP helper quietly accepts file:// URLs in addition to http(s). The other four libraries' HTTP helpers do not. PyJWT inherited Python's permissive default without adding a URL-scheme check; the four others inherited safer defaults by choosing HTTP clients that validate the URL scheme before fetching. The structural point, and the part that matters for the rest of this paper, is that the bug lived in the disagreement between five reasonable engineering choices, not inside any single team's codebase.
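The mechanism can be observed directly in the standard library. The sketch below first confirms that urllib's default opener includes a FileHandler, then shows one possible scheme check applied before any fetch. The function names and the https-only allowlist are assumptions made for illustration; this is not the patch that shipped in PyJWT 2.13.0.

```python
import urllib.request
from urllib.parse import urlsplit

# Assumption for this sketch: verification-key URLs are only ever fetched over https.
ALLOWED_SCHEMES = frozenset({"https"})


def default_opener_handles_file_urls() -> bool:
    """Report whether urllib's default opener carries a FileHandler."""
    opener = urllib.request.build_opener()
    return any(isinstance(h, urllib.request.FileHandler) for h in opener.handlers)


def is_fetchable_key_url(url: str) -> bool:
    """Accept a key URL only if its scheme is allowlisted and it names a host."""
    parts = urlsplit(url)
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)


print(default_opener_handles_file_urls())
print(is_fetchable_key_url("file:///etc/passwd"))
print(is_fetchable_key_url("https://issuer.example/jwks.json"))
```

Checking the scheme before the URL reaches the opener keeps the decision independent of whichever handlers the HTTP helper happens to install.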
The disclosure path ran end-to-end without intervention: filed 2026-05-06 as GHSA-993g-76c3-p5m4 via GitHub's private vulnerability report; maintainer-confirmed and accepted 2026-05-21 with a scope refinement to the URL-scheme exposure; fixed and released in PyJWT 2.13.0 the same day (bundled with four other security fixes in the release); CVE assigned by GitHub as CVE-2026-48522 on 2026-05-22, publishing to the CVE List and the GHSA database. The class precedent — CVE-2024-21643 against Apache Jena, the same JKU-trust class with CVSS 7.5 — anchored the prior-art recon at the start of the comparator-matrix work and confirmed that no existing CVE, issue, PR, or silent-fix already covered PyJWT.
The structural pillar of the methodology claim lives here. The original HUNTER report carried the same shape of comparator matrix that the methodology produces on every synthetic target: same input, N independent mature implementations, one outlier, oracle-grounded verdict. That matrix shape, found in the wild on a library used by millions of downstream consumers and then confirmed, fixed, and CVE-assigned by an external authority, is externally validated evidence that the method finds real bugs, not academic artifacts. The TRACK-1 class survey in §6.4 bounds this claim honestly: PyJWT was the live nugget in a largely-mined seam (seven prior CVEs already closed in the same JKU/key-fetch-SSRF class), not the start of a virgin vein. The methodology's value is that it found the live nugget systematically, not opportunistically.
The same mechanic — predict where the rulebook is ambiguous, drive multiple real implementations against the same input, record who agrees with whom — serves three different audiences with three different concrete outcomes.
For a security team that adopts the discipline. Static analysis, dependency scanning, and fuzzing all assume the bug lives inside one codebase. HUNTER is the complementary surface: the bug lives in the disagreement between codebases — when Library A and Library B both look correct in isolation but build the same RFC clause differently, and the input that confuses Library B isn't anomalous in any way Library B's authors would notice. The PyJWT case is the canonical demonstration. PyJWT's code was idiomatic Python; urlopen with a URL pointer is what every Python tutorial does. Nothing in PyJWT's codebase would flag a static analyser. The bug only became visible by holding PyJWT next to four other libraries that all refused the same input. Comparator-matrix testing finds these.
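A single comparator-matrix row can be reduced to a small function: given each implementation's verdict on the same input and the oracle's expected verdict, label the row and name whoever disagrees with the oracle. The function name, the verdict strings, and the ecosystem labels used as keys are illustrative assumptions, not HUNTER's actual data model.

```python
def classify_row(verdicts: dict, oracle: str) -> tuple:
    """Return (row label, sorted names of implementations that disagree with the oracle)."""
    disagree = sorted(name for name, verdict in verdicts.items() if verdict != oracle)
    label = "UNIFORM" if len(set(verdicts.values())) == 1 else "DIVERGENT"
    return label, disagree


# The file:// key-URL probe from the PyJWT case, keyed by ecosystem (labels are illustrative).
file_url_row = {
    "python": "ACCEPT",
    "node": "REJECT",
    "go": "REJECT",
    "dotnet": "REJECT",
    "java": "REJECT",
}
print(classify_row(file_url_row, oracle="REJECT"))
```

Keeping the row label separate from the oracle comparison matters: a row can be UNIFORM and still wrong if every implementation disagrees with the oracle, or if the oracle itself is mistaken.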
For a standards body that consumes the output. When implementations agree on something the spec leaves ambiguous, that uniform behaviour is a candidate for ratification. When they disagree, the divergence is a candidate for clarification. Either way, the per-finding matrix is concrete evidence rather than anecdote. The methodology produced three such uniform-agreement results — Milestone Q (DANE/RFC 7671), Milestone T (TLS 1.3/RFC 8446), and Milestone U (DPoP/RFC 9449) — all shipped as STOP-5 UNIFORM honest nulls rather than vulnerability disclosures. §5 walks all three. The point is that "uniform reading" is reported as a legitimate methodological yield, not as a failure mode.
As a concrete AI-capability example. HUNTER is not "AI finds bugs" — that framing collapses in three minutes once a reader looks closely at what the AI is actually doing. HUNTER is AI doing the work a human team can't do at hand-scale (reading the entire library of internet standards, spotting ambiguous clauses, generating test inputs, running side-by-side comparisons across many real implementations at once) paired with a human layer that catches the cases where the AI got over-confident (deciding what is actually a real finding, what's a false alarm, what gets shipped to a vendor and what doesn't). §6.1 and §6.2 walk two cases where the AI's first-pass verdict was wrong and the human caught it — that is not a critique of the AI, it is evidence that the pairing works. The methodology's value as an AI-capability story is precisely the pairing, not either half alone.
The rest of the paper is the demonstration: §4 shows the pattern recurring across four more protocols; §5 shows the discipline holding when implementations agree; §6 shows the discipline catching its own mistakes. Layer 2 (§7 onward) opens the machinery to the technically inclined reader.
FINDING-031 / CVE-2026-48522 (§2) is the case that ran the full external-validation gauntlet. It is not the only case where the methodology produced the same shape of cross-implementation divergence. Four more case studies follow, each from a different protocol family — DNS-over-TLS pin enforcement, URL parsing across language standard libraries, SIP/STIR Identity credential discovery, and OpenPGP self-signature handling. Implementation identities are masked as Implementation A / B / C / D / E for the unfiled findings; published prior-art CVEs are named where they reference already-public software. The structural lesson is in the repetition: same method, same shape of result, across protocols a single-protocol specialist would not normally cross-test together.
RFC 7858 §3.2 and RFC 8310 §5.1.3 specify SPKI pin enforcement for DNS-over-TLS clients in SHOULD/MAY language: "A DNS client MAY discard any connection and try a different resolver if none of the previously validated/pinned public keys match." The scanner flagged the candidate at composite_score=0.85 with the may_choose marker, and the per-candidate analysis returned parser_differential=TRUE with hypothesised severity medium. The Phase-A smoke tested five DoT resolver implementations against a forged-zone scenario where the offered server certificate was CA-signed and presented the correct subject but did not match the configured SPKI pin.
The 4×5 matrix produced a DIVERGENT primary probe row, with four implementations accepting against the oracle and one rejecting:
P1  SPKI pin mismatch  (oracle = reject)
Implementation A   ACCEPT-INSECURE     (no SPKI feature; opportunistic mode only)
Implementation B   ACCEPT-INSECURE     (pin-enforcement config option does not exist in forward-zone block)
Implementation C   ACCEPT-INSECURE     (pin-enforcement absent in ESV branch; added in a later major)
Implementation D   ACCEPT-INSECURE     (pin parameter accepted without syntax error but the daemon logs the parameter as "unknown" and proceeds; CWE-693 shape)
Implementation E   REJECT (SERVFAIL)   (policy enforcement engaged; TLS alert BAD_CERTIFICATE)
Each non-enforcing implementation has a distinct root cause for the non-enforcement. Implementation A has no SPKI pin support at all in the audited version. Implementation B's pin-enforcement directive does not exist in the forward-zone configuration block. Implementation C does not implement pin-sha256 in its TLS block in the audited ESV branch; the feature was added in a later major. And Implementation D accepts the pin parameter without a syntax error but the daemon log explicitly says it is ignored — a silent-ignore against a parameter the user supplied with reasonable expectation of pin enforcement. The Implementation-D behavior is qualitatively distinct from the other three: it is the CWE-693 (Protection Mechanism Failure) pattern — the protection mechanism appears configured but is not engaged. The other three are feature-absence or version-gap cases that a documentation reader could in principle have caught.
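For contrast with the four non-enforcing behaviours, here is a minimal sketch of what fail-closed pin handling might look like, assuming pins in the pin-sha256 style (base64 of the SHA-256 digest of the DER-encoded SubjectPublicKeyInfo). The option names, return strings, and function names are hypothetical, and the byte strings in the usage lines are placeholders rather than real key material.

```python
import base64
import hashlib


def spki_pin(spki_der: bytes) -> str:
    """Base64 of the SHA-256 digest of the DER-encoded SubjectPublicKeyInfo."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")


def check_pin(spki_der: bytes, configured_pins: set) -> str:
    """Fail closed: once pins are configured, a key that matches none is rejected."""
    if not configured_pins:
        return "NO-PIN-CONFIGURED"
    return "ACCEPT" if spki_pin(spki_der) in configured_pins else "REJECT"


KNOWN_OPTIONS = {"forward-addr", "pin-sha256"}  # hypothetical option names


def load_options(raw: dict) -> dict:
    """Refuse unknown options instead of logging them and carrying on."""
    unknown = sorted(set(raw) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError(f"unknown option(s): {', '.join(unknown)}")
    return dict(raw)


pin = spki_pin(b"placeholder-spki-bytes")
print(check_pin(b"placeholder-spki-bytes", {pin}))
print(check_pin(b"some-other-key", {pin}))
```

The `load_options` half addresses the silent-ignore shape specifically: a misspelled or unsupported pin directive stops startup rather than leaving the operator with protection that appears configured but is not engaged.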
FINDING-023 through FINDING-026 stage one INBOX entry per non-enforcing implementation; each carries an implementation-specific root-cause field and a track classification. FINDING-026 (Implementation D, the silent-ignore case) is the highest-actionability entry. The other three are spec-clarification-class with config-gap framings, or divergence-by-design under the RFC 7858 opportunistic profile.
RFC 3986 §3.2 specifies the authority component but is silent on the precise semantics of percent-decode-before-or-after the authority/userinfo split. The WHATWG URL Standard — the living spec that browsers actually follow — defines a fixed parser order that diverges from RFC 3986. The two specifications diverge in observable ways, and library implementations cluster into a WHATWG-aligned camp and an RFC-3986-strict camp.
The Phase-A smoke tested eleven URL-parsing libraries across seven language ecosystems against five probe classes. All 55 cells (5 probe families × 11 libraries) returned a DIVERGENT label. The strongest single-finding signal in the project's history is on the F_url_authority_userinfo row: six widely-deployed language standard-library parsers ACCEPT authority percent-encoded slashes, the exact regression class that the published reference implementation libcurl patched in 7.83.1 as CVE-2022-27780. The vulnerable parsers span Python, Node legacy, Java, Ruby, and PHP standard-library ecosystems. libcurl 8.5.0 is the conformant reference: it returns host=None on the same input, so the published CVE-2022-27780 patch holds.
FINDING-021 stages a composite entry against the six vulnerable parsers with CWE-918 (SSRF) primary, CWE-444 (Inconsistent Interpretation) secondary, and CWE-178 (Improper Handling of Case Sensitivity) tertiary. The single specific root cause (RFC 3986 §3.2 silence on percent-decode-before-or-after authority split) persists across six libraries that never fixed the regression class libcurl patched in 2022. FINDING-022 stages a separate entry for the WHATWG-vs-RFC IPv4 normalization divergence with a worked SSRF-allowlist bypass: an octal-encoded IPv4 host (e.g. https://0177.0.0.1/) is normalised to the canonical decimal form by the WHATWG-aligned parsers, but is preserved as an octal-host string by the RFC-3986-strict parsers. A downstream check that compares the preserved host string against canonical decimal addresses does not recognise the octal form as 127.0.0.1; a check that normalises or resolves the host first does. FINDING-022 is (b)-class spec-clarification: RFC 3986 §3.2.2 does not mandate IPv4 normalization at the parser layer; the question is whether the WHATWG normalization should be ratified as the RFC-3986-bis default.
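The octal-host divergence is easy to reproduce. The sketch below sets Python's urlsplit, which returns the host string as written, next to a simplified normaliser in the spirit of the WHATWG IPv4 parser (hex with 0x, octal with a leading 0, decimal otherwise). The normaliser is an approximation written for this example, not the WHATWG algorithm verbatim, and its function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

_DIGITS = {8: "01234567", 10: "0123456789", 16: "0123456789abcdefABCDEF"}


def _parse_part(part: str) -> Optional[int]:
    """Parse one dotted component as hex (0x), octal (leading 0), or decimal."""
    if not part:
        return None
    radix = 10
    if part[:2].lower() == "0x":
        part, radix = part[2:], 16
    elif len(part) >= 2 and part[0] == "0":
        part, radix = part[1:], 8
    if not part:
        return 0
    if any(ch not in _DIGITS[radix] for ch in part):
        return None
    return int(part, radix)


def normalise_ipv4_host(host: str) -> Optional[str]:
    """Return the dotted-decimal form of a numeric host, or None if it is not one."""
    parts = host.split(".")
    if len(parts) > 1 and parts[-1] == "":
        parts.pop()  # tolerate a single trailing dot
    if len(parts) > 4:
        return None
    numbers = [_parse_part(p) for p in parts]
    if any(n is None for n in numbers):
        return None
    *leading, last = numbers
    # The final component may fill all remaining bytes of the 32-bit address.
    if any(n > 255 for n in leading) or last >= 256 ** (4 - len(leading)):
        return None
    value = last
    for index, n in enumerate(leading):
        value += n * 256 ** (3 - index)
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


preserved = urlsplit("https://0177.0.0.1/").hostname
print(preserved, normalise_ipv4_host(preserved))
```

A string comparison against "127.0.0.1" sees two different answers depending on which of the two values it is handed, which is the gap FINDING-022 describes.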
RFC 8224 §7.2 specifies SIP Identity credential discovery in implementation-specific language: "SIP entities SHOULD discover this credential by dereferencing the 'info' parameter, unless they have some implementation-specific way to acquire the appropriate credential." The scanner flagged the candidate at composite_score=0.87 with the implementation_specific marker and the per-candidate analysis returned parser_differential=TRUE with exploit-class credential source confusion. The Phase-A smoke tested five SIP daemons — three functional implementations of STIR/SHAKEN and two negative controls (one with the relevant module unreachable through the local test proxy; one without any STIR/SHAKEN extension at all).
This case study is the canonical worked example of the positive-control primitive. The verdict required three Phase-A runs:
request_count = 0 across the entire run. The REJECTs were transport-layer bounces (one daemon failed at the CLI invocation layer with "Connection refused"; another emitted a parse error on a header field; a third errored at the harness layer on two probe rows) and no implementation had actually reached the credential-validation code path. UNIFORM-REJECT looked identical to "all implementations correctly rejected" but was actually "no implementation reached the code under test". The disqualifying signal (zero mock-authority requests on a probe family whose entire purpose is to dereference an external URI) was present in the autonomous executor's own output but had not been surfaced as disqualifying, consistent with the project's documented end-of-context-window degradation pattern. The PM review caught it; the verdict was rejected; a harness repair pass was queued.
FINDING-030 stages a (b)-class spec-clarification entry. The disposition is deliberate: both implementations are mature carrier-grade SIP daemons and neither is yet proven correct; the divergence on identical input is unambiguous evidence of either an implementation bug or an RFC 8224 §5.3.1 / RFC 8225 §5 / §6.2 PASSporT-reconstruction spec gap. The filing target is the relevant IETF working group rather than per-vendor CVE channels until the root cause is isolated.
The methodological observation is independent of the finding itself: the positive-control gate is what made the verdict credible. Without the request_count = 0 disqualification rule, the Run-2 STOP-5 UNIFORM would have shipped as a methodology-valid negative result, the harness bug would never have been caught, and the real divergence in Run 3 would have remained invisible. The PM-side rejection of an autonomous-pipeline verdict is the first instance in the project's record where the human-in-loop quality gate caught a false negative the autonomous layer would otherwise have shipped. It is framed not as embarrassment but as evidence the verdict model is sound under stress. The post-incident discipline upgrade has promoted the positive-control gate from per-pack hygiene to a framework-level precondition: any probe family where "uniform negative" is a possible outcome must demonstrate a non-zero reached-the-path signal before any verdict can be emitted.
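The positive-control precondition can be expressed as a guard that runs before any verdict is computed. The sketch below assumes the reached-the-path signal is a simple request counter, as in the SIP case; the function name and signature are hypothetical, while the verdict labels are the ones used in this paper.

```python
def emit_verdict(cells: list, request_count: int) -> str:
    """Emit a verdict only if the code under test was demonstrably reached."""
    if not cells or request_count == 0:
        # No evidence that any implementation reached the path being probed.
        return "HARNESS-ERROR"
    return "STOP-5 UNIFORM" if len(set(cells)) == 1 else "STOP-4 DIVERGENT"


# Five uniform REJECT cells with zero mock-authority requests are disqualified.
print(emit_verdict(["REJECT"] * 5, request_count=0))
print(emit_verdict(["REJECT"] * 5, request_count=5))
print(emit_verdict(["REJECT", "REJECT", "ACCEPT"], request_count=3))
```

Placing the guard first means a uniform row can never be read as a null unless the counter shows the probe actually exercised the dereference path.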
RFC 9580 §10.2 describes the Transferable Secret Key structure with a permissive clause: "An implementation MAY choose to omit the self-signatures, especially if a Transferable Public Key accompanies the Transferable Secret Key." The scanner flagged the candidate at composite_score=0.85 with the may_choose marker and the per-candidate analysis returned parser_differential=TRUE with exploit-class key property stripping / binding signature bypass. Two published prior-art CVEs anchor the target class — CVE-2021-23992 (Thunderbird/RNP — no self-signature check on the user ID binding) and CVE-2021-3521 (RPM — no binding-signature check on subkeys). The Phase-A smoke tested five OpenPGP implementations.
The 4×5 matrix produced a three-way split on the primary probe row:
P1  no-self-sigs TSK  (fixture: tsk-no-self-sigs.pgp)
Implementation A   REJECT                       (strict-reject; logs "no user ID — skipped")
Implementation B   REJECT                       (strict-policy reject; unknown capability flags)
Implementation C   ACCEPT-CP marked INVALID     (imports the TSK but flags both keys [INVALID]; partial mitigation post CVE-2021-23992; downstream consumer must check the marker)
Implementation D   ACCEPT-CP default zero-flags (imports; primary keyflags=0x00; no default substitution but no refusal either)
Implementation E   ACCEPT-CP no flags           (imports via library API; kf=none; library-not-application framing; downstream apps using raw-parse expose the vulnerability even if the library itself does not enforce)
The three-way split is the load-bearing structural signal. Strict implementations (A, B) refuse the TSK outright. The partial-mitigation implementation (C) accepts the TSK but marks it INVALID — the CVE-2021-23992 patch holds at the import gate but the key is persisted, and downstream consumers reading the keyring may not filter on the marker. Vulnerable-pattern implementations (D, E) accept the TSK with no marker and no flags — the pattern that CVE-2021-3521 closed in one specific downstream user (RPM) persists in two widely-deployed library implementations. The empirical confirmation arrived in the same run as the methodology prediction: Tier-1 direct-precedent CVEs identified at the recon step → STOP-4 DIVERGENT at smoke matching the predicted class.
FINDING-027 (Implementation C) is (b)-class spec-clarification because RFC 9580 §10.2's permissive language explicitly permits the INVALID-marker interpretation. FINDING-028 and FINDING-029 (Implementations D, E) are (a)-class CVE candidates because the spec-permissive language does not extend to silent default-permission substitution; the downstream reach of these two implementations across consumer email/secure-messaging stacks and across signed-Java-artifact toolchains respectively makes these the most-actionable entries staged in the INBOX during the period covered by this paper.
Three of the nineteen gated milestones produced uniform agreement instead of divergence: Milestone Q against DANE/TLSA Usage 2 reconstruction (RFC 7671 §5.2.2), Milestone T against TLS 1.3 trust-anchor handling (RFC 8446 §4.4.2, grounded in RFC 5280 §6.1.1(d)), and Milestone U against OAuth DPoP token-type resolution (RFC 9449 §5). All three shipped as STOP-5 UNIFORM honest nulls — the method's verdict that the specification is ambiguous, the implementations have nevertheless converged on a single reading, and the appropriate downstream artifact is a standards-body clarification rather than a vulnerability disclosure. Together they are the evidence that the method's negative results are trustworthy. A method that only produces a result when something is wrong cannot be trusted on its silences; a method that produces uniform-agreement when nothing is wrong demonstrates the verdict shape genuinely distinguishes the two.
Q was the first milestone in the audit-driven sequence to produce uniform agreement. The candidate had passed every gate that the divergent milestones passed — high composite_score (0.83), parser_differential=TRUE, ≥2 independent mature implementations, ≥4 probe families. The Phase-A smoke against the two DANE-native libraries (OpenSSL 3.0.13 + GnuTLS 3.8.3, with three additional negative-control daemons to confirm the test substrate) returned uniform results across the entire matrix: both implementations ACCEPT-VIA-DANE-RECONSTRUCTION on the primary probe, both REJECT-CHAIN-INCOMPLETE on the negative probes, both UNSUPPORTED-NO-DANE-SUPPORT on the fallback probe.
The two mature DANE-native libraries implement RFC 7671 §5.2.2 reconstruction identically — same authentication outcomes, same chain-validation outcomes, same DANE-support detection. The spec ambiguity is real; the ecosystem has converged on a single interpretation. The appropriate downstream artifact is a spec-revision recommendation to the relevant IETF working group to ratify what implementations actually do, rather than a vulnerability disclosure. No finding draft was staged.
The credibility of the null is what distinguishes a STOP-5 UNIFORM from a HARNESS-ERROR. The positive-control gate (§7.6) requires that at least one cell in the matrix demonstrate the implementation actually reached the code under test. In Q's case, the primary probe's ACCEPT-VIA-DANE-RECONSTRUCTION on both implementations is itself the positive control: a non-trivial chain reconstruction that required the implementation to fetch, parse, and validate the TLSA record. Uniform agreement on that primary probe shows the implementations did the work; uniform agreement on the negative probes shows they did the work consistently.
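The gates a candidate such as Q had to pass before any smoke run can be written as one checklist. In the sketch below the field names echo the scanner output quoted in this paper, the minimum counts (at least 2 implementations, at least 4 probe families) come from the text above, and the 0.80 score floor is purely an assumption, since the paper does not state the threshold.

```python
from dataclasses import dataclass

MIN_SCORE = 0.80          # assumption: the actual threshold is not stated in the paper
MIN_IMPLEMENTATIONS = 2   # from the text: at least two independent mature implementations
MIN_PROBE_FAMILIES = 4    # from the text: at least four probe families


@dataclass(frozen=True)
class Candidate:
    composite_score: float
    parser_differential: bool
    independent_impls: int
    probe_families: int


def failed_gates(c: Candidate) -> list:
    """Return the names of the gates the candidate fails (empty list means proceed)."""
    failures = []
    if c.composite_score < MIN_SCORE:
        failures.append("composite_score")
    if not c.parser_differential:
        failures.append("parser_differential")
    if c.independent_impls < MIN_IMPLEMENTATIONS:
        failures.append("independent_impls")
    if c.probe_families < MIN_PROBE_FAMILIES:
        failures.append("probe_families")
    return failures


milestone_q = Candidate(0.83, True, independent_impls=2, probe_families=4)
print(failed_gates(milestone_q))
```

Passing every gate says only that the candidate is worth a smoke run; as Q shows, the run itself can still come back uniform.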
Milestone T tested whether mature TLS 1.3 implementations enforce constraints that RFC 5280 §6.1.1(d) explicitly excludes from path validation. The relevant normative text restricts the trust-anchor information used during validation to issuer name, public key algorithm, public key, and optionally public key parameters — a trust anchor's own Basic Constraints, Key Usage flags, and other extension fields are not inputs to TLS 1.3 path validation by spec. The Phase-A smoke against five TLS implementations (OpenSSL, GnuTLS, Go crypto/tls, rustls, plus one negative-control) produced uniform agreement across all five probe families: 5/5 ACCEPT on the primary probes that depend on trust-anchor extension fields the spec says are out of scope, 5/5 REJECT on the negative probes that test constraints which ARE in scope.
T is where the oracle-validity doctrine (§6.2) was applied prospectively. In Milestone S the autonomous pipeline had nearly mis-flagged a structurally identical 5/5-ACCEPT pattern as a uniform failure ("five implementations broken on Basic Constraints", a would-be marquee finding). The Task-0 verification caught it: the oracle was wrong, the implementations were correct, the constraint the pack thought it was testing was explicitly excluded by RFC 5280 §6.1.1(d). In Milestone T, the lesson was applied from the start. The oracle was RFC-grounded against RFC 5280 §6.1.1(d) before any probe ran. The 5/5-ACCEPT cells were read correctly the first time as uniform-CORRECT, not as a uniform failure to investigate. No marquee re-interpretation, no HYBRID disposition, no Task-0 rescue: the discipline did its work invisibly because the oracle was right.
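One way to make "the oracle was RFC-grounded before any probe ran" mechanical is to refuse to start a run while any probe's expected verdict lacks a spec citation. The sketch below is an illustration under that assumption; the Probe fields, probe names, and function name are hypothetical and do not describe HUNTER's actual probe format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    name: str
    oracle: str    # expected verdict, e.g. "ACCEPT" or "REJECT"
    citation: str  # the spec clause that justifies the expected verdict


def ungrounded_probes(probes: list) -> list:
    """Return the names of probes whose oracle cites no spec clause."""
    return [p.name for p in probes if not p.citation.strip()]


pack = [
    Probe("anchor-basic-constraints", "ACCEPT", "RFC 5280 §6.1.1(d)"),
    Probe("anchor-key-usage", "ACCEPT", ""),  # would block the run until grounded
]
blocked = ungrounded_probes(pack)
print(blocked)
```

A citation field does not prove the oracle is right, but it forces the question Milestone S answered too late: which clause says this input should be rejected?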
That distinction matters for the credibility argument. Q is the discipline working from the start of the post-PROMOTE audit-driven sequence: when DANE was probed, the gate stack was in place, and the result was an honest null. T is the discipline surviving its own near-miss: S taught a lesson, the lesson was codified as oracle-validity, T applied the codified lesson without needing it caught a second time. A method's negative-result trustworthiness depends on whether the discipline producing those nulls is operational and recurring — not just recoverable when something almost goes wrong.
Milestone U targeted the proof-of-possession bypass surface in OAuth DPoP — the security-critical reason DPoP exists. RFC 9449 §5 specifies token-type resolution for DPoP-bound access tokens; the scanner flagged the surface because the spec leaves room for implementations to differ on how the DPoP proof is validated against the access token. The Phase-A smoke tested five OAuth/OpenID implementations (pool verified-independent: Keycloak ships zero Nimbus code, so Spring Security and Keycloak are independent by construction). Forty probe cells across five implementations produced uniform REJECT on every proof-of-possession bypass attempt — 0-of-40 accepted. DPoP proof-validation is robust across the ecosystem.
U is the third honest null after Q and T, and the one with the sharpest security relevance. Q tested a certificate-reconstruction path; T tested a trust-anchor extension scope. U tested a proof-of-possession bypass — the exact attack surface that DPoP was designed to close. A uniform-CORRECT result on the marquee security surface of a proof-of-possession protocol is a GOOD security result, honestly reported as a null. The appropriate downstream artifact is the same as Q and T: a standards-body note that the ecosystem has converged, not a disclosure.
Q + T + U together show three shapes of null across three unrelated protocol families (DNS, TLS, OAuth): doctrine in place from the start (Q), doctrine surviving its own learning curve (T), and doctrine applied to a security-critical bypass surface where a divergence would have been a serious finding (U). Three nulls are a pattern, not noise, and the credibility of the method's positive results rests partly on the demonstrated willingness to report "nothing is wrong here" when that is what the evidence says.
STOP-5 UNIFORM matters because it answers the credibility question a discerning reader would otherwise ask: does this method only produce results when something is wrong? The Q + T + U triple is the empirical answer — same discipline, same gates, sometimes the result is "everyone agrees and the standards body should ratify that." A method whose null results are trustworthy strengthens the credibility of its positive results.
Two episodes in the post-PROMOTE sequence produced first-pass verdicts the autonomous pipeline would otherwise have shipped — and the human discipline gate caught both. A third case (FINDING-010) recorded a maintainer pushback faithfully rather than spinning it as a finding. A fourth (the TRACK-1 class survey) bounds the apex FINDING-031 claim honestly. A fifth — the substrate-exercise-verification thread, now documented across three instances in Milestones U and V — shows the method catching its own substrate gaps in both directions: false negatives (a build error and a class-name bug that produced fake "results") and a would-be false positive (a candidate divergence confirmed genuine before filing via a decisive cross-check that ruled out a serialisation artifact). The cumulative weight of these five cases is that the method's negative space — what it refuses to ship — is at least as load-bearing as its positive results.
Milestone R targeted ACME client chain-persistence at RFC 8555 §9.1 across five clients. Phase-A Wave 4 ran against a replay substrate (a mock ACME server that returned byte-identical responses to all five clients) and produced a clean STOP-5 UNIFORM verdict: every cell ACCEPT, positive control PASS, marker satisfied. By every doctrine in place at the time, the verdict was a legitimate methodology output. The PM review rejected it.
The reasoning was structural. A replay substrate feeding byte-identical chain responses to all clients by construction cannot surface persist-byte divergence regardless of implementation differences — the marker (chain_persisted_marker=true) was satisfied by the wiring, not by the differential surface being exercised. Wave 5 re-ran the smoke against a live ACME issuer (Pebble): same probe set, same impl pool, same gate evaluators, same fixtures — only the substrate changed — and produced real STOP-4 DIVERGENT. Posh-ACME persists leaf+intermediate to disk; the four Unix clients persist the full chain. The divergence was real; the Wave-4 substrate had hidden it by construction.
This is the marquee story for the methodology's substrate-adequacy doctrine, codified after Milestone R: "a verdict is a methodology output only if the substrate structurally permits the differential the pack is designed to detect." Positive control PASS is necessary but not sufficient. The Wave-4 verdict would have shipped without the PM-side intervention; it would have been wrong; the doctrine that prevents the same class of error in future milestones was born from the catch.
Milestone S targeted TLS 1.2 trust-anchor handling at RFC 5246 §7.4.2 + RFC 5280 §6.1 across five TLS implementations. The autonomous pipeline's first-pass compute_verdict returned STOP-4-CANDIDATE-VALIDATE-ONLY on the P2.x sub-probes and STOP-5-UNIFORM on the P1.x sub-probes. The narrative re-interpretation that the assistant pushed PM toward read: "all 5 implementations broken on Basic Constraints; marquee finding."
The PM held the verdict at HYBRID disposition and queued a Task-0 fixture-oracle verification. The Task-0 result: the P1.x oracle was a fixture-oracle artifact, not a real divergence. RFC 5280 §6.1.1(d) defines the trust-anchor information used in path validation as the trusted issuer name, the trusted public key algorithm, the trusted public key, and optional public key parameters; the trust anchor's own Basic Constraints extension is not among those inputs. The 5/5 ACCEPT on P1.x was therefore correct conformant behaviour against the actual normative text: not a divergence and not a finding. The marquee re-interpretation would have been a phantom finding. The genuine P2.3 divergence (GnuTLS 3.8.3 alone fails to enforce Name Constraints permittedSubtrees URI-form) survived as FINDING-033.
This is the marquee story for the methodology's oracle-validity doctrine, codified after Milestone S: "a verdict that depends on a fixture's MANIFEST oracle MUST have the oracle RFC-grounded and the fixture byte-verified before a uniform-against-oracle result is claimed as a finding; N-implementation agreement against the oracle is prior evidence the implementations are correct." The probability that five mature TLS implementations share an identical hidden bug is small; the probability that an oracle constructed by pack authors from bibliographic recon was wrong is non-trivially larger.
R and S together establish a pattern. In two consecutive milestones the autonomous pipeline produced a first-pass verdict that would have shipped wrong: a false uniform in R (hidden by the replay substrate) and an over-dramatic re-interpretation in S (built on a miscalibrated oracle). The human verdict gate caught both. Two instances in two consecutive milestones is pattern, not noise, and it is the explicit basis for the project's autonomous=false binding at the verdict-shipping step. The discipline is the pairing of AI scale-work and human judgment; the value of the human half is what these two stories demonstrate.
FINDING-010 was filed 2026-05-15 as GHSA-p9jg-fcr6-3mhf against ongres/scram — a PostgreSQL SCRAM channel-binding library that, in channelBinding=REQUIRE mode, silently downgrades to non-PLUS when the server presents a modern signature algorithm whose OID the library's parser doesn't recognise. Two downstream Java client libraries (r2dbc-postgresql and vertx-pg-client) inherit the behaviour.
The maintainer's response is the canonical instance of a defensible counter-argument. Jorge / OnGres replied 2026-05-20: "r2dbc-postgresql and vertx-pg-client are not vulnerable when channelBinding=REQUIRE connection string is absent; these clients implicitly use the default 'PREFER' method, which is a best-effort use of channel binding; they are not vulnerable in this scenario as they do not require channel binding to function." The maintainer defended PREFER mode; the original filing addressed REQUIRE mode. Different scope.
The methodology's response was to record the scope correction faithfully in the project's publication notes, not to re-argue the maintainer's frame. The disclosure record carries both the original filing and the maintainer counter-argument verbatim. Recording maintainer pushback honestly is a credibility multiplier: a paper that reports every divergence as a confirmed vulnerability would not survive review by a sharp engineer who can construct the same PREFER-vs-REQUIRE distinction. Reporting the pushback explicitly is what makes the rest of the disclosure record trustworthy.
The TRACK-1 class survey on JKU/key-fetch-SSRF (the class that produced FINDING-031 → CVE-2026-48522) closed 2026-05-22 with a deliberately conservative reading. Seven candidates in the exact same class were already remediated when the survey began: PyJWT (now patched following FINDING-031), Apache Jena (CVE-2024-21643), WildFly Elytron (CVE-2024-1233), OpenSAML pre-3.2.4, Keycloak (CVE-2026-1180), Spring Authorization Server (CVE-2026-22752), Centrifugo (CVE-2026-32301). The two remaining Java candidates with no prior fix in this class (jose4j, auth0/jwks-rsa-java) were chassis-capped: their HTTP client path casts the connection to HttpURLConnection, which blocks file:// at the connect step before the JKU URI ever reaches the verifier. The python3-saml library, which ships the same urllib chassis as PyJWT, carries a README caller-responsibility warning that materially weakens the defect framing (PyJWT had no such warning, which is why the maintainer accepted and fixed it).
The conclusion the project recorded: FINDING-031 was a late nugget in a largely-mined seam, not the start of a rich vein. The discovery itself is a whitepaper deliverable: the class was checked systematically; it is mostly remediated; PyJWT was the one live finding remaining. That bounds the FINDING-031 real-world-validation claim honestly. The method found one CVE in a class that nine prior CVEs had already partially closed; not "AI found an entirely new class of bug" but "AI did a systematic survey, found the one remaining unfixed instance in a known-vulnerable class, and the finding ran the full external-validation gauntlet." That is the framing the empirical record supports.
The R and S stories above (§6.1, §6.2) are each a single instance of the human gate catching an autonomous-pipeline overreach. A separate thread — now documented across three instances in Milestones U and V — shows the method catching substrate-level gaps that would have produced either fake results or fake confirmations, and the correction running in both directions between the human verdict-authority and the autonomous executor.
Instance 1: a build error producing fake markers (Milestone U, atproto crate). A Cargo crate-name typo in the Rust DPoP wrapper meant the wrapper never built against the real library. It emitted "not-applicable" markers on every cell — markers that looked like legitimate negative results but were actually build-failure artifacts. The gap was caught at Wave 3.5 when the substrate-exercise-verification check (does this wrapper actually invoke the real implementation method?) was applied. The typo was fixed, the wrapper rebuilt, and the cells re-run. Without the check, the markers would have entered the matrix as legitimate data.
Instance 2: a class-name bug producing fake labels (Milestone U, Spring Security). The Spring DPoP wrapper used a wrong fully-qualified class name, so the wrapper emitted a label ("DPoP proof validated") without ever invoking DPoPAuthenticationProvider.authenticate() — the method that actually performs proof-of-possession validation. Separately, the P5 probe's own positive-control marker was hardcoded label-logic rather than a genuine exercise of the validation path. The PM had wrongly inferred substrate-adequacy from that P5 marker — the autonomous executor's source-level inspection corrected the PM's reasoning, not the other way around. Both gaps were caught at Wave 4.5 and fixed.
Instance 3: a candidate divergence confirmed before filing (Milestone V, structured-headers JS). Milestone V's scan flagged a candidate type-confusion divergence: the structured-headers JavaScript library appeared to lose the RFC 9651 Integer-vs-Decimal type distinction where five other parsers preserved it. The initial signal could have been a wrapper-serialisation artifact (JavaScript's Number type collapsing 2.0 to 2 at the serialisation boundary rather than at the parser level). Before the divergence was filed, a decisive cross-check was run: the value 2.5, which cannot be confused with an Integer regardless of serialisation, was fed through the same path. The library preserved 2.5 as Decimal — confirming that the parser does distinguish fractional values from integers, but collapses integer-valued Decimals like 2.0 to bare Integer 2. The divergence was genuine at the parser level, not a wrapper artifact. On this instance the PM had initially called it an artifact; the executor's source-level analysis proved it genuine — the second time in the same session that the executor corrected the PM.
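The two-input probe described above can be sketched in a few lines. This is an illustration only: `reference_parse` and `collapsing_parse` are hypothetical stand-ins written for this example, not the structured-headers library or any of the five other parsers, and the string labels are assumptions. The sketch shows only the shape of the check: 2.0 is the ambiguous input, 2.5 is the decisive one.

```python
from decimal import Decimal
from typing import Callable, Tuple

Typed = Tuple[str, str]  # (type label, canonical text)


def reference_parse(token: str) -> Typed:
    """Hypothetical parser that keeps the Integer-vs-Decimal distinction."""
    return ("Decimal", token) if "." in token else ("Integer", token)


def collapsing_parse(token: str) -> Typed:
    """Hypothetical parser that turns integer-valued Decimals into Integers."""
    if "." in token:
        value = Decimal(token)
        if value == value.to_integral_value():
            return ("Integer", str(int(value)))
        return ("Decimal", token)
    return ("Integer", token)


def cross_check(parse: Callable[[str], Typed]) -> str:
    """Probe with the ambiguous value (2.0) and the decisive value (2.5)."""
    ambiguous_type = parse("2.0")[0]
    decisive_type = parse("2.5")[0]
    if decisive_type != "Decimal":
        # The path loses fractional values too, so the probe cannot
        # localise the loss to integer-valued Decimals.
        return "inconclusive"
    if ambiguous_type == "Decimal":
        return "preserves distinction"
    return "collapses integer-valued Decimals"
```

Running `cross_check` over both stand-ins separates the two behaviours. In the real milestone the same pair of inputs was driven through the actual wrapper path, and the executor's source-level analysis carried the final call.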
The three instances together are the single most differentiating honesty point in the project's record. The method catches its own substrate gaps in both directions — false negatives that would have produced empty or fake data (instances 1 and 2), and a would-be false positive that was confirmed genuine before filing (instance 3). The human/automation check is genuinely two-way: the PM's judgment was corrected by the executor's source inspection twice (U/P5-substrate and V/initial-artifact-call), not just the other way around. A method that only catches mistakes in one direction — human catching AI — is a supervision architecture. A method where the AI's source-level evidence also corrects the human's judgment calls is a collaborative architecture, and that collaboration is what makes the verdict pipeline trustworthy. A faithfully recorded "we nearly mis-called this twice and caught it" is stronger credibility than a clean record that invites the question of whether the clean record was real.
The scanner operates over a snapshot of the IETF corpus and ranks individual sections by a composite score that combines spec-silence markers (categorical absence of normative language on a behaviour the implementation must decide), permissive markers (MAY choose, may be omitted, at the discretion of, implementation-defined, implementation-specific), and linguistic markers that flag soft-norm constructions adjacent to a security-relevant decision. The scanner is deliberately a spec-silence detector, not a MUST-violation detector — it surfaces sites where the text leaves a question open, not sites where the text mandates a behaviour the implementation then violates.
Each candidate row carries a composite_score in [0,1], a marker_class (hard for categorical absence and permissive markers; soft for linguistic markers), and a section anchor. The scanner does not produce a security claim. It produces a queryable inventory of sites worth empirically probing.
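A minimal sketch of what one candidate row and the ranking step might look like. The permissive phrases are the ones quoted above; everything else is an assumption made for illustration: the `Candidate` class, the `saturation` constant, and the scoring formula are hypothetical, and the real scanner also weighs spec-silence and linguistic markers that this toy version ignores.

```python
from dataclasses import dataclass
from typing import List, Optional

# Permissive markers quoted in the text above, lowercased for matching.
PERMISSIVE_MARKERS = (
    "may choose",
    "may be omitted",
    "at the discretion of",
    "implementation-defined",
    "implementation-specific",
)


@dataclass(frozen=True)
class Candidate:
    section_anchor: str
    composite_score: float       # always in [0, 1]
    marker_class: Optional[str]  # "hard", "soft", or None when nothing matched


def count_permissive(text: str) -> int:
    """Count occurrences of the permissive phrases, case-insensitively."""
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in PERMISSIVE_MARKERS)


def score_section(anchor: str, text: str, saturation: int = 5) -> Candidate:
    """Toy score (assumption): permissive hits / saturation, capped at 1.0."""
    hits = count_permissive(text)
    score = min(1.0, hits / saturation)
    return Candidate(anchor, score, "hard" if hits else None)


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Highest composite_score first; an inventory, not a security claim."""
    return sorted(candidates, key=lambda c: c.composite_score, reverse=True)
```

The output of `rank` is only a probing queue. Nothing in it says an implementation is wrong, which matches the scanner's stated role as a spec-silence detector.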
The scanner produces two different signals, not one, and conflating them is a load-bearing methodological error. composite_score ranks RFC-text ambiguity density — how much soft-norm language is packed into a section, weighted by marker class. A second per-candidate analysis, run as a separate step before any milestone commits to a target, assesses whether that ambiguity will produce observable cross-implementation divergence: a parser_differential flag, a hypothesised severity, and an exploit-class label. These are different signals. A high-density section whose ambiguity is internal-memory-representation or whose normative text is actually an unambiguous MUST-NOT (the density scorer is fooled by the proximity of MAY-language references; the deeper analysis is not) will produce a guaranteed STOP-5 UNIFORM and a wasted milestone.
The selection primitive is therefore: confirm parser_differential = TRUE before committing a milestone to a target; never select on density score alone. Empirically, this gate is load-bearing. The most recent target-selection cycle in the project's record disqualified the three highest-composite-score net-new candidates (RFCs covering a logotype extension, an SSH host-key algorithm fallback, and a Kerberos string-to-key derivation) because each, on deeper analysis, was either an unambiguous MUST-NOT (no divergence to observe), an implementation-defined-fallback that is standard spec pattern (no parser-differential surface), or an internal-memory-representation ambiguity (no wire-format observability). The real target — a DANE TLSA Usage 2 surface at composite_score = 0.83, lower than all three disqualified candidates — was found only by running the deeper analysis down the queue until a parser-differential positive surfaced. All shipped real-divergence milestones in the corpus had parser_differential = TRUE at this gate. Several higher-density candidates were correctly skipped because their analysis showed no observable-divergence potential.
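The selection primitive can be sketched as a walk down the density-ranked queue that stops at the first parser-differential positive. The `Analysed` class and `select_target` function are hypothetical names for this example. In the usage below, only the 0.83 score and the DANE anchor come from the text; the three higher scores are placeholders, since the paper does not state them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Analysed:
    label: str
    composite_score: float     # RFC-text ambiguity density
    parser_differential: bool  # result of the separate per-candidate analysis


def select_target(queue: Iterable[Analysed]) -> Optional[Analysed]:
    """Return the highest-scoring candidate that passes the gate, if any."""
    for cand in sorted(queue, key=lambda c: c.composite_score, reverse=True):
        if cand.parser_differential:
            return cand
    return None  # no viable target: do not fall back to density alone


# Placeholder scores for the three disqualified candidates (assumed values).
queue = [
    Analysed("logotype extension", 0.91, False),
    Analysed("SSH host-key algorithm fallback", 0.88, False),
    Analysed("Kerberos string-to-key derivation", 0.86, False),
    Analysed("DANE TLSA Usage 2 (RFC 7671 §5.2.2)", 0.83, True),
]
target = select_target(queue)
```

Returning `None` rather than the top-density row when nothing passes is the design point: a milestone committed on density alone is the wasted-milestone case described above.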
The methodology runs two complementary candidate-selection tracks. Track 1 (scanner-driven) picks targets from the top-N rows of the scanner database at a given composite_score threshold and passes them through the parser-differential gate; this biases toward spec-clarification findings because the scanner detects ambiguity, not normative violations. Track 2 (audit-driven) picks targets from a CVE-lineage inventory — a library class that has produced prior vulnerabilities in the same probe surface — and tests whether the structural defect recurs in adjacent or descendant code paths. Track 2 biases toward CVE-class findings because the target-selection criteria pre-filter for surfaces where prior vulnerabilities exist.
Both tracks share the downstream pipeline: comparator-matrix lock, Phase-A smoke, verdict emission, finding-draft staging, and per-milestone retrospective. The discipline that governs track-2 target selection (≥2 prior CVEs in the surface; ≥1 normative text anchor; ≥5 actively-maintained implementations; ≥4 probe families before lock) was formalised after the third milestone and ratified after the fifth. FINDING-031 is the canonical Track-2 outcome: a CVE-lineage class that produced a confirmed real-world CVE; the TRACK-1 survey in §6.4 bounds the apex claim honestly.
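The four Track-2 thresholds lend themselves to an explicit pre-lock check. The sketch below assumes a hypothetical `Track2Surface` record and `track2_gaps` helper; only the threshold values are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Track2Surface:
    prior_cves: int         # prior CVEs in the same probe surface
    normative_anchors: int  # normative text anchors for the behaviour
    maintained_impls: int   # actively-maintained implementations in the pool
    probe_families: int     # probe families authored before matrix lock


def track2_gaps(surface: Track2Surface) -> List[str]:
    """List every unmet Track-2 criterion; an empty list means lock may proceed."""
    gaps = []
    if surface.prior_cves < 2:
        gaps.append("needs >=2 prior CVEs in the surface")
    if surface.normative_anchors < 1:
        gaps.append("needs >=1 normative text anchor")
    if surface.maintained_impls < 5:
        gaps.append("needs >=5 actively-maintained implementations")
    if surface.probe_families < 4:
        gaps.append("needs >=4 probe families before lock")
    return gaps
```

Returning the full list of gaps instead of a single boolean keeps the retrospective honest about why a surface was deferred.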
Each milestone produces a pack: a self-contained directory containing per-implementation environment directories, a probe-class inventory, deterministic fixture data, and a verifier wrapper for each implementation. The harness shape varies with the spec shape. Wire-protocol families (DNS-over-TLS, SMTP STARTTLS, LDAP, SAML signed-document verification, SIP, ACME) run multi-process daemons under controlled network conditions with a mock authority server (or a live in-test issuer where substrate-adequacy requires it — see §7.7) that emits forged, expired, or otherwise out-of-policy artifacts. Library-API families (JWT/JOSE, OpenPGP packet parsing, URL parsing, CBOR) drive in-process verifier wrappers across language ecosystems. Both shapes converge on a uniform output shape: one cell per (probe family, implementation) pair, labelled with the implementation's verdict and the matrix's oracle (the spec-derived expected behaviour — reject, accept, accept-or-reject-consistently, observable-mismatch).
A Phase-A smoke run terminates with exactly one of three outcomes:
STOP-4 DIVERGENT: at least one implementation ACCEPTs an input the oracle classifies as reject (or the inverse), with at least one other implementation behaving differently on the same input. Triggers row-level disclosure-track decomposition and finding-draft authoring in the project's findings/INBOX.json. Downstream disposition (file as CVE, file as spec-clarification draft to the relevant IETF working group, or document as a null result with maintainer-correct-by-design framing) is a human-cadence decision per finding.
Distinguishing STOP-5 UNIFORM from HARNESS-ERROR is not always trivial. A run where every implementation REJECTs looks identical to a run where every implementation's input is rejected by an earlier transport-layer failure that never reached the code under test. The principle is therefore: a probe run must prove it reached the code under test before any verdict is credible. For a dereferencing probe family (one whose entire purpose is to elicit an implementation-side fetch of an external artifact), at least one cell must show evidence the implementation actually fetched and processed the artifact: a non-zero reached-the-path signal, typically counted as requests landing at a mock authority server. Without that signal, a uniform-reject output is just as likely to be transport-bounce as it is to be the implementations correctly rejecting.
The primitive's value was first established empirically by the SIP Identity milestone (§4.3) where a three-pass harness-repair arc — false PARTIAL, then false UNIFORM rejected at the project's Wave-4-Gate-1 review because every probe ran with zero requests to the mock cert authority, then real STOP-4 DIVERGENT once a positive control was established — caught a verdict the autonomous pipeline would otherwise have shipped. The gate has been promoted from per-pack discipline to a methodology-level precondition: a probe family whose expected behaviour is purely a binary yes/no decision with no side-effect cannot use side-effect counting as its positive control, but it can use a baseline-sanity row — at least one cell where the harness drives a known-conformant input that the implementation is expected to and observably does ACCEPT. Either form of positive control is sufficient. The absence of either is disqualifying.
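The positive-control precondition and the verdict labels discussed above can be put together in a short sketch. The `Cell` fields and both function names are assumptions for illustration, not the project's actual compute_verdict. The result is deliberately called a draft verdict, since shipping it is a human decision.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Cell:
    probe_family: str
    implementation: str
    observed: str                  # "ACCEPT" or "REJECT"
    oracle: str                    # expected behaviour, same vocabulary
    reached_path: int = 0          # e.g. requests seen by the mock authority
    baseline_sanity: bool = False  # True for a known-conformant input row


def positive_control_ok(cells: List[Cell]) -> bool:
    """Either form of positive control is sufficient; neither is disqualifying."""
    side_effect = any(c.reached_path > 0 for c in cells)
    baseline = any(c.baseline_sanity and c.observed == "ACCEPT" for c in cells)
    return side_effect or baseline


def compute_draft_verdict(cells: List[Cell]) -> str:
    if not cells or not positive_control_ok(cells):
        return "HARNESS-ERROR"  # no proof the code under test was reached
    observed: Dict[str, Set[str]] = {}
    mismatched: Set[str] = set()
    for c in cells:
        if c.baseline_sanity:
            continue  # control rows are not part of the differential
        observed.setdefault(c.probe_family, set()).add(c.observed)
        if c.observed != c.oracle:
            mismatched.add(c.probe_family)
    # Divergent: some cell contradicts the oracle AND implementations disagree.
    divergent = any(len(observed[f]) > 1 for f in mismatched)
    return "STOP-4 DIVERGENT" if divergent else "STOP-5 UNIFORM"
```

Note that a matrix where every implementation contradicts the oracle in the same way still comes out as uniform here. That is intentional in this sketch: such a result points at the oracle first, and needs its own verification before anything is claimed.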
The target-selection and positive-control gates above are necessary but not sufficient. A pack can pass both and still ship a verdict that is not a methodology output. Three additional doctrines, each born from a specific empirical failure the gates did not catch, together condition verdict credibility on conditions upstream and downstream of the gates themselves.
Substrate-adequacy. A verdict is a methodology output only if the substrate structurally permits the differential the pack is designed to detect. A replay substrate feeding byte-identical responses by construction cannot surface persist-byte divergence regardless of implementation differences; the positive-control PASS is satisfied by wiring, not by the surface being exercised. Operational gate: at pack-authoring, ask an explicit "does this substrate permit the differential we're testing for?" question before any Wave runs. §6.1 walks the marquee story (Milestone R: replay Wave-4 false-uniform caught at PM review; live Wave-5 produced real STOP-4 DIVERGENT).
Oracle-validity. A verdict that depends on a fixture's MANIFEST oracle must have the oracle RFC-grounded and the fixture byte-verified before a uniform-against-oracle result is claimed as a finding. N-implementation agreement against a constructed oracle is prior evidence the implementations are correct and the oracle is miscalibrated, not that all N share an identical undiscovered bug. The probability of N mature implementations sharing an identical hidden defect is small; the probability of an oracle constructed by pack authors from bibliographic recon being wrong is non-trivially larger. §6.2 walks the marquee story (Milestone S: P1.x oracle miscalibration caught by Task-0 verification grounded in RFC 5280 §6.1.1(d)).
Infra-testability. A scanner candidate is a viable HUNTER target only if the project's own infrastructure can actually exercise the divergence under crafted input, offline, without depending on live third-party services. High composite_score does not override this. A candidate whose surface is configuration-defined-but-not-crafted-input-testable (an SSH host-key-algorithm fallback in an SSH daemon, a Kerberos string-to-key derivation in an Active Directory deployment) is correctly disqualified at target-selection regardless of its ambiguity density. Empirically: every shipped HUNTER finding in the project's record was surfaced by execution, not by analysis. The corollary: pure analysis of an RFC's ambiguity is hypothesis generation, not result production.
Substrate-adequacy and oracle-validity together form the precondition stack at the substrate × oracle product. Infra-testability moves the gate further upstream: the candidate's candidacy is conditional on being exercisable at all. Each of the three doctrines was added after an empirical failure the prior gate stack did not prevent; each is now an explicit checklist item at the relevant project stage.
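As a checklist, the three doctrines reduce to a small precondition record evaluated before a verdict is treated as a methodology output. The field names and the `blocking_doctrines` helper below are hypothetical; the answers to each question still come from human review, not from code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PackPreconditions:
    offline_exercisable: bool             # infra-testability
    substrate_permits_differential: bool  # substrate-adequacy
    oracle_rfc_grounded: bool             # oracle-validity, part 1
    fixture_byte_verified: bool           # oracle-validity, part 2


def blocking_doctrines(pre: PackPreconditions) -> List[str]:
    """Name each doctrine that currently blocks the verdict (empty = clear)."""
    blocked = []
    if not pre.offline_exercisable:
        blocked.append("infra-testability")
    if not pre.substrate_permits_differential:
        blocked.append("substrate-adequacy")
    if not (pre.oracle_rfc_grounded and pre.fixture_byte_verified):
        blocked.append("oracle-validity")
    return blocked


# Illustrative reading of a replay-substrate run like Milestone R's Wave 4:
# everything else in order, but the substrate cannot show the differential.
replay_wave = PackPreconditions(True, False, True, True)
```

For `replay_wave` the helper reports only substrate-adequacy, which is the shape of the Wave-4 catch: a clean verdict that the wiring, not the surface, had produced.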
The following table summarises the nineteen gated-shipped milestones that ran under the methodology's ratified discipline, plus one pre-doctrine reference row (Milestone C, SMTP STARTTLS) shown for chronological context but excluded from the 19. Each row's verdict is taken from the pack's results/<pack>/matrix.md; finding counts are taken from findings/INBOX.json. Implementation identities for unfiled findings remain anonymised pending coordinated disclosure; the three publicly filed cases (FINDING-031, FINDING-007, FINDING-010) are named explicitly. Three exploratory packs authored before the discipline gates were formalised (CBOR / X.509 / WebAuthn) are reported in Appendix B rather than in this table, and are also excluded from the 19. Track legend: 1 = scanner-driven, 2 = audit-driven.
| Pack | RFC / Spec | Ambiguity class | Pool | Verdict | Findings | Track |
|---|---|---|---|---|---|---|
| ldap-bind-undefined | RFC 4513 §5.1.3 | server-behaviour-undefined | 3 LDAP servers | STOP-4 DIVERGENT | FINDING-007 (filed Red Hat 2026-05-12) | 1 |
| smtp-starttls-permissive | RFC 3207 §4.1 | permissive-continue | 3 mainstream MTAs | STOP-5 UNIFORM | none (documented-null; pre-doctrine) | 1 |
| tls-server-end-point-undefined | RFC 5929 §10.1 | hash-classification-undefined | 6 PostgreSQL client stacks | STOP-4 DIVERGENT | FINDING-010 ongres/scram (GHSA-p9jg-fcr6-3mhf, filed 2026-05-15, Triage) | 2 |
| mls-keypackage-validation-undefined | RFC 9420 §10 / §10.1 | MUST-malformed-handling | 4 MLS implementations | STOP-4 DIVERGENT (2/4 silently accept) | (pre-INBOX lineage) | 2 |
| dkim-sha1-rejection-undefined | RFC 8301 §3.1 | MUST-NOT-but-not-prescribed | 3 DKIM verifier libraries | STOP-4 DIVERGENT (2/3 accept SHA-1) | FINDING-011, FINDING-012 | 1 |
| oauth-jwt-at-key-separation-undefined | RFC 9068 §4 vs §5 | key-separation-MAY | 5 OAuth servers | STOP-4 DIVERGENT (3/5 accept kid-absent) | FINDING-013..016 | 1 |
| jose-kid-confusion-undefined | RFC 7515 §4.1.4 + RFC 8725 §3.10 | kid-is-a-hint | 8 JOSE libraries across 7 languages | STOP-4 DIVERGENT (1/8 silently accepts all 5 probes) | FINDING-017 | 2 |
| saml-assertion-validation-undefined | SAML 2.0 core + CVE-2017-11428 / CVE-2024-45409 lineage | signature-wrap class | 6 SAML libraries | STOP-4 DIVERGENT (1/6 accepts 5/5; 2nd accepts 2/5) | FINDING-018, FINDING-019 | 2 |
| dnssec-validation-undefined | RFC 4033/4034/4035 + RFC 8624 §3.1 | chain-walk + algorithm policy | 8 DNS implementations | STOP-4 DIVERGENT (19/40 cells) | FINDING-020-A/B/C | 2 |
| url-parsing-undefined | RFC 3986 vs WHATWG URL | parse-canonicalize-divergent | 11 URL parsers across 7 languages | STOP-4 DIVERGENT (55/55 cells) | FINDING-021, FINDING-022 | 2 |
| dns-over-tls-auth-undefined | RFC 7858 §3.2 + RFC 8310 §5.1.3 | SHOULD/MAY pin-enforcement | 5 DoT resolvers | STOP-4 DIVERGENT (4/5 silently downgrade) | FINDING-023..026 | 1 |
| openpgp-self-sig-optional-undefined | RFC 9580 §10.2 | MAY-choose-to-omit | 5 OpenPGP implementations | STOP-4 DIVERGENT (P1 3-way split: 2 reject / 1 accept+mark-invalid / 2 accept+default-flags) | FINDING-027..029 | 1 |
| sip-identity-credential-source-undefined | RFC 8224 §7.2 + RFC 8225 | credential-discovery-implementation-specific | 5 SIP daemons (3 functional + 2 negative controls) | STOP-4 DIVERGENT (1/2 functional pair accepts identical fixture) | FINDING-030 | 1 |
| jku-jwks-fetch (PyJWT) | RFC 7515 §4.1.2 (JKU URI scheme) | scheme-validation-undefined | 5 JWT libraries / 5 ecosystems | STOP-4 DIVERGENT (1/5 ACCEPT file://; 4/5 REJECT) | FINDING-031 — fixed in PyJWT 2.13.0; CVE-2026-48522; GHSA-993g-76c3-p5m4 | 2 |
| dane-tlsa-usage2-undefined | RFC 7671 §5.2.2 | reconstruction-path-undefined | 2 DANE-native (OpenSSL + GnuTLS) + 3 neg-controls | STOP-5 UNIFORM (honest null; see §5) | none (documented-null) | 1 |
| acme-chain-persist-undefined | RFC 8555 §9.1 | client-persist-byte-undefined | 5 ACME clients | STOP-4 DIVERGENT (1/5 persists leaf+intermediate; 4/5 full chain) | FINDING-032 (defect-vs-design pending) | 2 |
| tls12-trust-anchor-undefined | RFC 5246 §7.4.2 + RFC 5280 §6.1 | name-constraint-uri-undefined | 5 TLS impls / 3 languages | STOP-4 DIVERGENT (1/5 fails to enforce Name Constraints URI-form on P2.3) | FINDING-033 (GnuTLS, maintainer-contact pending) | 2 |
| tls13-trust-anchor-omit | RFC 8446 §4.4.2 + RFC 5280 §6.1.1(d) | trust-anchor-extension-out-of-scope | 5 TLS impls / 3 languages | STOP-5 UNIFORM (honest null; see §5.2; oracle-validity applied prospectively) | none (documented-null) | 2 |
| dpop-token-type-resolution | RFC 9449 §5 | proof-of-possession-bypass | 5 OAuth/OpenID impls (verified-independent) | STOP-5 UNIFORM (honest null; see §5.3; proof-of-possession uniformly correct) | none (documented-null) | 2 |
| structured-headers-sfv-type | RFC 9651 §3.3.2 | integer-decimal-type-distinction | 6 SFV parsers (richest pool in project history; zero prior SFV CVEs) | STOP-4 DIVERGENT (1/6 loses Integer-vs-Decimal type distinction; 5/6 preserve it; confirmed at parser level via 2.5-cross-check) | pending PM-ACK disposition | 1 |
Count reconciliation (stated once, used consistently throughout this paper). The table above has twenty rows. Nineteen of those are the gated-shipped milestones — the milestones that ran under the methodology's ratified post-doctrine discipline. The twentieth row is Milestone C (SMTP STARTTLS, RFC 3207), included in the table for chronological completeness as the pre-doctrine baseline result and explicitly excluded from the 19. The three exploratory pre-discipline packs reported in Appendix B (CBOR / X.509 / WebAuthn) are also excluded from the 19. Of the 19 gated milestones, sixteen produced STOP-4 DIVERGENT verdicts and three produced STOP-5 UNIFORM — Milestone Q (DANE/RFC 7671), Milestone T (TLS 1.3/RFC 8446), and Milestone U (DPoP/RFC 9449), all shipped as post-doctrine honest nulls (see §5). The audit-driven J–V sub-sequence (Milestones J through V, May 2026 — thirteen gated milestones) is the project's strongest empirical claim about target-selection methodology under the ratified doctrine: ten STOP-4 DIVERGENT verdicts plus three STOP-5 UNIFORM (Q, T, and U) as honest nulls. Forty-two total Phase-A probe packs were authored across the project's history; the 19 gated + the 1 pre-doctrine reference + the 3 exploratory packs are the ones that produced reportable matrix output.
Twenty-five finding drafts are staged in findings/INBOX.json at the time of writing: FINDING-011 through FINDING-033, with FINDING-020 split into A/B/C, which accounts for the apparent gap between twenty-three finding numbers and twenty-five drafts. One has been externally validated and publicly resolved. FINDING-031 (PyJWT PyJWKClient JKU SSRF, §2 case study) was filed via GitHub's Private Vulnerability Report on 2026-05-06 as GHSA-993g-76c3-p5m4, accepted by the maintainer on 2026-05-21, fixed in PyJWT 2.13.0 the same day (bundled with four other security fixes in the release), and assigned CVE-2026-48522 by GitHub on 2026-05-22. Two earlier pre-INBOX-lineage findings, FINDING-007 (LDAP server resultCode divergence at RFC 4513 §5.1.3, filed Red Hat 2026-05-12) and FINDING-010 (PostgreSQL channel-binding stack against RFC 5929, filed 2026-05-15 as GHSA-p9jg-fcr6-3mhf, see §6.3), were filed with relevant maintainers and CNAs and remain in triage. The remaining twenty-four INBOX drafts carry status=staged_in_inbox or verified_divergent with pm_disposition=pending; none has been externally filed during the period covered by this paper, and implementation identities for those drafts are anonymised throughout (see Appendix C for the disposition snapshot).
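The draft count is easy to check by enumeration. The helper below is a hypothetical illustration: the identifier format, including the `-A` style suffix, is an assumption, and the real findings/INBOX.json schema is not shown in this paper.

```python
from typing import Dict, List, Optional


def inbox_draft_ids(first: int = 11, last: int = 33,
                    splits: Optional[Dict[int, str]] = None) -> List[str]:
    """Enumerate draft identifiers, expanding any finding split into sub-drafts."""
    splits = {20: "ABC"} if splits is None else splits
    ids: List[str] = []
    for number in range(first, last + 1):
        suffixes = splits.get(number)
        if suffixes:
            ids.extend(f"FINDING-{number:03d}-{s}" for s in suffixes)
        else:
            ids.append(f"FINDING-{number:03d}")
    return ids


draft_ids = inbox_draft_ids()
```

The range 011 to 033 covers twenty-three finding numbers; replacing FINDING-020 with its three sub-drafts gives the twenty-five staged drafts.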
One observation that single-protocol depth would have missed but that cross-protocol breadth surfaced reliably is the silent-downgrade rebuttal pattern. Findings filed against silent-downgrade-by-default behavior reliably trigger a maintainer rebuttal of the shape "the default mode is best-effort by design; the client does not require the security property to function; this is a feature request rather than a vulnerability." The rebuttal is internally consistent within each protocol's design philosophy. It is not technically wrong from the maintainer's frame: protocols with opportunistic security profiles accept partial protection by definition.
The disclosure-value of each individual instance lives outside the maintainer-correct-within-frame defense. Three instances in the project's record exhibit this shape: a DNS-over-TLS opportunistic-pin-enforcement surface (FINDING-023..026), an SMTP STARTTLS opportunistic surface (Milestone C, the project's pre-doctrine STOP-5 UNIFORM reference result), and a PostgreSQL channel-binding PREFER-mode surface (FINDING-010 lineage). The common reduction is identical across the three: server advertises feature X; MITM strips X; client silently accepts a session without X; application has no signal that the downgrade occurred. A maintainer-correct-by-frame defense holds in each instance individually. The pattern's argumentative weight only materialises when the three are presented together: silent-downgrade-by-default is a recurring CWE-757 (Selection of Less-Secure Algorithm During Negotiation) class affecting at least three protocol families, and the cross-protocol structural pattern is what shifts the argument from "this driver has a bug" to "silent-downgrade-by-default is a recurring spec-design class needing IETF/CWE attention."
A separate, sharper class also surfaced during the work and is worth distinguishing from the silent-downgrade-in-default-mode class: library bug masquerading as feature; explicit strict-mode silently violated. The canonical instance in the project's record is a PostgreSQL SCRAM channel-binding library against RFC 5929 tls-server-end-point. A client requests channelBinding=REQUIRE (explicit strict mode, not default PREFER); the server presents an Ed25519 certificate; the library's OID parser fails to recognise the modern algorithm-identifier shape (the parser was written against legacy RSA-style OIDs); the library silently downgrades to SCRAM-SHA-256 (non-PLUS) and the application configured with REQUIRE has no signal that the strict-mode contract was violated. This is CWE-358 (improperly implemented security check), not CWE-757, and the disclosure argument lives in the explicit-strict-mode violation rather than in the default-mode design philosophy. Conflating the two classes in disclosure correspondence weakens the argument on both sides; distinguishing them sharpens it.
The methodological lesson: same-class wire-level breaks across libraries can have different design-correctness postures, and that posture is itself a publishable data point. A naive cross-library divergence framing would lump default-PREFER and explicit-REQUIRE silent downgrades into a single "both silently downgrade" bucket. The HUNTER project's per-finding triage distinguishes them — and the publication weight of the cross-protocol pattern lives in the recurrence of silent-downgrade-by-default specifically, not in the broader claim that "implementations silently downgrade."
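The triage distinction can be stated as a small decision function. The sketch below is an illustration under stated assumptions: the `Mode` enum, the `classify` function, and its return labels are hypothetical, and the real per-finding triage involves human judgment that a three-argument function does not capture.

```python
from enum import Enum


class Mode(Enum):
    PREFER = "prefer"    # default, opportunistic: property is best-effort
    REQUIRE = "require"  # explicit strict mode: property is contractual


def classify(mode, property_negotiated, app_signalled):
    """Map one observed session onto the two classes distinguished in the text.

    mode                -- what the application configured
    property_negotiated -- whether the security property was actually in force
    app_signalled       -- whether the application received any error or warning
    """
    if property_negotiated or app_signalled:
        # Either nothing was lost, or the loss was visible: not a silent downgrade.
        return "no-finding"
    if mode is Mode.REQUIRE:
        # Strict-mode contract violated without a signal.
        return "CWE-358 (explicit strict mode silently violated)"
    # Default-mode design posture, defensible within the protocol's own frame.
    return "CWE-757 (silent downgrade by default)"


if __name__ == "__main__":
    print(classify(Mode.PREFER, property_negotiated=False, app_signalled=False))
    print(classify(Mode.REQUIRE, property_negotiated=False, app_signalled=False))
```

Both calls describe the same wire-level outcome; only the configured mode differs, and that difference is what changes the disclosure argument.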
The autonomous pipeline twice leaned toward the more dramatic interpretation; the human verdict gate caught both episodes (§6.1 R, §6.2 S). Two instances in two consecutive milestones is a pattern, not noise. The project's response was the autonomous=false binding at the verdict-shipping step: the assistant is permitted to compute a draft verdict; the assistant is not permitted to ship it. The doctrine trio (§7.7) is the structural encoding of that lesson: substrate-adequacy, oracle-validity, and infra-testability each formalise a specific class of failure the prior gate stack did not prevent. The pairing of AI scale-work and human judgment is not an aesthetic choice; it is the only configuration the empirical record supports.
Per-protocol harness cost varies by approximately 5×. Across the three most recent milestones (DoT, OpenPGP, SIP), the methodology's per-finding quality was consistent but the wall-clock cost to first verdict varied substantially. The DNS-over-TLS pack (mock DoT server plus five resolver daemons, each with its own per-impl config directory) and the OpenPGP pack (CLI binaries plus offline TSK fixtures) both reached verdict in one Phase-A run. The SIP pack required three Phase-A runs to reach a credible verdict — the first produced PARTIAL, the second produced a false STOP-5 UNIFORM caught by the positive-control gate at the project's PM review, and the third required a harness repair before any cell drove the implementation into the code path under test. Multi-daemon network-service families (SIP, with five SIP daemons each needing config plus module load plus message-routing before any probe reaches the credential-validation layer) are structurally harder than library-API families. The implication is that future target selection should bias toward cheap-to-test families (parsers, crypto libraries, CLI tools) unless the specific divergence in an expensive family is high-value enough to justify the additional setup cost.
The work is single-tester. All milestones reported here were conducted by one author with assistance from automated coding agents. The methodology's discipline gates (matrix-lock PM-ACK, Wave-4-Gate-1 PM-ACK, positive-control gating, parser-differential target gating) are designed to absorb the failure modes of a single-tester pipeline, but the absence of external replication is a real limitation. The artifacts (Phase-A matrices, finding drafts, scanner database) are durable and would in principle support external replication; this has not yet occurred.
Most findings remain candidate, not externally validated. Twenty-two of the twenty-five staged drafts carry status=staged_in_inbox or verified_divergent with pm_disposition=pending at the time of writing. Implementation identities for those drafts are anonymised throughout this paper for exactly this reason: publishing "named implementation X accepts malicious input Y" before the maintainer has fixed it is effectively public 0-day disclosure. Three findings are publicly resolvable: FINDING-031 (CVE-2026-48522, PyJWT, fixed) and FINDING-007 / FINDING-010 (filed with maintainers and remaining in triage at the time of writing). The disclosure-cadence decoupling from the methodology-cadence is deliberate (machine-pace mechanical work runs in minutes; human-pace disclosure decisions run in hours-to-days) but it means this paper reports the empirical work and cannot yet report on vendor-response patterns at filing scale.
STOP-5 UNIFORM is a legitimate outcome but reduces the pool of presentable findings. The SMTP STARTTLS pack produced uniform permissive responses across three production MTAs at the RFC 3207 §4.1 surface. This is reported as a documented-null outcome rather than as a methodology failure — the spec ambiguity is real but the ecosystem has converged on opportunistic-TLS-permissive as the de-facto behavior — and the appropriate downstream artifact is a spec-revision recommendation rather than a vulnerability disclosure. A naive read of the methodology as "find bugs" would underweight this class of result; a more honest framing is "characterise what the ecosystem has done about each ambiguity site, producing either vendor-disclosure or standards-body-recommendation output depending on which side of the divergence axis the empirical result lands on."
Verdict credibility depends on a positive control that not every probe family naturally affords. The SIP Identity pack made the positive-control discipline explicit because the probe's entire premise (dereferencing an info URI) produces a measurable mock-authority side-effect. Probe families where the expected behavior is purely a yes/no decision with no side-effect (a JWT signature verify; a TSK import that produces an in-memory keyring) cannot use side-effect counting as a positive control. For those families, the positive-control discipline reduces to a baseline-sanity row — at least one cell that is expected to accept a well-formed input and observably does — and the discipline depends on the harness author including that row.
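The positive-control discipline can be sketched as a guard in front of verdict emission. This is a hedged illustration, not the project's harness: `Cell`, `control_passed`, `verdict`, and the `NO-VERDICT` label are hypothetical, and the two control styles (side-effect count versus baseline-sanity accept) are modelled with a single optional field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    impl: str
    outcome: str                        # e.g. "accept" or "reject"
    is_control: bool = False            # True for a positive-control row
    side_effects: Optional[int] = None  # e.g. mock-authority hits; None if the
                                        # probe family has no observable side-effect


def control_passed(cells):
    """Every control must show that the probe reached the code under test."""
    controls = [c for c in cells if c.is_control]
    if not controls:
        return False  # no control row at all: nothing vouches for the matrix
    for c in controls:
        if c.side_effects is not None:
            if c.side_effects < 1:      # side-effect family: expect at least one hit
                return False
        elif c.outcome != "accept":     # yes/no family: baseline-sanity row must accept
            return False
    return True


def verdict(cells):
    """Emit a verdict only when the positive control holds."""
    if not control_passed(cells):
        return "NO-VERDICT"  # harness failure, not evidence of uniformity
    outcomes = {c.outcome for c in cells if not c.is_control}
    if not outcomes:
        return "NO-VERDICT"
    return "STOP-4 DIVERGENT" if len(outcomes) > 1 else "STOP-5 UNIFORM"
```

Under this sketch, a matrix in which every implementation rejects but the control row recorded zero side-effects yields NO-VERDICT rather than STOP-5 UNIFORM, which is the shape of the false uniform result the gate caught in the SIP pack.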
Single-tester selection bias risk on track-2 target selection. The CVE-lineage inventory that drives track-2 candidate selection is curated. The project's track-2 doctrine specifies the gates (≥2 prior CVEs; ≥1 normative anchor; ≥5 actively-maintained implementations; ≥4 probe families) but the choice of which CVE class to mine next is a judgement call. The track record of 10 STOP-4 DIVERGENT + 3 STOP-5 UNIFORM across the J–V audit-driven sub-sequence is consistent with the methodology working as intended; it is also consistent with selection bias toward surfaces likely to produce divergence. The TRACK-1 class survey on JKU/key-fetch-SSRF (§6.4) is one bound on this concern — it shows the J–V class space is mostly mined and PyJWT was the live nugget remaining, not the start of a fertile vein. A purely scanner-driven (track-1) baseline at composite_score ≥ 0.85 from the same period — with the parser-differential gate applied honestly — would still be needed to fully disentangle methodology-yield from selection-bias. That baseline experiment has not been run.
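The four track-2 thresholds are mechanical even though the choice of CVE class is not. A minimal sketch, assuming a hypothetical `Candidate` record and field names; only the numeric thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    prior_cves: int         # prior CVEs in the class
    normative_anchors: int  # normative spec clauses the probe can cite
    maintained_impls: int   # actively maintained implementations
    probe_families: int     # distinct probe families available


# Minimums as stated in the track-2 doctrine.
GATES = {
    "prior_cves": 2,
    "normative_anchors": 1,
    "maintained_impls": 5,
    "probe_families": 4,
}


def failed_gates(candidate):
    """Names of the gates this candidate does not meet, in doctrine order."""
    return [name for name, minimum in GATES.items()
            if getattr(candidate, name) < minimum]


def eligible(candidate):
    """True when every track-2 gate is met."""
    return not failed_gates(candidate)


if __name__ == "__main__":
    example = Candidate("example-class", prior_cves=3, normative_anchors=1,
                        maintained_impls=4, probe_families=5)
    print(eligible(example), failed_gates(example))
```

A check like this bounds which candidates may be chosen; it does nothing about which eligible candidate the tester picks next, which is where the selection-bias concern lives.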
The parser-differential gate is itself a heuristic. The per-candidate analysis that returns parser_differential=TRUE or FALSE is a deeper LLM-driven read of the candidate section than the density scorer performs, but it is still a model output. False negatives — sites where the model concludes no observable divergence but where empirical probing would in fact have produced divergence — are unobservable by construction; the project never runs Phase-A against gate-rejected candidates and so cannot measure how often the gate is wrong in that direction. False positives, by contrast, are observable as STOP-5 UNIFORM outcomes when the gate said TRUE; the project's post-doctrine STOP-5 UNIFORM rate (Milestones Q, T, and U out of nineteen gated-shipped) is consistent with a low false-positive rate at the gate, but the sample is small.
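The caveat about sample size can be made concrete. The sketch below computes a Wilson score interval for three STOP-5 UNIFORM outcomes in nineteen gated-shipped milestones. The choice of the Wilson interval and the 95% level is an assumption made for illustration; the paper does not report an interval of its own.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    uniform, gated = 3, 19  # Milestones Q, T, U out of nineteen gated-shipped
    low, high = wilson_interval(uniform, gated)
    print(f"observed {uniform / gated:.1%}, interval {low:.1%} to {high:.1%}")
```

With nineteen observations the interval spans from the mid single digits to more than a third, so the observed rate of about 16% does not by itself pin down the gate's false-positive rate.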
HUNTER is a methodology, not a tool. The contribution is narrow and precise: RFC-ambiguity sites are an enumerable, queryable surface; cross-implementation behaviour on those sites is empirically measurable through a per-pack differential harness; the harness's verdicts separate divergence, uniformity, and harness failure, provided two gates constrain target selection and verdict emission, and three additional doctrines (substrate-adequacy, oracle-validity, infra-testability) constrain verdict credibility. The target-selection gate disqualifies high-density candidates whose ambiguity will not produce observable divergence, before any milestone commits resources to them. The positive-control gate disqualifies any verdict whose probe family produced no evidence of reaching the code under test. The doctrine trio conditions verdict credibility on whether the substrate can structurally surface the differential, whether the oracle is RFC-grounded against the actual normative text, and whether the candidate is exercisable on the project's own infrastructure at all. Each was added after an empirical failure the prior stack did not catch.
Across nineteen gated-shipped milestones the methodology produced sixteen STOP-4 DIVERGENT verdicts and three STOP-5 UNIFORM honest nulls (Milestone Q, DANE/RFC 7671; Milestone T, TLS 1.3/RFC 8446 — with the oracle-validity doctrine applied prospectively, see §5.2; and Milestone U, DPoP/RFC 9449 — proof-of-possession uniformly correct across five implementations, see §5.3). Twenty-five finding drafts are staged in findings/INBOX.json; one has been externally validated and publicly resolved — CVE-2026-48522 against PyJWT, fixed in PyJWT 2.13.0 with the reporter credited and GitHub-assigned 2026-05-22. The audit-driven J–V sub-sequence (thirteen gated milestones, 10 STOP-4 DIVERGENT + 3 STOP-5 UNIFORM) is the project's strongest empirical claim about target-selection methodology under the ratified doctrine. The cross-protocol observation that emerged during the work — silent-downgrade-by-default as a recurring CWE-757 class whose maintainers correctly defend each instance within their own protocol's framing — required cross-protocol breadth to surface and would have been invisible to a single-protocol study.
The work's limitations are real and reported honestly. The autonomous pipeline twice leaned dramatic and was caught by the human verdict gate (§6.1 R, §6.2 S); per-protocol harness cost varies by approximately 5×; the work is single-tester; most findings remain candidates pending coordinated disclosure; the parser-differential gate is itself a heuristic with unobservable false negatives; the track-2 selection record is bounded by the TRACK-1 class survey (§6.4) but a fully scanner-driven baseline has not been run. The methodology's value proposition is not "AI finds bugs" — it is "AI scales the corpus work that wasn't previously feasible; human discipline gates what counts as a real finding; the pairing produces results trustworthy enough to file, sometimes through a STOP-4 DIVERGENT outcome, sometimes as a STOP-5 UNIFORM spec-revision recommendation, always with the substrate × oracle × infra preconditions checked."
The artifacts (Phase-A smoke matrices, INBOX drafts, scanner database, per-milestone retrospectives) are durable. The disclosure pipeline is decoupled from the methodology pipeline and runs at human cadence. The paper is a snapshot of the empirical record as of May 2026; subsequent milestones, external filings, and vendor responses will appear in follow-on writing — at which point individual case-study rows in this paper can be un-masked one finding at a time as each clears coordinated disclosure.
HUNTER Research, May 2026.
Terms encountered in this paper, grouped by the role they play and roughly the order they first appear. One-line definitions; the methodology-specific terms are expanded in full in the body. Term names are linked to authoritative external references (IETF datatracker, MITRE, W3C, Wikipedia).
composite_score
parser_differential
composite_score with parser_differential = FALSE is a near-guaranteed wasted milestone.
parser_differential = TRUE before a candidate becomes a milestone. Density alone is insufficient; see §7.2.
reject, accept, accept-or-reject-consistently, or observable-mismatch.

Three probe packs were authored before the discipline gates (target-selection / positive-control / doctrine trio) were formalised in the project's methodology log. They produced data points that have stood up across subsequent re-reads but were not gated through the full machinery. They are reported here for completeness rather than as load-bearing milestones.
| Pack | Spec | Pool | Observation |
|---|---|---|---|
| cbor | RFC 8949 §3.2 / §5.1 | 2 CBOR libraries | Narrow divergence on trailing-bytes handling; no map-key class signal at the time of the run. |
| x509 | X.509 path validation | 6 X.509 stacks | Multi-row divergence (1/6 accepts where 5/6 reject); chain-validation edge cases known to recur across published prior art. |
| webauthn | WebAuthn L2 / L3 fragments | 5 WebAuthn libraries | Multi-row structural-validation divergence; pre-discipline pack, no finding drafts staged. |
The reason these are not in the §8 master table is structural, not dismissive: none of the three passed through the parser-differential target-selection gate (§7.2), the positive-control verdict gate (§7.6), or the substrate-adequacy / oracle-validity preconditions (§7.7) the project formalised after the fifth gated milestone. Their observations are recorded but not claimed as gated-shipped findings.
State of the project's findings/INBOX.json at the time of writing (May 2026). Externally validated entries are named; remaining drafts are listed by INBOX identifier with anonymised vendor descriptions.
FINDING-031 (PyJWT, CVE-2026-48522): PyJWKClient JKU SSRF. GHSA-993g-76c3-p5m4. Filed 2026-05-06; accepted 2026-05-21; fixed in PyJWT 2.13.0 (bundled with four other security fixes); CVE assigned by GitHub 2026-05-22. Externally validated; publicly fixed.

FINDING-010 (PostgreSQL channel binding, GHSA-p9jg-fcr6-3mhf): tls-server-end-point silent downgrade in channelBinding=REQUIRE mode. Filed 2026-05-15. Maintainer (Jorge / OnGres) partially acknowledged, with a PREFER-vs-REQUIRE scope clarification, on 2026-05-20; see §6.3. In triage at time of writing.

Twenty-two further drafts (FINDING-011 through FINDING-030, plus FINDING-032 and FINDING-033) are staged. None has been externally filed during the period covered by this paper. Vendor identities remain anonymised pending coordinated disclosure. Once each clears disclosure, the corresponding row in §8 can be un-masked one finding at a time.
kid-absent tokens).
kid-is-a-hint silent acceptance (1 of 8 JOSE libraries across 7 languages).