Dead.Letter: Critical Exim RCE Sparks XBOW's AI-vs-Human Exploit Race
XBOW disclosed a CVSS 9.8 unauthenticated RCE in Exim — and used the seven-day window to race its autonomous LLM exploit pipeline against human researchers.
XBOW used the disclosure window for a critical Exim flaw as a stopwatch, racing its human researchers against an autonomous LLM exploit-development pipeline. The Dead.Letter result reframes what "AI-assisted vulnerability research" means.
SAN FRANCISCO, CA - Exim maintainers patched CVE-2026-45185, dubbed Dead.Letter, in version 4.99.3 this week, closing an unauthenticated remote code execution flaw scored CVSS 9.8 in the world's most-deployed mail transfer agent. The bug is a use-after-free in BDAT message body parsing when Exim is built with the GnuTLS backend and advertises STARTTLS and CHUNKING. XBOW, which discovered the vulnerability, used the seven-day disclosure window for a parallel experiment: how far could autonomous LLM tooling get on exploit development in the same seven days its human researchers had?
The trigger is precise. A client sends a TLS close_notify alert before the BDAT body transfer is complete, then sends a final byte in cleartext on the same TCP connection. Exim's BDAT path writes that final byte into a buffer that the TLS teardown has already freed. On GnuTLS builds, that buffer's freed-and-reallocated slot reliably overlaps with structures the worker uses for the next command — turning a heap-corruption primitive into code execution. Exim 4.97 through 4.99.2 are affected when STARTTLS and CHUNKING are both advertised, which is the default for nearly all production configurations.
The Bug Itself
Exim's BDAT command lets clients transmit message bodies in pre-declared chunks rather than terminating the message with a CRLF.CRLF sequence. Inside a TLS session, Exim reads the body through GnuTLS's record layer. When the client signals close_notify, GnuTLS tears down its session state and returns the TLS buffers to the pool. The bug is that Exim's BDAT state machine doesn't synchronize with the TLS teardown, so it still holds a pointer into one of the freed buffers. Sending a final byte in cleartext on the same TCP socket causes Exim to write that byte through the stale pointer.
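To make the chunking mechanics concrete, here is a minimal Python sketch of how a receiver can consume BDAT commands as RFC 3030 defines them: `BDAT <size> [LAST]`, followed by exactly `<size>` bytes of body. This is not Exim's code and it models no TLS layer at all. The names `parse_bdat` and `read_chunks` are hypothetical, and the per-chunk server replies are left out.

```python
import re
from typing import NamedTuple


class BdatCommand(NamedTuple):
    size: int   # number of body bytes the client promises to send next
    last: bool  # True when the LAST marker ends the message


# RFC 3030 syntax: "BDAT" SP chunk-size [SP "LAST"] CRLF (case-insensitive).
_BDAT_RE = re.compile(rb"BDAT (\d+)( LAST)?\r\n", re.IGNORECASE)


def parse_bdat(line: bytes) -> BdatCommand:
    """Parse one BDAT command line, including its trailing CRLF."""
    m = _BDAT_RE.fullmatch(line)
    if m is None:
        raise ValueError(f"not a BDAT command: {line!r}")
    return BdatCommand(size=int(m.group(1)), last=m.group(2) is not None)


def read_chunks(stream: bytes) -> bytes:
    """Reassemble a message body from a stream of BDAT commands and chunks."""
    body = bytearray()
    pos = 0
    while True:
        # The command line ends at the first CRLF after the current position.
        eol = stream.index(b"\r\n", pos) + 2
        cmd = parse_bdat(stream[pos:eol])
        chunk = stream[eol:eol + cmd.size]
        if len(chunk) != cmd.size:
            raise ValueError("stream ended before the declared chunk size arrived")
        body += chunk
        pos = eol + cmd.size
        if cmd.last:
            return bytes(body)


print(read_chunks(b"BDAT 5\r\nhelloBDAT 6 LAST\r\n world"))  # b'hello world'
```

Because the length is declared up front, the receiver never scans the body for a terminator, so a body that happens to contain CRLF.CRLF passes through untouched. It also means the receiver keeps state about how many bytes it still expects. The bug as described arises when state of that kind outlives the TLS buffers it refers to.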
On OpenSSL builds the same sequence produces an error and a clean abort because OpenSSL's session-state machine rejects post-close_notify cleartext at a lower layer. GnuTLS doesn't, and the difference is enough to flip the bug from inert to exploitable. Both libraries are RFC-compliant — they just disagree about how strict to be after a close alert.
The Human Side of the Race
XBOW gave its human research team the same starting position as its automated pipeline: the patch commit, a reachable test environment, and seven days. The humans reached a reliable RCE primitive on day three and had a working exploit against a default Debian Exim build on day five. Their write-up emphasizes the parts that were genuinely hard: figuring out the GnuTLS-specific allocator behavior that determines reuse layout, and stabilizing the timing of the cleartext byte relative to the close_notify across realistic network conditions.
Neither finding is something a static analyzer would flag, and neither would emerge from naive fuzzing of the BDAT parser in isolation — both required reasoning across two libraries' state machines. XBOW's framing is that this is the upper bound of what current vulnerability research demands: not just the bug, but the inter-library behavior that determines whether it's a denial of service or a code-execution primitive.
The Machine Side of the Race
XBOW's autonomous pipeline progressed further than its prior public benchmarks but didn't finish the race. By day five, the system had identified the use-after-free, reached a crash, and proposed multiple exploitation strategies — but had not produced a working remote code execution exploit. The system's transcripts show it correctly diagnosed the GnuTLS-vs-OpenSSL split but spent significant time on dead ends related to allocator behavior on non-default builds. That mirrors what happened with the NGINX Rift CVE earlier this month — autonomous tooling found the 18-year-old bug in six hours but the human-expert step was still required to weaponize it.
XBOW's reading of the result is two-sided. The pipeline successfully replaced the easy phases of exploit development: initial triage, primitive identification, and exploitation strategy enumeration. It did not replace the integration step, where a researcher synthesizes facts from multiple subsystems and makes a judgment call about which path is workable. That's consistent with what other autonomous-research teams have reported recently (see prior CyberSignal coverage of the OpenAI TanStack disclosure and the German researcher's AI-assisted bug-hunting tooling). Autonomous systems are excellent at the parts of vulnerability research that scale linearly with compute, and weak at the parts that require crossing abstraction boundaries.
The CyberSignal Analysis
Signal 01: The Configuration Triple Is the Whole Vulnerability
Dead.Letter requires three configuration choices simultaneously: GnuTLS, STARTTLS advertised, CHUNKING advertised. That sounds like a narrow surface, but for production Exim it's the default — Debian, Ubuntu, and most BSD packages ship GnuTLS by default, and STARTTLS plus CHUNKING are the configurations that make Exim work as a modern mail server. The lesson for asset inventories is that vulnerability scope expressed as "configurations X, Y, and Z" can mean "95% of deployments" or "5%" depending on which way the package defaults lean. Treating a triple-conditional CVE as low-priority because each condition reduces the population is exactly backwards when the conditions are all defaults.
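The arithmetic behind that point is easy to sketch. The snippet below multiplies per-condition adoption rates to estimate the share of deployments that meet all three conditions. Every rate in it is an invented illustration, not measured data, and the calculation assumes the conditions are independent, which real package defaults are not. The function name `exposed_fraction` is hypothetical.

```python
from math import prod


def exposed_fraction(condition_rates):
    """Share of deployments meeting every condition, assuming independence."""
    return prod(condition_rates)


# Hypothetical rates for (GnuTLS build, STARTTLS advertised, CHUNKING advertised).
when_defaults_are_on = exposed_fraction([0.98, 0.98, 0.99])
when_defaults_are_off = exposed_fraction([0.35, 0.40, 0.35])

print(f"{when_defaults_are_on:.1%}")   # 95.1%
print(f"{when_defaults_are_off:.1%}")  # 4.9%
```

The same three-way conjunction lands near 95% or near 5% depending only on which way each default leans. That is why the scope of a multi-condition CVE has to be read against what packages actually ship.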
Signal 02: Autonomous Exploit Research Is Real, Bounded, and Asymmetric
XBOW's experiment is the first public head-to-head where humans and an autonomous system worked the same vulnerability on the same clock. The result — humans win, machines get close — is less interesting than the shape of the gap. Autonomous tooling matched humans on phases that look like search and pattern-matching. It fell short on the integration phase, where the researcher has to hold facts from two different software components in mind at once and reason about their interaction. That asymmetry has practical consequences for defenders, and it's exactly the dynamic Microsoft's MDASH and Palo Alto's Mythos demonstrated this month — autonomous discovery at scale, but human-led integration to convert findings into action. The phases AI is already good at are the ones that previously gated exploit development for low-resource attackers. The phases it isn't good at yet are the ones that gated nation-state-grade work. The compression is happening at the bottom of the skill curve, not the top.
What to Do This Week
- Identify every Exim mail server in your environment running 4.97 through 4.99.2. Mail servers are notoriously under-inventoried — many sit on legacy hosts no longer claimed by an active owner.
- Patch to Exim 4.99.3 on every affected host. If your distro package isn't yet available, the upstream source release is — and the regression risk on a point release is low.
- Confirm your build's TLS backend. GnuTLS builds are exploitable; OpenSSL builds are not. "exim -bV" output includes the TLS library.
- If you can't patch immediately on a GnuTLS build, disable CHUNKING advertisement (set chunking_advertise_hosts to an empty list). STARTTLS can stay enabled; withdrawing both advertisements isn't necessary.
- Add a detection signature for cleartext bytes arriving on a TCP socket after a TLS close_notify on port 25 or 587. The pattern is specific enough that false positives should be rare.
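The inventory and backend checks above can be scripted once "exim -bV" output has been collected from each host. The following is a rough Python sketch, not an official tool. It assumes the output begins with a line of the form "Exim version X.Y.Z" and names the TLS library as a token on a "Support for:" line, so verify that against your own hosts before relying on it. The function name `triage_exim_bv` and the `Triage` fields are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Affected range as stated in the advisory coverage: 4.97 through 4.99.2.
AFFECTED_MIN = (4, 97, 0)
AFFECTED_MAX = (4, 99, 2)


class Triage(NamedTuple):
    version: Optional[Tuple[int, int, int]]
    tls_backend: Optional[str]   # "GnuTLS", "OpenSSL", or None if not found
    in_affected_range: bool
    likely_exploitable: bool


def triage_exim_bv(output: str) -> Triage:
    """Classify one host from the text printed by `exim -bV`."""
    version = None
    m = re.search(r"Exim version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if m:
        # A missing third component (e.g. "4.97") is treated as .0.
        version = (int(m.group(1)), int(m.group(2)), int(m.group(3) or 0))

    backend = None
    for line in output.splitlines():
        if line.startswith("Support for:"):
            tokens = line.split()
            if "GnuTLS" in tokens:
                backend = "GnuTLS"
            elif "OpenSSL" in tokens:
                backend = "OpenSSL"

    in_range = version is not None and AFFECTED_MIN <= version <= AFFECTED_MAX
    # Conservative choice: an unidentified backend is treated like GnuTLS.
    likely = in_range and backend != "OpenSSL"
    return Triage(version, backend, in_range, likely)


sample = "Exim version 4.98 #2 built 01-Jan-2026\nSupport for: IPv6 GnuTLS DKIM\n"
print(triage_exim_bv(sample).likely_exploitable)  # True
```

Treating an unknown backend as exploitable is a deliberate bias toward false positives, since a host whose output fails to parse is one you want a person to look at. An OpenSSL host in the affected range still reports `in_affected_range` as true, so it stays on the patch list.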