CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

A self-evolving agent that rewrites its own four-layer scaffold from failed rollouts. On CTF, penetration testing, and CVE exploitation, it solves targets that fixed-scaffold sampling cannot reach — even with 16× the budget.

Yihe Fan¹   Changyi Li¹   Lichen Xu¹   Xudong Pan¹,²   Jiarun Dai¹   Hong Geng¹   Min Yang¹,³

¹Fudan University  ·  ²Shanghai Innovation Institute  ·  ³Shanghai Pudong Research Institute of Cryptology

CyberEvolver consistently improves over the seed agent and outperforms self-improving baselines; on a 488-point blind SQL-injection challenge it succeeds where the seed agent stalls.
Figure 1. CyberEvolver self-evolves beyond the ceiling of repeated sampling. Left: averaged across four backbones × four benchmark splits (NYU-CTF, AutoPenBench, CVEBench Zero-Day, CVEBench One-Day; 16 cells in all), CyberEvolver keeps improving generation after generation, while the unchanged seed agent's pass@k saturates after k = 4. Right: on a 488-pt blind SQL-injection challenge solved by only 4.1 % of 1,096 competing teams, the seed agent exhausts its budget on a dead-end cookie-forgery attempt. By generation 3 the evolved agent has identified the encoded-cookie oracle, runs binary-search-guided blind injection against it, and solves the challenge in 18 steps.

Abstract

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts.

Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often deliberately obscured by the environment, and low-diversity updates cause errors to compound across iterations. CyberEvolver addresses these three challenges with (i) a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, (ii) a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and (iii) a population-based beam search that preserves diverse agent variants during evolution.

Across CTF challenges (NYU-CTF), penetration-testing scenarios (AutoPenBench), and real-world vulnerability exploitation (CVEBench), with four frontier open-source backbones, CyberEvolver raises the solve rate by 13.6 percentage points on average over the seed agent's pass@16, beats the strongest human-designed cyber agent in every model×benchmark cell by 14.0 points on average, and outperforms two self-improvement methods adapted from other domains, all at 17.5 % lower average token cost than seed-agent pass@16.

+13.6%  ·  avg. solve-rate gain over seed pass@16
−17.5%  ·  avg. total tokens vs. seed pass@16
16 / 16  ·  model×benchmark cells where CyberEvolver is best
4 × 4  ·  open-source backbones × benchmark splits

Why Cybersecurity Self-Evolution is Hard

We argue cybersecurity is in fact well-suited to on-policy self-evolution — every target has a clean executable verifier, every solved instance is independently valuable, and targets are deeply heterogeneous (so a fixed scaffold cannot win). Yet existing self-evolving agents transfer poorly to this setting. They break along three axes:

1. Mutation Space

Arbitrary scaffold rewriting (DGM, HGM) collapses into trivial tool-wrapper edits — for HGM, 72 % of generated variants. Unstructured text summaries (ACE, ReasoningBank) cannot preserve executable artifacts like exploit scripts and payloads.

2. Mutation Signal

Existing methods assume precise failure signals such as test suites. Cybersecurity environments are adversarial and deliberately obscure feedback — a failed exploit may return only a connection reset; a hardened service may respond with silence.

3. Mutation Diversity

Single-trajectory updaters accumulate edits along one path with no mechanism to discard counterproductive changes — errors compound, and the agent gets trapped in local optima.

CyberEvolver in Three Pieces

Overview of CyberEvolver — closed-loop execution, diagnosis, and layer-wise mutation.
Figure 2. Overview. An evolvable agent A = (L_S, L_I, L_D, L_P) attempts the target, producing a trajectory τ. The trajectory is summarized into a compact record z, diagnosed into a structured failure analysis d with progress score s, and used to select promising agents from the population. Selected agents are refined by diagnosis-guided, layer-attributed mutations into child variants, which are then rolled out, closing the loop.

1. Four evolvable layers replace monolithic rewriting

We decompose an LLM cyber agent along the natural boundaries of its context window. Each layer carries distinct failure modes, so mutations are local rather than monolithic.

Strategy (L_S)
System prompt. Hypothesis formation, validation discipline, multi-step planning. Example failure: attempting blind exploitation without reconnaissance.
Env. Interface (L_I)
Instance prompt. Reliable shell patterns and I/O idioms. Example failure: double-quoted reverse-shell payloads causing premature variable expansion.
Perception (L_P)
Observation layer. Transforms raw output, filters context, injects runtime feedback. Example failure: not stripping ANSI escapes, which flood the context with unparseable bytes.
Domain Knowledge (L_D)
Skill library. Tactical playbooks loaded on demand. Example failure: knowing %x for stack leaks but not %hhn for byte-granularity writes.
Per-cell layer activation: every layer is touched in every model–benchmark cell, with varying composition.
Figure 2b. No layer is dormant; composition varies markedly across cells. Per-cell activation frequency of each evolvable layer. L_D (domain knowledge) and L_I (environment interface) dominate overall, but L_S and L_P are still activated in every cell, with markedly different compositions across backbones and benchmarks. This confirms that no single layer subsumes the others and that target-conditioned, layer-attributed mutations are necessary.
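To make the four-layer decomposition concrete, here is a minimal sketch of how such an agent could be represented. The class name `EvolvableAgent`, its field names, and the `mutate_layer` helper are illustrative assumptions, not the paper's actual code; the ANSI-stripping function simply illustrates the kind of perception fix mentioned above.

```python
import re
from dataclasses import dataclass, field, replace
from typing import Callable, Dict

# Matches CSI-style ANSI escape sequences such as "\x1b[31m" (colour codes).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def identity(raw: str) -> str:
    """Default perception: pass raw tool output through unchanged."""
    return raw


def strip_ansi(raw: str) -> str:
    """Example perception filter: drop ANSI escapes before the model sees output."""
    return ANSI_CSI.sub("", raw)


@dataclass(frozen=True)
class EvolvableAgent:
    """Hypothetical container for the four layers (names are assumptions)."""
    strategy: str                                                    # L_S: system prompt
    env_interface: str                                               # L_I: instance prompt
    domain_knowledge: Dict[str, str] = field(default_factory=dict)   # L_D: skill name -> playbook
    perception: Callable[[str], str] = identity                      # L_P: observation transform

    def observe(self, raw_output: str) -> str:
        """Apply the perception layer to raw environment output."""
        return self.perception(raw_output)


def mutate_layer(agent: EvolvableAgent, layer: str, new_value) -> EvolvableAgent:
    """Return a child that differs from its parent in exactly one layer."""
    return replace(agent, **{layer: new_value})


seed = EvolvableAgent(strategy="Form and test hypotheses.", env_interface="Use /bin/sh.")
child = mutate_layer(seed, "perception", strip_ansi)
print(repr(child.observe("\x1b[31mroot\x1b[0m")))
```

Because each mutation replaces a single field, a child's difference from its parent can be attributed to one layer, which is the property the layered design is meant to provide.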

2. Trajectory diagnosis recovers signal from adversarial silence

Raw rollouts are compressed with windowed summarization (10-step windows, causal chaining), selective verbatim retention (actions, addresses, banners), and placeholder back-filling (raw observations re-injected for hexdumps and stack traces). A diagnostic model then produces a structured report with ranked weaknesses, root causes, counterfactuals, and a progress score s used to compare siblings within a generation.
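The windowed part of this compression can be sketched as follows. This is a simplified illustration under stated assumptions: `Step`, `compress`, and `toy_summarizer` are hypothetical names, the summarizer stands in for an LLM call, and placeholder back-filling is not modeled. It shows only the 10-step windowing, the causal chaining (each window's summary receives the previous summary), and verbatim retention of actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    action: str
    observation: str


# A summarizer takes the previous window's summary plus the current window.
Summarizer = Callable[[str, List[Step]], str]


def split_windows(steps: List[Step], size: int = 10) -> List[List[Step]]:
    """Cut the trajectory into consecutive windows of at most `size` steps."""
    return [steps[i:i + size] for i in range(0, len(steps), size)]


def compress(steps: List[Step], summarize: Summarizer, size: int = 10) -> List[Dict]:
    """Summarize window by window, feeding each summary into the next one."""
    record = []
    previous_summary = ""
    for window in split_windows(steps, size):
        summary = summarize(previous_summary, window)
        record.append({
            "summary": summary,
            "actions": [s.action for s in window],  # kept verbatim, not paraphrased
        })
        previous_summary = summary  # causal chaining
    return record


def toy_summarizer(previous: str, window: List[Step]) -> str:
    """Stand-in for a model call; only reports window size and chaining."""
    context = "no prior context" if not previous else "continues prior window"
    return f"{len(window)} steps ({context}); last action: {window[-1].action}"


trajectory = [Step(action=f"cmd{i}", observation=f"out{i}") for i in range(25)]
record = compress(trajectory, toy_summarizer)
for entry in record:
    print(entry["summary"])
```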

3. Beam search over agent variants preserves exploration

At every generation t, the population P_t is pruned to the top-k by progress score; each survivor spawns m child variants via diagnosis-guided mutation over one or more layers. Underperforming branches die out automatically, which avoids error propagation along a single path.
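The prune-and-spawn loop described above might look like the following sketch. The function and parameter names are assumptions, as is the choice that the next population consists only of children (the text does not say whether parents are retained). The toy rollout and mutation use an integer "skill level" in place of a real scaffold.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Variant:
    scaffold: Any
    score: float = 0.0


def beam_evolve(
    seed: Any,
    rollout: Callable[[Any], Tuple[float, bool]],  # returns (progress score, solved?)
    mutate: Callable[[Any, int], Any],             # (parent scaffold, child index) -> child
    k: int,
    m: int,
    generations: int,
) -> Optional[Variant]:
    """Beam search over agent variants; stops at the first solving variant."""
    population = [Variant(seed)]
    for generation in range(generations + 1):      # generation 0 is the seed
        for variant in population:
            variant.score, solved = rollout(variant.scaffold)
            if solved:
                return variant
        if generation == generations:
            break                                  # budget exhausted, no more children
        # Keep the top-k by progress score, then spawn m children per survivor.
        survivors = sorted(population, key=lambda v: v.score, reverse=True)[:k]
        population = [Variant(mutate(p.scaffold, i)) for p in survivors for i in range(m)]
    return None


TARGET_LEVEL = 3  # toy target: solved once the scaffold's level reaches 3


def toy_rollout(level: int) -> Tuple[float, bool]:
    return level / TARGET_LEVEL, level >= TARGET_LEVEL


def toy_mutate(level: int, child_index: int) -> int:
    return level + child_index  # child 0 is a no-op edit, child 1 improves by one


winner = beam_evolve(0, toy_rollout, toy_mutate, k=2, m=2, generations=3)
print(winner)
```

In the toy run, the no-op children score lower and are pruned, which mirrors how counterproductive branches drop out of the beam.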

Results

Headline numbers

Solve rate (%) on three cybersecurity benchmarks, averaged across four frontier open-source backbones (Kimi-K2.5, MiniMax-M2.5, DeepSeek-V3.1, Qwen3-235B-A35B-Instruct-2507). Bold = best per benchmark; CyberEvolver column highlighted.

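| Benchmark | Seed pass@1 | Seed pass@4 | Seed pass@16 | ACE (16) | CyberEvolver (gain vs. seed pass@16) | Expert (single) | Expert (multi) |
|---|---|---|---|---|---|---|---|
| NYU-CTF | 19.3 | 24.9 | 25.7 | 25.2 | **38.1** (+12.4) | 23.6 | 29.3 |
| AutoPenBench | 32.4 | 43.7 | 44.7 | 38.7 | **65.9** (+21.2) | 33.4 | 29.5 |
| CVEBench Zero-Day | 14.9 | 16.6 | 16.9 | 15.6 | **30.6** (+13.7) | 18.1 | 23.8 |
| CVEBench One-Day | 18.9 | 27.0 | 30.6 | 25.0 | **37.5** (+6.9) | 21.9 | 28.1 |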
Per-backbone, per-benchmark solve rate: CyberEvolver (rightmost bar per cluster) outperforms ACE variants on all 16 cells.
Figure 3. Per-model, per-benchmark breakdown. Solve rate of CyberEvolver (rightmost bar in each cluster) vs. both ACE variants across all 4 backbones × 4 benchmarks = 16 cells. CyberEvolver is best in every cell, confirming that structured scaffold evolution with trajectory diagnosis is more effective than shared-playbook refinement.

Key findings

(i) Beyond the sampling ceiling. Seed-agent pass@k saturates beyond k = 4, gaining only +1.4 % from k = 4 to k = 16. CyberEvolver instead continues improving and surpasses seed pass@16 by +13.6 % on average — solving targets that lie strictly beyond the unchanged scaffold's capability boundary.

(ii) Cheaper than independent retries. Across the four backbones, CyberEvolver uses 17.5 % fewer total tokens than seed pass@16 on average. Successful trajectories terminate the budget early — and targeted layer-wise mutations succeed earlier than independent retries of an unchanged scaffold.

(iii) Beats generic self-improvers. ACE's shared-playbook refinement transfers poorly across heterogeneous cyber targets and regresses below seed pass@16 in 10 of 16 (model×benchmark) cells. HGM, designed for coding agents, collapses 72 % of its generated variants into tool-wrapper edits — on CVEBench Zero-Day with Kimi-K2.5 it solves only 10/40 targets across 640 evaluations, while CyberEvolver reaches 15/40 within a 16-node budget.

(iv) Beats six human-designed cyber agents. Across NYU-CTF (NYUCTFAgent, DCipher), AutoPenBench (AutoPenBench-Agent, VulnBot), and CVEBench (CyAgent, T-Agent), CyberEvolver beats the strongest human-designed expert in every (model, benchmark) cell by +14.0 % on average, with peak per-benchmark gains of +12.5, +36.3, +10.0, and +12.5 % on NYU-CTF, AutoPenBench, CVEBench Zero-Day, and CVEBench One-Day respectively.
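As a sanity check on finding (i), the averages can be recomputed from the benchmark-level values in the headline table (seed pass@4, seed pass@16, and CyberEvolver). The snippet uses `Decimal` to avoid floating-point rounding noise. Since the table entries are already rounded, the recomputed figures (1.425 and 13.55) match the reported +1.4 and +13.6 only up to rounding.

```python
from decimal import Decimal

# (seed pass@4, seed pass@16, CyberEvolver) solve rates in %, from the headline table.
rows = {
    "NYU-CTF":           ("24.9", "25.7", "38.1"),
    "AutoPenBench":      ("43.7", "44.7", "65.9"),
    "CVEBench Zero-Day": ("16.6", "16.9", "30.6"),
    "CVEBench One-Day":  ("27.0", "30.6", "37.5"),
}

parsed = [tuple(Decimal(x) for x in values) for values in rows.values()]

# Average gain from more sampling alone (pass@4 -> pass@16).
sampling_gain = sum(p16 - p4 for p4, p16, _ in parsed) / len(parsed)

# Average gain of CyberEvolver over seed pass@16.
evolver_gain = sum(ce - p16 for _, p16, ce in parsed) / len(parsed)

print("pass@4 -> pass@16:", sampling_gain, "points")
print("CyberEvolver vs pass@16:", evolver_gain, "points")
```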

HGM saturation curve on CVEBench Zero-Day: cumulative solve rate vs evaluation budget plateaus at 25% by 640 evaluations.
Figure 4. Why generic self-evolution stalls in cybersecurity. HGM on CVEBench Zero-Day with Kimi-K2.5. Left: per-target outcomes across all 640 rollouts — 10/40 targets are ever solved while 30 sit at exactly zero, with no intermediate regime. Right: cumulative solve rate by rollout index — saturates at 25 %, while CyberEvolver reaches 37.5 % with just 16 nodes. Diagnosis-guided layer-wise mutation beats raw mutate-and-score when execution feedback is adversarially obscured.
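The cumulative curve in the right panel is a simple quantity: after each rollout, the fraction of targets solved at least once so far. A minimal sketch with made-up toy rollouts (not the paper's data) follows; the function name is an assumption. The last two lines only restate the caption's figures as fractions of 40 targets.

```python
from typing import Iterable, List, Tuple


def cumulative_solve_rate(rollouts: Iterable[Tuple[str, bool]], num_targets: int) -> List[float]:
    """Fraction of targets solved at least once, after each rollout in order."""
    solved = set()
    curve = []
    for target, success in rollouts:
        if success:
            solved.add(target)  # repeated solves of one target do not raise the curve
        curve.append(len(solved) / num_targets)
    return curve


# Toy data: four targets "a".."d", five rollouts.
toy_rollouts = [("a", False), ("a", True), ("b", False), ("a", True), ("c", True)]
curve = cumulative_solve_rate(toy_rollouts, num_targets=4)
print(curve)

print(10 / 40)  # 0.25, the HGM plateau stated in the caption
print(15 / 40)  # 0.375, the CyberEvolver solve rate stated in the caption
```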
Sibling edit-distance distributions across backbones and benchmarks: distances concentrate in the mid-range, indicating diverse but related mutations.
Figure 5. Mutations are diverse, not duplicate. Distribution of identifier-level TF–IDF cosine distance between one-step sibling parent-to-child diffs. Distances concentrate in the mid-range across all model×benchmark cells — siblings share scaffold context but differ in files, identifiers, or edited regions.
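For readers unfamiliar with the metric, the sketch below computes an identifier-level TF–IDF cosine distance between diffs using only the standard library. The tokenizer regex and the smoothed IDF formula are assumptions; the page does not specify the exact tokenization or weighting used for Figure 5, and the example diffs are invented.

```python
import math
import re
from collections import Counter
from typing import Dict, List

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tfidf_vectors(diffs: List[str]) -> List[Dict[str, float]]:
    """One sparse TF-IDF vector per diff, over identifier tokens."""
    counts = [Counter(IDENTIFIER.findall(d)) for d in diffs]
    n = len(diffs)
    doc_freq = Counter(token for c in counts for token in c)
    # Smoothed IDF (an assumed weighting): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine_distance(u: Dict[str, float], v: Dict[str, float]) -> float:
    """1 - cosine similarity; 1.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


sibling_diffs = [
    "+ def decode_cookie(raw): return base64_decode(raw)",
    "+ def decode_cookie(blob): return hex_decode(blob)",   # related edit
    "+ PAYLOAD = build_rop_chain(gadgets)",                  # unrelated edit
]
vecs = tfidf_vectors(sibling_diffs)
related = cosine_distance(vecs[0], vecs[1])
unrelated = cosine_distance(vecs[0], vecs[2])
print(round(related, 3), unrelated)
```

Siblings that edit the same function with different identifiers land between 0 and 1, which is the "mid-range" regime the caption describes.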

Case Study — how the cookie challenge is solved

Layer-attributed mutations are not abstract: on the 488-point blind-SQLi challenge shown in Figure 1, the mutations that finally cracked it were concrete and traceable. Across three generations the agent added (i) a perception filter in L_P that decoded the cookie oracle before exposing it to reasoning, (ii) a domain skill in L_D for boolean-blind SQLi via binary search, and (iii) a strategy rewrite in L_S that demoted forgery in favor of reconnaissance. The same per-target diagnosis-and-mutate loop produced different layer compositions on different targets: a sandbox-escape challenge needed a fresh L_I shell-quoting fix, and a parallel-VM exploit needed an L_D ROP skill. Full per-generation action traces and search trees for 8 case studies are in the paper appendix.
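The reason binary search helps with a boolean oracle is query count: each yes/no answer halves the candidate range, so one printable ASCII character needs at most 7 queries instead of up to 94 sequential guesses. The sketch below demonstrates only that search logic against a simulated in-memory oracle; all names are hypothetical, and it does not reproduce the agent's actual skill or interact with any real service.

```python
from typing import Callable, Dict, Tuple

# An oracle answers: "is the code point at `position` greater than `threshold`?"
Oracle = Callable[[int, int], bool]


def extract_char(is_greater: Oracle, position: int, low: int = 32, high: int = 126) -> str:
    """Binary-search one printable ASCII character using only yes/no answers."""
    while low < high:
        mid = (low + high) // 2
        if is_greater(position, mid):
            low = mid + 1      # answer is above mid
        else:
            high = mid         # answer is mid or below
    return chr(low)


def extract_string(is_greater: Oracle, length: int) -> str:
    """Recover a string of known length, one character at a time."""
    return "".join(extract_char(is_greater, i) for i in range(length))


def make_simulated_oracle(secret: str) -> Tuple[Oracle, Dict[str, int]]:
    """Local stand-in for a boolean side channel; also counts queries."""
    stats = {"queries": 0}

    def is_greater(position: int, threshold: int) -> bool:
        stats["queries"] += 1
        return ord(secret[position]) > threshold

    return is_greater, stats


secret = "flag{demo}"
oracle, stats = make_simulated_oracle(secret)
recovered = extract_string(oracle, len(secret))
print(recovered, stats["queries"])
```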

Scope & Limitations

BibTeX

@misc{fan2026cyberevolver,
  title  = {CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly},
  author = {Fan, Yihe and Li, Changyi and Xu, Lichen and Pan, Xudong and Dai, Jiarun and Geng, Hong and Yang, Min},
  year   = {2026},
  eprint = {2605.26195},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR},
  doi    = {10.48550/arXiv.2605.26195},
  url    = {https://arxiv.org/abs/2605.26195}
}