CyberEvolver: Structured Self-Evolution for
Cybersecurity Agents On the Fly

A self-evolving agent that rewrites its own four-layer scaffold from failed rollouts. On CTF, penetration testing, and CVE exploitation, it solves targets that fixed-scaffold sampling cannot reach — even with 16× the budget.

Yihe Fan¹ Changyi Li¹ Lichen Xu¹ Xudong Pan¹,² Jiarun Dai¹ Hong Geng¹ Min Yang¹,³

¹Fudan University · ²Shanghai Innovation Institute · ³Shanghai Pudong Research Institute of Cryptology

arXiv: 2605.26195

CyberEvolver consistently improves over the seed agent and outperforms self-improving baselines; on a 488-point blind SQL-injection challenge it succeeds where the seed agent stalls. — **Figure 1. CyberEvolver self-evolves beyond the ceiling of repeated sampling.** *Left:* averaged across four backbones × four benchmark splits (NYU-CTF, AutoPenBench, CVEBench Zero-Day, CVEBench One-Day; 16 cells in all), CyberEvolver keeps improving generation after generation, while the unchanged seed agent's pass@k saturates after k = 4. *Right:* on a 488-pt blind SQL-injection challenge solved by only 4.1 % of 1,096 competing teams, the seed agent exhausts its budget on a dead-end cookie-forgery attempt. By generation 3 the evolved agent has identified the encoded-cookie oracle, runs binary-search-guided blind injection against it, and solves the challenge in 18 steps.

Abstract

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts.

Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often deliberately obscured by the environment, and low-diversity updates cause errors to compound across iterations. CyberEvolver addresses these three challenges with (i) a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, (ii) a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and (iii) a population-based beam search that preserves diverse agent variants during evolution.

Across CTF challenges (NYU-CTF), penetration-testing scenarios (AutoPenBench), and real-world vulnerability exploitation (CVEBench), with four frontier open-source backbones, CyberEvolver raises the solve rate by 13.6 percentage points on average over seed-agent pass@16, beats the strongest human-designed cyber agent in every model×benchmark cell by 14.0 points on average, and outperforms two self-improvement methods adapted from other domains, all at 17.5 % lower average token cost than seed-agent pass@16.

+13.6%: avg. solve-rate gain over seed pass@16
−17.5%: avg. total tokens vs. seed pass@16
16 / 16: model×benchmark cells where CyberEvolver is best
4 × 4: open-source backbones × benchmark splits

Why Cybersecurity Self-Evolution is Hard

We argue cybersecurity is in fact well-suited to on-policy self-evolution — every target has a clean executable verifier, every solved instance is independently valuable, and targets are deeply heterogeneous (so a fixed scaffold cannot win). Yet existing self-evolving agents transfer poorly to this setting. They break along three axes:

1 Mutation Space

Arbitrary scaffold rewriting (DGM, HGM) collapses into trivial tool-wrapper edits — for HGM, 72 % of generated variants. Unstructured text summaries (ACE, ReasoningBank) cannot preserve executable artifacts like exploit scripts and payloads.

2 Mutation Signal

Existing methods assume precise failure signals such as test suites. Cybersecurity environments are adversarial and deliberately obscure feedback — a failed exploit may return only a connection reset; a hardened service may respond with silence.

3 Mutation Diversity

Single-trajectory updaters accumulate edits along one path with no mechanism to discard counterproductive changes — errors compound, and the agent gets trapped in local optima.

CyberEvolver in Three Pieces

Overview of CyberEvolver — closed-loop execution, diagnosis, and layer-wise mutation. — **Figure 2. Overview.** An evolvable agent A = (L_S, L_I, L_D, L_P) attempts the target, producing a trajectory τ. The trajectory is summarized into a compact record z, diagnosed into a structured failure analysis d with progress score s, and used to select promising agents from the population. Selected agents are refined by diagnosis-guided, layer-attributed mutations into child variants, which are then rolled out — closing the loop.

1. Four evolvable layers replace monolithic rewriting

We decompose an LLM cyber agent along the natural boundaries of its context window. Each layer carries distinct failure modes, so mutations are local rather than monolithic.

Strategy L_S

System prompt. Hypothesis formation, validation discipline, and multi-step planning. Example failure: attempting blind exploitation without reconnaissance.

Env. Interface L_I

Instance prompt. Reliable shell patterns and I/O idioms. Example failure: double-quoted reverse-shell payloads that trigger premature variable expansion.

Perception L_P

Observation layer. Transforms raw output, filters context, and injects runtime feedback. Example failure: not stripping ANSI escapes, which floods the context with unparseable bytes.

Domain Knowledge L_D

Skill library. Tactical playbooks loaded on demand. Example failure: knowing %x for stack leaks but not %hhn for byte-granularity writes.
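To make the four-layer decomposition concrete, here is a minimal sketch of how such an agent could be represented, with the perception layer implemented as an ANSI-stripping filter. The class name `EvolvableAgent`, its field names, and the example prompt strings are illustrative assumptions, not the paper's actual data structures.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict

# Matches ANSI CSI sequences such as "\x1b[31m" (color) or "\x1b[2K" (erase line).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(raw: str) -> str:
    """Perception filter: drop ANSI CSI escape sequences from tool output."""
    return ANSI_CSI.sub("", raw)


@dataclass
class EvolvableAgent:
    """Hypothetical container for the four layers (all names are assumptions)."""
    strategy: str                                             # L_S: system prompt
    interface: str                                            # L_I: instance prompt
    knowledge: Dict[str, str] = field(default_factory=dict)   # L_D: skill library
    perceive: Callable[[str], str] = strip_ansi               # L_P: observation filter


agent = EvolvableAgent(
    strategy="Form and validate hypotheses before exploiting.",
    interface="Prefer single-quoted shell payloads.",
)
observation = agent.perceive("\x1b[1;32mopen\x1b[0m 22/tcp ssh")
print(observation)  # open 22/tcp ssh
```

Keeping each layer in its own field means a mutation can replace one of them, for example swapping `perceive` for a stricter filter, without touching the other three.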

Per-cell layer activation: every layer is touched in every model–benchmark cell, with varying composition. — **Figure 2b. No layer is dormant; composition varies markedly across cells.** Per-cell activation frequency of each evolvable layer. L_D (domain knowledge) and L_I (environment interface) dominate overall, but L_S and L_P are still activated in every cell, with markedly different compositions across backbones and benchmarks — confirming that no single layer subsumes the others and that target-conditioned, layer-attributed mutations are necessary.

2. Trajectory diagnosis recovers signal from adversarial silence

Raw rollouts are compressed with windowed summarization (10-step windows, causal chaining), selective verbatim retention (actions, addresses, banners), and placeholder back-filling (raw observations re-injected for hexdumps and stack traces). A diagnostic model then produces a structured report with ranked weaknesses, root causes, counterfactuals, and a progress score s used to compare siblings within a generation.
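The windowing and causal-chaining steps can be sketched as follows. This is a toy illustration under stated assumptions: `toy_summarizer` stands in for the LLM summarization call, the `VERBATIM` pattern (hex and IPv4 addresses) is a guess at what "selective verbatim retention" might keep, and placeholder back-filling is omitted. Only the 10-step window size comes from the text.

```python
import re
from typing import Callable, List

WINDOW = 10  # window size stated in the text

# Assumed pattern for tokens that must survive verbatim: hex and IPv4 addresses.
VERBATIM = re.compile(r"0x[0-9a-fA-F]+|\b\d{1,3}(?:\.\d{1,3}){3}\b")


def windows(steps: List[str], size: int = WINDOW) -> List[List[str]]:
    """Split a trajectory into consecutive fixed-size windows."""
    return [steps[i:i + size] for i in range(0, len(steps), size)]


def toy_summarizer(prev_summary: str, chunk: List[str]) -> str:
    """Stand-in for an LLM call: keeps the last step plus verbatim tokens."""
    kept = sorted({m for step in chunk for m in VERBATIM.findall(step)})
    return f"[prev: {len(prev_summary)} chars] last={chunk[-1]!r} keep={kept}"


def summarize(
    steps: List[str],
    summarizer: Callable[[str, List[str]], str] = toy_summarizer,
) -> List[str]:
    """Causal chaining: each window's summary is conditioned on the previous one."""
    summary, record = "", []
    for chunk in windows(steps):
        summary = summarizer(summary, chunk)
        record.append(summary)
    return record


trajectory = [f"step {i}: probe 10.0.0.{i} leak 0x{i:04x}" for i in range(23)]
record = summarize(trajectory)
print(len(record))  # 3
```

A 23-step rollout yields windows of 10, 10, and 3 steps, so the compact record has three entries, each carrying forward context from the one before it.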

3. Beam search over agent variants preserves exploration

At every generation, the population P_t is pruned to the top-k by progress score; each survivor spawns m child variants via diagnosis-guided mutation over one or more layers. Underperforming branches die out automatically; error propagation along a single path is avoided.
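One generation of this prune-then-expand step might look like the sketch below. The `Variant` type, the `toy_mutate` function (a random stand-in for diagnosis-guided mutation), and the choice to carry only children into the next generation are assumptions made for illustration; the text does not specify whether parents persist.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

RNG = random.Random(0)  # fixed seed so the toy run is repeatable
LAYERS = ["L_S", "L_I", "L_D", "L_P"]


@dataclass(frozen=True)
class Variant:
    edits: Tuple[str, ...]  # layers edited so far along this branch
    score: float            # progress score s from diagnosis (toy value here)


def beam_step(
    population: List[Variant],
    k: int,
    m: int,
    mutate: Callable[[Variant], Variant],
) -> List[Variant]:
    """Keep the top-k variants by score, then spawn m children from each."""
    survivors = sorted(population, key=lambda v: v.score, reverse=True)[:k]
    return [mutate(parent) for parent in survivors for _ in range(m)]


def toy_mutate(parent: Variant) -> Variant:
    """Stand-in for diagnosis-guided mutation: edit one randomly chosen layer."""
    layer = RNG.choice(LAYERS)
    return Variant(parent.edits + (layer,), parent.score + RNG.uniform(-0.1, 0.3))


population = [Variant((), 0.0)]
for _ in range(3):
    population = beam_step(population, k=2, m=2, mutate=toy_mutate)
print(len(population))  # 4
```

Because pruning happens before expansion, a branch whose edits lowered the score simply fails to make the top-k and contributes no descendants, which is how counterproductive changes get discarded.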

Results

Headline numbers

Solve rate (%) on three cybersecurity benchmarks (four splits), averaged across four frontier open-source backbones (Kimi-K2.5, MiniMax-M2.5, DeepSeek-V3.1, Qwen3-235B-A35B-Instruct-2507). Bold marks the best result per benchmark; values in parentheses are CyberEvolver's gain over seed pass@16.

| Benchmark | Seed pass@1 | Seed pass@4 | Seed pass@16 | ACE (16) | CyberEvolver | Expert (single) | Expert (multi) |
|---|---|---|---|---|---|---|---|
| NYU-CTF | 19.3 | 24.9 | 25.7 | 25.2 | **38.1** (+12.4) | 23.6 | 29.3 |
| AutoPenBench | 32.4 | 43.7 | 44.7 | 38.7 | **65.9** (+21.2) | 33.4 | 29.5 |
| CVEBench Zero-Day | 14.9 | 16.6 | 16.9 | 15.6 | **30.6** (+13.7) | 18.1 | 23.8 |
| CVEBench One-Day | 18.9 | 27.0 | 30.6 | 25.0 | **37.5** (+6.9) | 21.9 | 28.1 |

Per-backbone, per-benchmark solve rate: CyberEvolver (rightmost bar per cluster) outperforms ACE variants on all 16 cells. — **Figure 3. Per-model, per-benchmark breakdown.** Solve rate of CyberEvolver (rightmost bar in each cluster) vs. both ACE variants across all 4 backbones × 4 benchmarks = 16 cells. CyberEvolver is best in every cell, confirming that structured scaffold evolution with trajectory diagnosis is more effective than shared-playbook refinement.

Key findings

(i) Beyond the sampling ceiling. Seed-agent pass@k saturates beyond k = 4, gaining only 1.4 percentage points from k = 4 to k = 16. CyberEvolver instead keeps improving and surpasses seed pass@16 by 13.6 points on average, solving targets that lie beyond the unchanged scaffold's capability boundary.
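Both averages in finding (i) can be checked against the backbone-averaged rows of the headline table. The short script below copies those table values and uses exact decimal arithmetic; the variable names are ours.

```python
from decimal import Decimal as D

# Rows copied from the headline table: (seed pass@4, seed pass@16, CyberEvolver).
table = {
    "NYU-CTF":           (D("24.9"), D("25.7"), D("38.1")),
    "AutoPenBench":      (D("43.7"), D("44.7"), D("65.9")),
    "CVEBench Zero-Day": (D("16.6"), D("16.9"), D("30.6")),
    "CVEBench One-Day":  (D("27.0"), D("30.6"), D("37.5")),
}


def mean(values):
    """Arithmetic mean of an iterable of Decimals."""
    values = list(values)
    return sum(values) / len(values)


# Gain from extra sampling alone: seed pass@16 minus seed pass@4.
sampling_gain = mean(p16 - p4 for p4, p16, _ in table.values())
# Gain from evolution: CyberEvolver minus seed pass@16.
evolution_gain = mean(ce - p16 for _, p16, ce in table.values())

print(sampling_gain, evolution_gain)
```

The means come out to 1.425 and 13.55 percentage points, which agree with the reported +1.4 and +13.6 up to rounding of the table entries.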

(ii) Cheaper than independent retries. Across the four backbones, CyberEvolver uses 17.5 % fewer total tokens than seed pass@16 on average. Successful trajectories terminate the budget early — and targeted layer-wise mutations succeed earlier than independent retries of an unchanged scaffold.

(iii) Beats generic self-improvers. ACE's shared-playbook refinement transfers poorly across heterogeneous cyber targets and regresses below seed pass@16 in 10 of 16 (model×benchmark) cells. HGM, designed for coding agents, collapses 72 % of its generated variants into tool-wrapper edits — on CVEBench Zero-Day with Kimi-K2.5 it solves only 10/40 targets across 640 evaluations, while CyberEvolver reaches 15/40 within a 16-node budget.

(iv) Beats six human-designed cyber agents. Across NYU-CTF (NYUCTFAgent, DCipher), AutoPenBench (AutoPenBench-Agent, VulnBot), and CVEBench (CyAgent, T-Agent), CyberEvolver beats the strongest human-designed expert in every (model, benchmark) cell by +14.0 % on average, with peak per-benchmark gains of +12.5, +36.3, +10.0, and +12.5 % on NYU-CTF, AutoPenBench, CVEBench Zero-Day, and CVEBench One-Day respectively.

HGM saturation curve on CVEBench Zero-Day: cumulative solve rate vs evaluation budget plateaus at 25% by 640 evaluations. — **Figure 4. Why generic self-evolution stalls in cybersecurity.** HGM on CVEBench Zero-Day with Kimi-K2.5. *Left:* per-target outcomes across all 640 rollouts — 10/40 targets are ever solved while 30 sit at exactly zero, with no intermediate regime. *Right:* cumulative solve rate by rollout index — saturates at 25 %, while CyberEvolver reaches 37.5 % with just 16 nodes. Diagnosis-guided layer-wise mutation beats raw mutate-and-score when execution feedback is adversarially obscured.

Sibling edit-distance distributions across backbones and benchmarks: distances concentrate in the mid-range, indicating diverse but related mutations. — **Figure 5. Mutations are diverse, not duplicate.** Distribution of identifier-level TF–IDF cosine distance between one-step sibling parent-to-child diffs. Distances concentrate in the mid-range across all model×benchmark cells — siblings share scaffold context but differ in files, identifiers, or edited regions.
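As a rough illustration of the metric in Figure 5, the sketch below computes TF-IDF cosine distance over the identifiers that appear in each diff. The tokenizer regex, the smoothed IDF formula, and the three example diffs are assumptions for demonstration; the paper's exact preprocessing may differ.

```python
import math
import re
from collections import Counter
from typing import Dict, List

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # assumed identifier tokenizer


def tfidf_vectors(diffs: List[str]) -> List[Dict[str, float]]:
    """Term frequency times smoothed inverse document frequency, per diff."""
    bags = [Counter(IDENT.findall(d)) for d in diffs]
    doc_freq = Counter(tok for bag in bags for tok in bag)
    n = len(bags)
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in bag.items()} for bag in bags]


def cosine_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """1 - cosine similarity; defined as 1.0 when either vector is empty."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0


# Hypothetical sibling diffs: two related edits and one unrelated edit.
diffs = [
    "+ def decode_cookie(raw): return base64_decode(raw)",
    "+ def decode_cookie(raw): return hex_decode(raw)",
    "+ skill blind_sqli binary_search oracle",
]
vec_a, vec_b, vec_c = tfidf_vectors(diffs)
related = cosine_distance(vec_a, vec_b)
unrelated = cosine_distance(vec_a, vec_c)
print(related < unrelated)  # True
```

Siblings that edit the same function with different helpers land strictly between 0 and 1, while diffs with no shared identifiers sit at exactly 1, which is the sense in which mid-range distances indicate "diverse but related" mutations.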

Case Study — how the cookie challenge is solved

Layer-attributed mutations are not abstract: on the 488-point blind-SQLi challenge shown in Figure 1, the mutations that finally cracked it were concrete and traceable. Across three generations the agent added (i) a perception filter in L_P that decoded the cookie oracle before exposing it to reasoning, (ii) a domain skill in L_D for boolean-blind SQLi via binary search, and (iii) a strategy rewrite in L_S that demoted forgery in favor of reconnaissance. The same per-target diagnosis-and-mutate loop produced different layer compositions on different targets — a sandbox-escape challenge needed a fresh L_I shell-quoting fix, a parallel-VM exploit needed an L_D ROP skill. Full per-generation action traces and search trees for 8 case studies are in the paper appendix.
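The core idea behind the binary-search skill in (ii) is that a yes/no oracle can recover a hidden character in a logarithmic number of queries. The sketch below demonstrates this against a purely local, simulated oracle; `make_oracle`, the printable-ASCII range, and the example secret are assumptions, and no network or database is involved.

```python
from typing import Callable, Tuple

Oracle = Callable[[int, int], bool]


def make_oracle(secret: str) -> Oracle:
    """Simulated boolean oracle answering: is ord(secret[pos]) > threshold?"""
    def oracle(pos: int, threshold: int) -> bool:
        return ord(secret[pos]) > threshold
    return oracle


def extract_char(oracle: Oracle, pos: int, lo: int = 32, hi: int = 126) -> Tuple[str, int]:
    """Binary-search one printable ASCII character; returns (char, queries used)."""
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(pos, mid):
            lo = mid + 1  # code point is above mid
        else:
            hi = mid      # code point is at most mid
    return chr(lo), queries


def extract(oracle: Oracle, length: int) -> Tuple[str, int]:
    """Recover `length` characters and count the total oracle queries."""
    chars, total = [], 0
    for pos in range(length):
        ch, used = extract_char(oracle, pos)
        chars.append(ch)
        total += used
    return "".join(chars), total


secret = "flag{demo}"
recovered, total_queries = extract(make_oracle(secret), len(secret))
print(recovered == secret, total_queries <= 7 * len(secret))  # True True
```

With 95 printable code points, each character needs at most 7 queries, compared with up to 94 for a linear scan, which is why a budget-limited agent benefits from having this pattern available as a skill.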

Scope & Limitations

Vulnerability discovery is out of scope. We focus on CTF, penetration testing, and known-CVE exploitation, where executable verifiers exist. Extending the loop to open-ended discovery is future work.
Backbone-dependent. Diagnosis quality scales with the capability of the diagnostic model; weaker backbones produce noisier progress scores. Empirically the loop still improves over seed pass@16 on every backbone we tried, but absolute gains track model capability.
Per-target evolution. CyberEvolver evolves on each target rather than learning a transferable policy across targets. Cross-target transfer is a natural next step that we leave open.

BibTeX

@misc{fan2026cyberevolver,
  title  = {CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly},
  author = {Fan, Yihe and Li, Changyi and Xu, Lichen and Pan, Xudong and Dai, Jiarun and Geng, Hong and Yang, Min},
  year   = {2026},
  eprint = {2605.26195},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR},
  doi    = {10.48550/arXiv.2605.26195},
  url    = {https://arxiv.org/abs/2605.26195}
}