DeepSource Gym
We're building RL environments that train and evaluate frontier coding agents on real-world security vulnerabilities and score each attempt with a dense reward signal.
Each environment starts from a vulnerable repository checkout grounded in a real CVE. The agent gets a bounded prompt, investigates the code with normal project tools, and submits a patch. A deterministic verifier scores whether the patch closes the security issue, preserves legitimate behavior, and avoids reward hacking.
That score breaks down across four capabilities:
- Finding vulnerable behavior in unfamiliar source code.
- Fixing the vulnerability at the right security boundary.
- Validating the change against security regressions and normal behavior.
- Keeping the repair high quality, as measured by locality, minimality, and side-effect signals.
The result is useful both as an evaluation benchmark and as a training environment for agents that need to reason from root cause to maintainable repair.
Environments
The preview set contains six environments, three of which are listed below. New environments are scoped and delivered on demand.
- DNS rebinding (Rust HTTP transport): validate host-origin boundaries without breaking normal local clients.
- Upload authorization (Python workflow): bring a legacy upload path under the correct authorization and size-limit boundary while preserving the service contract.
- Fail-open authorization (Go pipeline): make authorization fail closed when request bodies exceed the size expected by the policy layer.
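To make the fail-open class concrete, here is a minimal sketch in Python. The actual environment is a Go pipeline; `MAX_POLICY_BODY`, both `authorize_*` functions, and the toy `deny_admin` policy are hypothetical names chosen for illustration and are not taken from the underlying project or CVE.

```python
from typing import Callable

# Hypothetical limit: the largest body the policy layer is able to inspect.
MAX_POLICY_BODY = 1024


def authorize_fail_open(body: bytes, policy: Callable[[bytes], bool]) -> bool:
    """Vulnerable pattern: oversized bodies bypass the policy and are allowed."""
    if len(body) > MAX_POLICY_BODY:
        return True  # fail open: the policy never sees this request
    return policy(body)


def authorize_fail_closed(body: bytes, policy: Callable[[bytes], bool]) -> bool:
    """Repaired pattern: anything the policy cannot evaluate is denied."""
    if len(body) > MAX_POLICY_BODY:
        return False  # fail closed: deny what cannot be checked
    return policy(body)


def deny_admin(body: bytes) -> bool:
    """Toy policy: reject any request that asks for the admin role."""
    return b"role=admin" not in body


# Padding pushes a forbidden request past the size the policy expects.
oversized = b"role=admin" + b"x" * MAX_POLICY_BODY

print(authorize_fail_open(oversized, deny_admin))       # True
print(authorize_fail_closed(oversized, deny_admin))     # False
print(authorize_fail_closed(b"role=user", deny_admin))  # True
```

The last call matters as much as the second: a repair that denies every request would also close the bug, but it would not preserve legitimate behavior.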
The suite is selected for work that current agents still find difficult: localizing vulnerable behavior in unfamiliar code, reasoning about security boundaries, preserving existing behavior, and avoiding shortcuts that satisfy a narrow check while leaving the bug intact.
Evaluation
Each rollout starts from a vulnerable checkout in an isolated environment. The model runs through mini-swe-agent, investigates the repository, and submits a patch. It does not receive the solution diff, patch location, verifier implementation, or hidden validation cases.
A run gets credit for fixing the security boundary while keeping the surrounding feature usable. We track partial repairs because they are common: a model may find the right file but patch too late, close the bug by blocking legitimate traffic, or fix the security path while changing the public contract. Those failures are different, and the scoring should preserve that difference.
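One way to keep those failure modes distinct is to record them as separate outcomes rather than a single failed flag. The sketch below is an assumption about how that could look, not the verifier's actual schema; `CheckResults`, `Outcome`, and `classify` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FULL_REPAIR = "security closed, feature usable, contract unchanged"
    UNFIXED = "vulnerable path still reachable (e.g. patched too late)"
    OVERBLOCKING = "security closed by blocking legitimate traffic"
    CONTRACT_CHANGE = "security closed but public contract changed"


@dataclass(frozen=True)
class CheckResults:
    security_closed: bool     # security regression tests pass
    legit_preserved: bool     # ordinary traffic still succeeds
    contract_unchanged: bool  # public interface behaves as before


def classify(r: CheckResults) -> Outcome:
    """Map three independent checks onto one labeled outcome."""
    if not r.security_closed:
        return Outcome.UNFIXED
    if not r.legit_preserved:
        return Outcome.OVERBLOCKING
    if not r.contract_unchanged:
        return Outcome.CONTRACT_CHANGE
    return Outcome.FULL_REPAIR


print(classify(CheckResults(True, False, True)).name)  # OVERBLOCKING
```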
| Run condition | Public setting |
|---|---|
| Disclosure | Blind, class, behavior, subsystem, or recipe |
| Starting point | Vulnerable repository checkout |
| Harness | mini-swe-agent |
| Task complexity | Medium-horizon agentic, tens to hundreds of tool turns |
| Submission | Patch against the working tree |
| Reward signal | Deterministic: security regression tests plus rule-based patch-quality scoring (no LLM judge) |
| Score | Normalized aggregate from 0.0 to 1.0 |
| Delivery format | TerminalBench / Harbor compatible |
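As a rough illustration of how a deterministic, normalized aggregate could be assembled from test results and rule-based quality signals, consider the sketch below. The weights, the field names, and the weighted-mean formula are all assumptions made for this example; the real verifier's weighting is not described here.

```python
from dataclasses import dataclass

# Hypothetical weights; they must sum to 1.0 so the aggregate stays in [0, 1].
WEIGHTS = {"security": 0.5, "behavior": 0.3, "quality": 0.2}


@dataclass(frozen=True)
class VerifierResult:
    security_passed: int   # security regression tests passed
    security_total: int
    behavior_passed: int   # legitimate-behavior tests passed
    behavior_total: int
    quality: float         # rule-based patch-quality signal in [0, 1]


def _ratio(passed: int, total: int) -> float:
    """Pass fraction, treating an empty test group as zero credit."""
    return passed / total if total else 0.0


def aggregate(r: VerifierResult) -> float:
    """Weighted mean of the three components, clamped to [0, 1]."""
    parts = {
        "security": _ratio(r.security_passed, r.security_total),
        "behavior": _ratio(r.behavior_passed, r.behavior_total),
        "quality": min(1.0, max(0.0, r.quality)),
    }
    score = sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
    return round(min(1.0, max(0.0, score)), 3)


print(aggregate(VerifierResult(4, 4, 9, 10, 0.5)))  # 0.87
```

Because every input is a test count or a rule output, the same patch always produces the same score, which is the property that removes the need for an LLM judge.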
Results
Scores below are from an internal sampling run: six preview environments with three samples per model, for 18 runs per model. Score is the mean oracle pass rate from 0.0 to 1.0. Full and partial are environment-level outcome rates; turns, wall time, and tokens are per-run process metrics.
| Model | Score | Full | Partial | Turns | Wall time | Tokens in |
|---|---|---|---|---|---|---|
| GPT-5.5 | 0.722 | 67% | 17% | 44 | 567s | 1.03M |
| Claude Opus 4.7 | 0.611 | 33% | 67% | 39 | 249s | 1.23M |
| GPT-5.3 Codex | 0.611 | 50% | 17% | 28 | 396s | 447k |
| Claude Haiku 4.5 | 0.556 | 50% | 17% | 60 | 235s | 1.96M |
| Gemini 3.1 Pro Preview | 0.556 | 50% | 17% | 38 | 248s | 766k |
| Claude Opus 4.6 | 0.500 | 33% | 33% | 41 | 241s | 893k |
| Kimi K2.6 | 0.444 | 17% | 67% | 46 | 1413s | 1.40M |
| GLM-5.1 | 0.444 | 17% | 50% | 53 | 380s | 1.74M |
| Gemini 3 Flash Preview | 0.389 | 33% | 17% | 33 | 156s | 778k |
| MiniMax M2.7 | 0.389 | 17% | 33% | 72 | 909s | 2.78M |
| DeepSeek V4 Pro | 0.333 | 0% | 50% | 76 | 1061s | 2.06M |
| GPT-5.4 Mini | 0.333 | 17% | 33% | 35 | 332s | 748k |
Full counts environments where all three samples passed; Partial counts environments where at least one sample passed but not all three. Three samples per environment is a preview-grade sampling rate: with 18 runs per model, a single run moves the score by about 0.056, so differences between adjacent scores are within sampling noise.
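The three headline columns follow directly from per-sample pass/fail results. The sketch below computes them from a hypothetical pass pattern (the environment names and the pattern itself are invented; per-environment results are not published in this table) and adds a rough binomial standard error as a guide to sampling noise. The error estimate assumes 18 independent runs, which is optimistic because samples from the same environment are correlated.

```python
from math import sqrt


def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Compute Score, Full, and Partial from per-environment sample outcomes."""
    runs = [passed for samples in results.values() for passed in samples]
    n_env = len(results)
    full = sum(all(samples) for samples in results.values())
    partial = sum(any(samples) and not all(samples) for samples in results.values())
    return {
        "score": sum(runs) / len(runs),  # mean pass rate over all runs
        "full": full / n_env,            # environments with every sample passing
        "partial": partial / n_env,      # environments with some, not all, passing
    }


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate p estimated from n independent runs."""
    return sqrt(p * (1 - p) / n)


# Hypothetical pattern: four environments fully solved, one solved once, one never.
pattern = {
    "env_1": [True, True, True],
    "env_2": [True, True, True],
    "env_3": [True, True, True],
    "env_4": [True, True, True],
    "env_5": [True, False, False],
    "env_6": [False, False, False],
}

summary = summarize(pattern)
print(f"{summary['score']:.3f}")    # 0.722
print(f"{summary['full']:.3f}")     # 0.667
print(f"{summary['partial']:.3f}")  # 0.167
print(f"{binomial_standard_error(summary['score'], 18):.3f}")  # 0.106
```

Under these assumptions a standard error of roughly 0.1 is about the size of the gap between neighboring rows, which is why the ranking should be read in bands rather than position by position.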
The suite is not saturated by frontier models, and it is not so hard that every run collapses to zero. Many attempts get partway to a repair; the environment keeps those near-misses visible instead of flattening them into a single pass/fail bit.
Disclosure
The same environment can be run at five disclosure levels. The blind tier gives only the task shape: there is a CVE in the repository; find it and patch it. The class tier adds the vulnerability class, advisory context, and reproducer. Higher tiers add the required behavior, the affected subsystem, or a near-spelled-out repair recipe.
This lets one environment support both evaluation and training without turning the public description into a solution recipe.
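A minimal sketch of tiered prompt construction is shown below. It assumes the tiers are cumulative, so each level includes everything disclosed at the levels beneath it; that ordering, the `Disclosure` enum, and the placeholder strings in `TIER_CONTEXT` are illustrative assumptions rather than the actual prompt templates.

```python
from enum import IntEnum


class Disclosure(IntEnum):
    BLIND = 0
    CLASS = 1
    BEHAVIOR = 2
    SUBSYSTEM = 3
    RECIPE = 4


# Placeholder text per tier; real prompts would carry task-specific details.
TIER_CONTEXT = {
    Disclosure.BLIND: "There is a CVE in this repository. Find it and patch it.",
    Disclosure.CLASS: "<vulnerability class, advisory context, reproducer>",
    Disclosure.BEHAVIOR: "<required behavior after the fix>",
    Disclosure.SUBSYSTEM: "<affected subsystem>",
    Disclosure.RECIPE: "<near-spelled-out repair recipe>",
}


def build_prompt(level: Disclosure) -> str:
    """Concatenate every tier's context up to and including `level`."""
    return "\n".join(TIER_CONTEXT[tier] for tier in Disclosure if tier <= level)


print(build_prompt(Disclosure.CLASS))
```

Keeping the tier as a run parameter means the repository checkout and verifier stay identical across levels, so score differences can be attributed to the amount of disclosure alone.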
Validation
Every environment passes a fixed gate before it counts. The vulnerable baseline must fail, the known repair must pass, ordinary behavior around the patched feature must still work, and a sabotage-detection pass rejects shortcut patches that simply remove functionality, skip tests, stub the vulnerable path, or rely on access to solution artifacts.
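The gate can be pictured as four independent conditions that must all hold. The sketch below is a simplified model of that check; `GateEvidence` and its field names are hypothetical, and each boolean stands in for a full test or analysis run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateEvidence:
    baseline_security_pass: bool     # security tests on the vulnerable checkout
    reference_security_pass: bool    # security tests on the known repair
    reference_behavior_pass: bool    # ordinary-behavior tests on the known repair
    sabotage_patches_rejected: bool  # every shortcut patch was rejected


def gate_failures(e: GateEvidence) -> list[str]:
    """Return the reasons an environment fails the gate (empty means it passes)."""
    failures = []
    if e.baseline_security_pass:
        failures.append("vulnerable baseline does not fail")
    if not e.reference_security_pass:
        failures.append("known repair does not pass")
    if not e.reference_behavior_pass:
        failures.append("ordinary behavior breaks under the known repair")
    if not e.sabotage_patches_rejected:
        failures.append("a shortcut patch was accepted")
    return failures


def passes_gate(e: GateEvidence) -> bool:
    return not gate_failures(e)


print(passes_gate(GateEvidence(False, True, True, True)))   # True
print(gate_failures(GateEvidence(True, True, True, False)))
```

Note the inverted first condition: a baseline whose security tests already pass means the tests do not actually detect the vulnerability, so the environment is rejected.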
For technical review we can share the evidence behind the public scores: task metadata, verifier traces, run logs, candidate patches, and the environment packages themselves.
Contact
Email gym@deepsource.com for pilot access.