DeepSource Gym
We're building RL environments to train frontier coding agents on real-world security vulnerabilities, scored by static analysis with a dense reward signal.
In our environments, models fix real vulnerabilities from CVEs in production codebases, often published within the past week. Our static analysis engine scores each attempt on several orthogonal dimensions covering code quality and the effectiveness of the fix.
These scores become reward signals during training: a denser instrument than a binary test-pass signal for evaluating frontier model performance on code security.
Prototype
We picked GHSA-xjvp-7243-rg9h, a path traversal vulnerability in a popular Go library published less than a week before we started, to create a fully functioning RL environment.
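The vulnerability class itself is easy to illustrate. Here is a minimal sketch of the bug pattern and its fix, written in Python rather than Go for brevity; the function names are ours, not from the affected library:

```python
import os

def resolve_vulnerable(base: str, user_path: str) -> str:
    # Vulnerable: "../" segments in user_path can escape base entirely.
    return os.path.join(base, user_path)

def resolve_fixed(base: str, user_path: str) -> str:
    # Fixed: normalize the joined path, then verify it still lives under base.
    base_real = os.path.realpath(base)
    candidate = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base_real, candidate]) != base_real:
        raise ValueError("path traversal attempt")
    return candidate
```

A gold patch for this class of bug is often little more than the containment check in the second function; the regression tests then assert that `../`-style inputs raise instead of resolving.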
The environment includes a containerized pre-state that reproduces the security vulnerability, a gold patch that demonstrates the correct fix, a Globstar-based semantic checker that fires on the vulnerable code and stops firing on the fix, regression tests adapted from the vulnerability's proof-of-concept, and the reward function that composes all of it.
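The environment's checker is written in Globstar against the Go codebase. As a language-agnostic illustration of what "fires on the vulnerable code and stops firing on the fix" means, here is a toy Python AST checker (entirely hypothetical, not the actual Globstar rule) that flags `os.path.join` calls receiving a bare, unsanitized variable as a path segment:

```python
import ast

def _is_os_path_join(func: ast.expr) -> bool:
    # Matches the attribute chain os.path.join exactly.
    return (isinstance(func, ast.Attribute) and func.attr == "join"
            and isinstance(func.value, ast.Attribute) and func.value.attr == "path"
            and isinstance(func.value.value, ast.Name) and func.value.value.id == "os")

def checker_fires(source: str) -> bool:
    """Fire when os.path.join gets a bare variable as a path segment,
    i.e. input that was never routed through a sanitizing call."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call) and _is_os_path_join(node.func)
                and any(isinstance(arg, ast.Name) for arg in node.args[1:])):
            return True
    return False
```

Against `os.path.join(base, p)` the checker fires; against `os.path.join(base, sanitize(p))` it goes quiet, which is exactly the before/after behavior the target_issue_fixed dimension rewards.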
Reward Vector
| Dimension | Weight | What it measures |
|---|---|---|
| target_issue_fixed | 0.40 | Globstar checker stops firing |
| no_new_issues | 0.25 | No new code quality / security violations introduced |
| tests_pass | 0.20 | Regression tests still pass |
| code_quality_delta | 0.10 | Change in overall issue count across the repo |
| minimal_diff | 0.05 | Surgical change vs. sweeping refactor |
These default weights are illustrative; the vector is meant to be composed differently depending on what you're training for. Each dimension is independently verifiable.
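Composing the vector into a scalar is just a weighted sum. A minimal sketch with the illustrative weights above (the function name is ours, not the environment's API):

```python
# Illustrative default weights from the table above; each score is in [0, 1].
WEIGHTS = {
    "target_issue_fixed": 0.40,
    "no_new_issues": 0.25,
    "tests_pass": 0.20,
    "code_quality_delta": 0.10,
    "minimal_diff": 0.05,
}

def scalar_reward(vector: dict) -> float:
    """Collapse a per-dimension reward vector into one training scalar."""
    return sum(w * vector.get(dim, 0.0) for dim, w in WEIGHTS.items())
```

Under these weights, a perfect fix with a slightly verbose diff (minimal_diff = 0.9) scores 0.995, which is why most of the baseline scalars below cluster just under 1.0.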
- Repo: github.com/deepsourcecorp/gym-prototype
- Walkthrough (7 min): Loom
- Full raw results: baseline JSONs
Training Signal
We ran 10 frontier models through the environment three times each: Claude Haiku/Opus/Sonnet, GPT-5.4, GPT-5.4-mini, GPT-5.2-codex, and GPT-5.3-codex, and Gemini 2.5 Pro, 3 Flash Preview, and 3.1 Pro Preview. Twenty-nine rollouts total (one Gemini 3.1 Pro Preview run hit a 500 and was dropped). Every rollout silenced the Globstar checker. Most scored above 0.97 on scalar reward.
It's a training environment, and training environments measure success differently. A benchmark is useless when every model scores 100%. A training environment is useful whenever the reward function has gradient for the model to learn from — and a well-designed reward vector has gradient along dimensions orthogonal to whether the task was solved.
Look at the dot plot below. Every dot is one rollout.

[Dot plot: per-rollout scalar reward for GHSA-xjvp-7243-rg9h, one dot per run, horizontally jittered within its model's column; per-run baseline JSONs linked above.]

target_issue_fixed is saturated across all 29 runs. So are no_new_issues and code_quality_delta — the first because the checker pack here is three generic Go rules (a real environment would ship hundreds), the second because it's a v1 placeholder that always returns 1.0. If those three were our only signals, the training loop would see a flat reward across the whole cohort and learn nothing from these rollouts — exactly the failure mode of binary checker-only rewards on tasks modern models already solve.
Two dimensions actually discriminate. tests_pass is decisive — 27 of 29 rollouts pass, but gpt-5.3-codex fails it on 2 of its 3 runs. minimal_diff is continuous — it ranges from 0.00 to 0.97 across models, and within a single model it can span up to 0.96 (gpt-5.4-mini, one run at 0.0 and two above 0.93). Between the binary tests gate and the continuous minimal-diff gradient, there is real signal to train against even on an environment every frontier model solves.
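The document doesn't specify how minimal_diff is computed. One plausible shape for such a continuous score (entirely our assumption, not the environment's actual formula) compares the attempt's changed-line count against the gold patch:

```python
def minimal_diff_score(changed_lines: int, gold_lines: int) -> float:
    """Hypothetical sketch: 1.0 for diffs no larger than the gold patch,
    decaying linearly to 0.0 as the diff grows to 10x the gold size."""
    if changed_lines <= gold_lines:
        return 1.0
    excess = changed_lines - gold_lines
    return max(0.0, 1.0 - excess / (9 * gold_lines))
```

Any scheme of this shape yields the behavior seen in the rollouts: surgical fixes score near 1.0, sweeping refactors fall toward 0.0, and everything in between produces usable gradient.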
Three specific patterns worth naming:
Checker-gaming, caught by tests. gpt-5.3-codex silenced the Globstar checker in all 3 runs — but the regression tests caught 2 of those as broken fixes. The path traversal still works; the model just made the checker stop firing. Without the tests_pass gate, the scalar would reward pure gaming at 0.997. With it, the same trajectory scores 0.799. This is the anti-gaming design working as intended, and the core reason a reward vector beats a scalar checker signal.
Verbose fixes as a training artifact. claude-opus-4-7 scored around 0.48 on minimal_diff across all 3 runs — its fixes consistently touched more code than necessary, even though the smaller Claudes (Sonnet 4.6 at 0.88, Haiku 4.5 at 0.91) did not. The pattern is stable across rollouts, suggesting a training artifact rather than a one-off. A minimal_diff reward during post-training would likely correct this.
Single-run rankings mislead. The first-pass leaderboard put gemini-2.5-pro alone at the top. With three rollouts each, it drops to fifth. gpt-5.4 has the tightest distribution in the set (0.001 span — boringly consistent). gpt-5.4-mini has the widest (a 0.96 span on minimal_diff: one run scored zero, the other two above 0.93). claude-sonnet-4-6 looked like a middling performer at n=1 (0.977) and is actually fourth at n=3 (0.994) — its first rollout was an unlucky long-diff trajectory. n=1 comparisons between frontier coding models aren't meaningful; the variance lives within a model, not just between them.
Baseline table
Scalar reward is the weighted sum of the vector, averaged over 3 rollouts per model. Sorted by mean descending.
| Model | Runs | Mean | Range | Turns | target | tests | minimal |
|---|---|---|---|---|---|---|---|
| gpt-5.4 | 3 | 0.997 | 0.997‑0.998 | 14 | 1.00 | 1.00 | 0.95 |
| gemini-3-flash-preview | 3 | 0.997 | 0.995‑0.998 | 21 | 1.00 | 1.00 | 0.94 |
| claude-haiku-4-5 | 3 | 0.996 | 0.993‑0.997 | 22 | 1.00 | 1.00 | 0.91 |
| claude-sonnet-4-6 | 3 | 0.994 | 0.990‑0.997 | 11 | 1.00 | 1.00 | 0.88 |
| gemini-2.5-pro | 3 | 0.993 | 0.983‑0.998 | 11 | 1.00 | 1.00 | 0.85 |
| gpt-5.4-mini | 3 | 0.982 | 0.950‑0.998 | 20 | 1.00 | 1.00 | 0.63 |
| gemini-3.1-pro-preview | 2 | 0.981 | 0.981‑0.982 | 26 | 1.00 | 1.00 | 0.63 |
| gpt-5.2-codex | 3 | 0.977 | 0.972‑0.979 | 16 | 1.00 | 1.00 | 0.53 |
| claude-opus-4-7 | 3 | 0.974 | 0.972‑0.975 | 13 | 1.00 | 1.00 | 0.48 |
| gpt-5.3-codex | 3 | 0.865 | 0.798‑0.997 | 10 | 1.00 | 0.33 | 0.97 |
Why this approach
Most code RL environments today reward agents on a binary test-pass signal. That signal has two problems:
- It's sparse. Only one bit of information per rollout.
- It's gameable. Under training pressure, an agent learns the shortest path to pass the test, which is frequently not the path to actually fix the problem.
Static analysis gives a reward vector instead of a scalar. Each dimension is independently verifiable, and reward-hacking one dimension usually moves another in the wrong direction.
The DeepSeek-R1 paper made the broader version of this case: rule-based verifiable rewards support longer RL training with less risk of reward hacking and collapse than learned neural reward models. Static analysis is a rule-based verifiable reward for code.
For code security specifically, the signal matters more. Test suites rarely cover the vulnerability classes worth training on; many CVEs aren't detectable by a repo's existing tests. A semantic checker written against the vulnerability pattern is. It's also the same artifact production security tools use to detect the bug in the first place, so training on it aligns the model with the detection surface that will eventually evaluate its output in the real world.
Data
OSV publishes several hundred CVEs per day. DeepSource has spent seven years building static analysis infrastructure that ingests, categorizes, and checks this kind of data at volume. Our target is to take a high-severity vulnerability from OSV publication to a usable environment within a week.
The data is fresh by construction. Every CVE published this week post-dates the training cutoff of every frontier model currently in existence. That's a meaningful hedge against training-eval contamination, which any corpus built on pre-2025 data suffers from by default.
Open Questions
Does the reward vector generalize across vulnerability categories?
The current five-dimension vector works on one path-traversal environment. Whether the same weights produce useful signal across SQL injection, auth bugs, memory safety issues, and deserialization is an empirical question. We're building 20–30 more environments in the next few weeks to find out.
Does training on these environments produce transferable skill?
A reward signal is only useful if fine-tuning on it improves capability on held-out vulnerabilities the model never saw during training. We're scoping a small SFT experiment on an open-weight coder model with this in mind.
What's the failure mode under real training pressure?
Multi-signal reward is a defense against reward hacking, not an immunity. Labs training for 100,000+ rollouts will find exploits we didn't anticipate. We're looking for researchers willing to pressure-test the reward function.
What's the right difficulty mix for a corpus?
Frontier models solve this environment cleanly, which is good for vector-dimension training and for teaching smaller models, but not sufficient alone for frontier evaluation. For labs training at the capability edge, environments where current models fail at the checker or tests are also needed. Those come from CVEs requiring interprocedural reasoning, multi-file fixes, or subtle semantic bugs. Our mining pipeline can filter for complexity, so we can produce both — the right ratio is empirical.
About DeepSource
At DeepSource (YC W20), we've been building code security tools for 7+ years, serving thousands of teams and scanning millions of commits per month across sixteen languages. Along the way, we've built a static analysis engine from the ground up, optimized for fast authoring of new checkers and low false-positive rates.
With DeepSource Gym, we're applying that platform and experience to making AI models better at code security. We believe we're uniquely positioned to do this.
Contact
We're specifically looking to talk with post-training researchers at frontier labs working on coding agents, people who've built RL environments commercially in adjacent domains, and anyone who has tried to train code agents on security tasks and hit specific walls.
Honest negative feedback is as useful to us as interest.