DeepSource Gym

We're building RL environments to train and evaluate frontier coding agents on real-world security vulnerabilities, scored by a dense reward signal.

Each environment starts from a vulnerable repository checkout grounded in a real CVE. The agent gets a bounded prompt, investigates the code with normal project tools, and submits a patch. A deterministic verifier scores whether the patch closes the security issue, preserves legitimate behavior, and avoids reward hacking.

That score breaks down across four capabilities:

  • Finding vulnerable behavior in unfamiliar source code.
  • Fixing the vulnerability at the right security boundary.
  • Validating the change against security regressions and normal behavior.
  • Producing a quality repair, scored on locality, minimality, and side effects.

The result is useful both as an evaluation benchmark and as a training environment for agents that need to reason from root cause to maintainable repair.

Environments

The preview set is six environments. Three are linked below. New environments are scoped and delivered on demand.

The suite is selected for work that current agents still find difficult: localizing vulnerable behavior in unfamiliar code, reasoning about security boundaries, preserving existing behavior, and avoiding shortcuts that satisfy a narrow check while leaving the bug intact.

Evaluation

Each rollout starts from a vulnerable checkout in an isolated environment. The model runs through mini-swe-agent, investigates the repository, and submits a patch. It does not receive the solution diff, patch location, verifier implementation, or hidden validation cases.

A run gets credit for fixing the security boundary while keeping the surrounding feature usable. We track partial repairs because they are common: a model may find the right file but patch too late, close the bug by blocking legitimate traffic, or fix the security path while changing the public contract. Those failures are different, and the scoring should preserve that difference.

Run condition     Public setting
Disclosure        Blind, class, behavior, subsystem, or recipe
Starting point    Vulnerable repository checkout
Harness           mini-swe-agent
Task complexity   Medium-horizon agentic, tens to hundreds of tool turns
Submission        Patch against the working tree
Reward signal     Deterministic: security regression tests plus rule-based patch-quality scoring (no LLM judge)
Score             Normalized aggregate from 0.0 to 1.0
Delivery format   TerminalBench / Harbor compatible

Results

Scores below are from an internal sampling run: three samples per model on each of the six preview environments, for 18 runs per model. Score is the mean oracle pass rate over those runs, from 0.0 to 1.0. Full and partial are environment-level outcome rates; turns, wall time, and input tokens are per-run process metrics.

Model                   Score  Full  Partial  Turns  Wall   Tokens in
GPT-5.5                 0.722  67%   17%      44     567s   1.03M
Claude Opus 4.7         0.611  33%   67%      39     249s   1.23M
GPT-5.3 Codex           0.611  50%   17%      28     396s   447k
Claude Haiku 4.5        0.556  50%   17%      60     235s   1.96M
Gemini 3.1 Pro Preview  0.556  50%   17%      38     248s   766k
Claude Opus 4.6         0.500  33%   33%      41     241s   893k
Kimi K2.6               0.444  17%   67%      46     1413s  1.40M
GLM-5.1                 0.444  17%   50%      53     380s   1.74M
Gemini 3 Flash Preview  0.389  33%   17%      33     156s   778k
MiniMax M2.7            0.389  17%   33%      72     909s   2.78M
DeepSeek V4 Pro         0.333  0%    50%      76     1061s  2.06M
GPT-5.4 Mini            0.333  17%   33%      35     332s   748k

Each model was sampled three times on each preview environment. Full is the share of environments where all three samples passed. Partial is the share of environments where at least one sample passed but not all three. Three samples per environment is a preview-grade sampling rate, so differences between adjacent scores are within sampling noise.
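
The three headline numbers can be derived from per-environment pass counts. The sketch below is illustrative only: the function names are hypothetical, the pass counts are invented rather than taken from the run logs, and the standard-error estimate assumes the 18 runs behave like independent pass/fail trials, which is a simplification.

```python
from math import sqrt


def summarize(passes_per_env, samples=3):
    """Return (score, full_rate, partial_rate) for one model.

    passes_per_env holds, for each environment, how many of the
    `samples` runs passed the verifier.
    """
    n_envs = len(passes_per_env)
    # Score: mean pass rate over every run.
    score = sum(passes_per_env) / (n_envs * samples)
    # Full: share of environments where every sample passed.
    full = sum(1 for p in passes_per_env if p == samples) / n_envs
    # Partial: share where at least one sample passed, but not all.
    partial = sum(1 for p in passes_per_env if 0 < p < samples) / n_envs
    return score, full, partial


def binomial_stderr(score, runs):
    """Rough standard error of a pass rate (assumes independent runs)."""
    return sqrt(score * (1 - score) / runs)


# Hypothetical pass counts for six environments, three samples each.
example = [3, 3, 3, 3, 1, 0]
score, full, partial = summarize(example)
print(f"score={score:.3f} full={full:.0%} partial={partial:.0%}")
print(f"stderr={binomial_stderr(score, 18):.3f}")
```

Under that independence assumption, the standard error at 18 runs is roughly 0.1, which is larger than the gap between most neighboring rows in the table.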

The suite is not saturated by frontier models, and it is not so hard that every run collapses to zero. Many attempts get partway to a repair; the environment keeps those near-misses visible instead of flattening them into a single pass/fail bit.

Disclosure

The same environment can be run at five disclosure levels. The blind tier gives only the task shape: there is a CVE in the repository, find it and patch it. The class tier adds the vulnerability class, advisory context, and reproducer. Higher tiers add the required behavior, the affected subsystem, or a near-spelled-out repair recipe.

This lets one work item support both evaluation and training without turning the public description into a solution recipe.
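
One way to picture the tiers is as an ordered set of hints layered on a fixed base prompt. The sketch below is an assumption, not the Gym's prompt format: the `Disclosure` enum, the `build_prompt` helper, the hint strings, and the choice to make tiers cumulative are all hypothetical.

```python
from enum import IntEnum


class Disclosure(IntEnum):
    """Disclosure tiers, ordered from least to most revealing (assumed order)."""
    BLIND = 0
    CLASS = 1
    BEHAVIOR = 2
    SUBSYSTEM = 3
    RECIPE = 4


# The blind tier gives only the task shape.
BASE_PROMPT = "There is a CVE in this repository. Find it and patch it."


def build_prompt(hints, level):
    """Assemble a prompt for one tier.

    hints maps a Disclosure tier to its hint text (hypothetical task
    metadata). Tiers are treated as cumulative here, which is an
    assumption of this sketch.
    """
    parts = [BASE_PROMPT]
    for tier in Disclosure:
        if tier is Disclosure.BLIND or tier > level:
            continue
        if tier in hints:
            parts.append(hints[tier])
    return "\n".join(parts)


# Invented hints for illustration only.
hints = {
    Disclosure.CLASS: "Class: path traversal.",
    Disclosure.SUBSYSTEM: "Subsystem: archive extraction.",
}
print(build_prompt(hints, Disclosure.CLASS))
```

Keeping the hints in task metadata rather than in the public description means the same environment can be rerun at a different tier without editing the task itself.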

Validation

Every environment passes a fixed gate before it counts. The vulnerable baseline must fail, the known repair must pass, ordinary behavior around the patched feature must still work, and a sabotage-detection pass rejects shortcut patches that simply remove functionality, skip tests, stub the vulnerable path, or rely on access to solution artifacts.
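
The gate described above can be sketched as four boolean checks that must all hold. This is a minimal illustration, not the actual validation code: `GateEvidence`, its field names, and `gate_failures` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateEvidence:
    """Outcomes collected while validating one environment (hypothetical shape)."""
    baseline_security_pass: bool     # security tests on the vulnerable checkout
    reference_security_pass: bool    # security tests with the known repair applied
    reference_behavior_pass: bool    # ordinary-behavior tests with the known repair
    sabotage_patches_rejected: bool  # every shortcut patch was rejected


def gate_failures(ev):
    """Return the reasons an environment fails the gate (empty list = passes)."""
    failures = []
    if ev.baseline_security_pass:
        failures.append("vulnerable baseline did not fail")
    if not ev.reference_security_pass:
        failures.append("known repair did not pass")
    if not ev.reference_behavior_pass:
        failures.append("ordinary behavior broke under the known repair")
    if not ev.sabotage_patches_rejected:
        failures.append("a shortcut patch was accepted")
    return failures


def passes_gate(ev):
    return not gate_failures(ev)


good = GateEvidence(False, True, True, True)
print(passes_gate(good))  # True
```

Returning the list of reasons, rather than a single boolean, makes it easier to see which part of an environment needs rework before it is admitted.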

For technical review we can share the evidence behind the public scores: task metadata, verifier traces, run logs, candidate patches, and the environment packages themselves.

Contact

Email gym@deepsource.com for pilot access.