DeepSource Gym
We're building RL environments to train frontier coding agents on real-world security vulnerabilities, scored by static analysis with a dense reward signal.
In our environments, models fix real vulnerabilities from CVEs in production codebases, often published within the past week. Our static analysis engine scores each attempt on several orthogonal dimensions covering code quality and the effectiveness of the fix.
These scores become reward signals during training: a denser instrument than a binary test-pass signal for evaluating frontier model performance on code security.
Prototype
We picked GHSA-xjvp-7243-rg9h, a path traversal vulnerability in a popular Go library published less than a week before we started, to create a fully functioning RL environment.
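The vulnerability class itself is easy to illustrate. Here is a minimal sketch of the bug pattern and its fix, written in Python rather than Go for brevity; the function names are ours, not from the affected library:

```python
import os

def resolve_vulnerable(base: str, user_path: str) -> str:
    # Vulnerable: "../" segments in user_path can escape base entirely.
    return os.path.join(base, user_path)

def resolve_fixed(base: str, user_path: str) -> str:
    # Fixed: normalize the joined path, then verify it still lives under base.
    base_real = os.path.realpath(base)
    candidate = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base_real, candidate]) != base_real:
        raise ValueError("path traversal attempt")
    return candidate
```

A gold patch for this class of bug is often little more than the containment check in the second function; the regression tests then assert that `../`-style inputs raise instead of resolving.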
The environment includes a containerized pre-state that reproduces the security vulnerability, a gold patch that demonstrates the correct fix, a Globstar-based semantic checker that fires on the vulnerable code and stops firing on the fix, regression tests adapted from the vulnerability's proof-of-concept, and the reward function that composes all of it.
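The environment's checker is written in Globstar against the Go codebase. As a language-agnostic illustration of what "fires on the vulnerable code and stops firing on the fix" means, here is a toy Python AST checker (entirely hypothetical, not the actual Globstar rule) that flags `os.path.join` calls receiving a bare, unsanitized variable as a path segment:

```python
import ast

def _is_os_path_join(func: ast.expr) -> bool:
    # Matches the attribute chain os.path.join exactly.
    return (isinstance(func, ast.Attribute) and func.attr == "join"
            and isinstance(func.value, ast.Attribute) and func.value.attr == "path"
            and isinstance(func.value.value, ast.Name) and func.value.value.id == "os")

def checker_fires(source: str) -> bool:
    """Fire when os.path.join gets a bare variable as a path segment,
    i.e. input that was never routed through a sanitizing call."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call) and _is_os_path_join(node.func)
                and any(isinstance(arg, ast.Name) for arg in node.args[1:])):
            return True
    return False
```

Against `os.path.join(base, p)` the checker fires; against `os.path.join(base, sanitize(p))` it goes quiet, which is exactly the before/after behavior the target_issue_fixed dimension rewards.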
Reward Vector
| Dimension | Weight | What it measures |
|---|---|---|
| target_issue_fixed | 0.40 | Globstar checker stops firing |
| no_new_issues | 0.25 | No new code quality / security violations introduced |
| tests_pass | 0.20 | Regression tests still pass |
| code_quality_delta | 0.10 | Change in overall issue count across the repo |
| minimal_diff | 0.05 | Surgical change vs. sweeping refactor |
These default weights are illustrative; the vector is meant to be composed differently depending on what you're training for. Each dimension is independently verifiable.
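Composing the vector into a scalar is just a weighted sum. A minimal sketch with the illustrative weights above (the function name is ours, not the environment's API):

```python
# Illustrative default weights from the table above; each score is in [0, 1].
WEIGHTS = {
    "target_issue_fixed": 0.40,
    "no_new_issues": 0.25,
    "tests_pass": 0.20,
    "code_quality_delta": 0.10,
    "minimal_diff": 0.05,
}

def scalar_reward(vector: dict) -> float:
    """Collapse a per-dimension reward vector into one training scalar."""
    return sum(w * vector.get(dim, 0.0) for dim, w in WEIGHTS.items())
```

Under these weights, a perfect fix with a slightly verbose diff (minimal_diff = 0.9) scores 0.995, which is why most of the baseline scalars below cluster just under 1.0.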
- Repo: github.com/deepsourcecorp/gym-prototype
- Walkthrough (7 min): Loom
- Full raw results: baseline JSONs
Training Signal
We ran 10 frontier models through the environment three times each: Claude Haiku/Opus/Sonnet, GPT-5.4, GPT-5.4-mini, GPT-5.2-codex, and GPT-5.3-codex, and Gemini 2.5 Pro, 3 Flash Preview, and 3.1 Pro Preview. Twenty-nine rollouts total (one Gemini 3.1 Pro Preview run hit a 500 and was dropped). Every rollout silenced the Globstar checker. Most scored above 0.97 on scalar reward.
It's a training environment, and training environments measure success differently. A benchmark is useless when every model scores 100%. A training environment is useful whenever the reward function has gradient for the model to learn from — and a well-designed reward vector has gradient along dimensions orthogonal to whether the task was solved.
Look at the dot plot below. Every dot is one rollout.

[Dot plot: per-rollout scalar reward for GHSA-xjvp-7243-rg9h, one dot per run, horizontally jittered within its model's column; per-run baseline JSONs linked above.]

target_issue_fixed is saturated across all 29 runs. So are no_new_issues and code_quality_delta — the first because the checker pack here is three generic Go rules (a real environment would ship hundreds), the second because it's a v1 placeholder that always returns 1.0. If those three were our only signals, the training loop would see a flat reward across the whole cohort and learn nothing from these rollouts — exactly the failure mode of binary checker-only rewards on tasks modern models already solve.
Two dimensions actually discriminate. tests_pass is decisive — 27 of 29 rollouts pass, but gpt-5.3-codex fails it on 2 of its 3 runs. minimal_diff is continuous — it ranges from 0.00 to 0.97 across models, and within a single model it can span up to 0.96 (gpt-5.4-mini, one run at 0.0 and two above 0.93). Between the binary tests gate and the continuous minimal-diff gradient, there is real signal to train against even on an environment every frontier model solves.
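The document doesn't specify how minimal_diff is computed. One plausible shape for such a continuous score (entirely our assumption, not the environment's actual formula) compares the attempt's changed-line count against the gold patch:

```python
def minimal_diff_score(changed_lines: int, gold_lines: int) -> float:
    """Hypothetical sketch: 1.0 for diffs no larger than the gold patch,
    decaying linearly to 0.0 as the diff grows to 10x the gold size."""
    if changed_lines <= gold_lines:
        return 1.0
    excess = changed_lines - gold_lines
    return max(0.0, 1.0 - excess / (9 * gold_lines))
```

Any scheme of this shape yields the behavior seen in the rollouts: surgical fixes score near 1.0, sweeping refactors fall toward 0.0, and everything in between produces usable gradient.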
Three specific patterns worth naming:
Checker-gaming, caught by tests. gpt-5.3-codex silenced the Globstar checker in all 3 runs — but the regression tests caught 2 of those as broken fixes. The path traversal still works; the model just made the checker stop firing. Without the tests_pass gate, the scalar would reward pure gaming at 0.997. With it, the same trajectory scores 0.799. This is the anti-gaming design working as intended, and the core reason a reward vector beats a scalar checker signal.
Verbose fixes as a training artifact. claude-opus-4-7 scored around 0.48 on minimal_diff across all 3 runs — its fixes consistently touched more code than necessary, even though the smaller Claudes (Sonnet 4.6 at 0.88, Haiku 4.5 at 0.91) did not. The pattern is stable across rollouts, suggesting a training artifact rather than a one-off. A minimal_diff reward during post-training would likely correct this.
Single-run rankings mislead. The first-pass leaderboard put gemini-2.5-pro alone at the top. With three rollouts each, it drops to fifth. gpt-5.4 has the tightest distribution in the set (0.001 span — boringly consistent). gpt-5.4-mini has the widest (a 0.96 span on minimal_diff: one run scored zero, the other two above 0.93). claude-sonnet-4-6 looked like a middling performer at n=1 (0.977) and is actually fourth at n=3 (0.994) — its first rollout was an unlucky long-diff trajectory. n=1 comparisons between frontier coding models aren't meaningful; the variance lives within a model, not just between them.
Baseline table
Scalar reward is the weighted sum of the vector, averaged over 3 rollouts per model. Sorted by mean descending.
| Model | Runs | Mean | Range | Turns | target | tests | minimal |
|---|---|---|---|---|---|---|---|
| gpt-5.4 | 3 | 0.997 | 0.997‑0.998 | 14 | 1.00 | 1.00 | 0.95 |
| gemini-3-flash-preview | 3 | 0.997 | 0.995‑0.998 | 21 | 1.00 | 1.00 | 0.94 |
| claude-haiku-4-5 | 3 | 0.996 | 0.993‑0.997 | 22 | 1.00 | 1.00 | 0.91 |
| claude-sonnet-4-6 | 3 | 0.994 | 0.990‑0.997 | 11 | 1.00 | 1.00 | 0.88 |
| gemini-2.5-pro | 3 | 0.993 | 0.983‑0.998 | 11 | 1.00 | 1.00 | 0.85 |
| gpt-5.4-mini | 3 | 0.982 | 0.950‑0.998 | 20 | 1.00 | 1.00 | 0.63 |
| gemini-3.1-pro-preview | 2 | 0.981 | 0.981‑0.982 | 26 | 1.00 | 1.00 | 0.63 |
| gpt-5.2-codex | 3 | 0.977 | 0.972‑0.979 | 16 | 1.00 | 1.00 | 0.53 |
| claude-opus-4-7 | 3 | 0.974 | 0.972‑0.975 | 13 | 1.00 | 1.00 | 0.48 |
| gpt-5.3-codex | 3 | 0.865 | 0.798‑0.997 | 10 | 1.00 | 0.33 | 0.97 |
Why this approach
Most code RL environments today reward agents on a binary test-pass signal. That signal has two problems:
- It's sparse. Only one bit of information per rollout.
- It's gameable. Under training pressure, an agent learns the shortest path to pass the test, which is frequently not the path to actually fix the problem.
Static analysis gives a reward vector instead of a scalar. Each dimension is independently verifiable, and reward-hacking one dimension usually moves another in the wrong direction.
The DeepSeek-R1 paper made the broader version of this case: rule-based verifiable rewards support longer RL training with less risk of reward hacking and collapse than learned neural reward models. Static analysis is a rule-based verifiable reward for code.
For code security specifically, the signal matters more. Test suites rarely cover the vulnerability classes worth training on; many CVEs aren't detectable by a repo's existing tests. A semantic checker written against the vulnerability pattern is. It's also the same artifact production security tools use to detect the bug in the first place, so training on it aligns the model with the detection surface that will eventually evaluate its output in the real world.
Data
OSV publishes several hundred CVEs per day. DeepSource has spent seven years building static analysis infrastructure that ingests, categorizes, and checks this kind of data at volume. Our target is to take a high-severity vulnerability from OSV publication to a usable environment within a week.
The data is fresh by construction. Every CVE published this week post-dates the training cutoff of every frontier model currently in existence. That's a meaningful hedge against training-eval contamination, which any corpus built on pre-2025 data suffers from by default.
Open Questions
Does the reward vector generalize across vulnerability categories?
The current five-dimension vector works on one path-traversal environment. Whether the same weights produce useful signal across SQL injection, auth bugs, memory safety issues, and deserialization is an empirical question. We're building 20–30 more environments in the next few weeks to find out.
Does training on these environments produce transferable skill?
A reward signal is only useful if fine-tuning on it improves capability on held-out vulnerabilities the model never saw during training. We're scoping a small SFT experiment on an open-weight coder model with this in mind.
What's the failure mode under real training pressure?
Multi-signal reward is a defense against reward hacking, not an immunity. Labs training for 100,000+ rollouts will find exploits we didn't anticipate. We're looking for researchers willing to pressure-test the reward function.
What's the right difficulty mix for a corpus?
Frontier models solve this environment cleanly, which is good for vector-dimension training and for teaching smaller models, but not sufficient alone for frontier evaluation. For labs training at the capability edge, environments where current models fail at the checker or tests are also needed. Those come from CVEs requiring interprocedural reasoning, multi-file fixes, or subtle semantic bugs. Our mining pipeline can filter for complexity, so we can produce both — the right ratio is empirical.
About DeepSource
At DeepSource (YC W20), we've been building code security tools for 7+ years, serving thousands of teams and scanning millions of commits per month across sixteen languages. Along the way, we've built a static analysis engine from the ground up, optimized for fast authoring of new checkers and low false-positive rates.
With DeepSource Gym, we're applying that platform and experience to making AI models better at code security. We believe we're uniquely positioned to do this.
Contact
We're specifically looking to talk with post-training researchers at frontier labs working on coding agents, people who've built RL environments commercially in adjacent domains, and anyone who has tried to train code agents on security tasks and hit specific walls.
Honest negative feedback is as useful to us as interest.