
Conversation

@mikasenghaas (Member) commented Dec 21, 2025

Description

This PR implements the idea of MonitorRubric, a simple extension of Rubric that only allows registering metrics, i.e. reward functions with weight 0.0, along with simple helpers to build up metric logging across environment hierarchies. The idea is that monitor rubrics are typically tied to specific environments (e.g. ToolEnv has ToolMonitorRubric, SandboxEnv has SandboxMonitorRubric), and these monitor rubrics are added in an additive fashion, so that metrics across the whole inheritance tree are respected and logged. For example, in the math_python environment, the custom MathRubric is registered as the base rubric to determine the main reward, but we may also want to emit metrics for observability. In the math-python example, the inheritance tree is PythonEnv(SandboxEnv(StatefulToolEnv(ToolEnv(MultiTurnEnv(...))))). At each level of this hierarchy, different metrics should be emitted, e.g.:

  • MultiTurnEnv: num_turns
  • ToolEnv: total_tool_calls, {tool_name}_calls (for each tool)
  • SandboxEnv: sandbox_ready_wait_time, sandbox_command_execution_time
  • PythonEnv: python_ready_wait_time

This is implemented with a monitor rubric for each env. As a result, any future env inheriting from envs with a monitor rubric gets the logging of the parent envs for free, making this a very general pattern to aid observability during env execution.
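
As a rough illustration of the pattern (a sketch, not code from this PR: the exact MonitorRubric helpers and reward-function signature are assumptions), an env-specific monitor rubric boils down to registering weight-0.0 reward functions:

```python
import verifiers as vf


# Hypothetical monitor rubric for illustration only. Metric functions are
# ordinary reward functions; registering them with weight 0.0 means they are
# logged alongside rewards but never contribute to the score.
class ExampleMonitorRubric(vf.MonitorRubric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_reward_func(self.num_turns, weight=0.0)

    def num_turns(self, state, **kwargs) -> float:
        # Assumes the rollout state keeps one entry per model response;
        # the actual state layout may differ.
        return float(len(state.get("responses", [])))
```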

Example

uv run vf-eval math-python -n1 -r1 -v

We see the following registered monitors from the respective environments:

[Screenshot: vf-eval output listing the registered monitor metrics per environment]

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Adds lightweight, additive monitoring metrics across the environment hierarchy and unifies rubric composition.

  • Introduces Environment.add_rubric to compose rubrics (wraps into RubricGroup when needed; a minimal sketch follows this list)
  • Adds monitor rubrics:
    • MultiTurnMonitorRubric → num_turns
    • ToolMonitorRubric (in ToolEnv) → total_tool_calls, {tool}_calls
    • SandboxMonitorRubric → sandbox_ready_wait_time, sandbox_command_execution_time (avg), with state tracking of ready_wait_time and per-command timings
    • PythonMonitorRubric → python_ready_wait_time, with recorded worker ready_wait_time
  • Updates SandboxEnv/PythonEnv to record and log readiness/command timings; propagates state fields accordingly
  • Simplifies math_python.load_environment to return PythonEnv directly (no explicit tool rubric wiring)
  • Removes deprecated ToolRubric and its exports; updates tests to include num_turns in EnvGroupRubric metrics
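
A minimal sketch of what the add_rubric composition could look like (the RubricGroup handling below is an assumption for illustration, not the exact implementation in this PR):

```python
import verifiers as vf


# Hypothetical composition helper on Environment: keep a single rubric as-is,
# and wrap any additional rubric into a RubricGroup together with the existing
# one so that rewards and monitor metrics from all of them are computed.
def add_rubric(self, rubric: vf.Rubric) -> None:
    if self.rubric is None:
        self.rubric = rubric
    elif isinstance(self.rubric, vf.RubricGroup):
        self.rubric.rubrics.append(rubric)
    else:
        self.rubric = vf.RubricGroup(rubrics=[self.rubric, rubric])
```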

Written by Cursor Bugbot for commit 573d68f.

logger = logging.getLogger(__name__)


class MultiTurnMonitorRubric(MonitorRubric):
Member:

small nit -- what's the need for a distinct MultiTurnMonitorRubric class? All env classes currently extend MultiTurnEnv already, and build up the trajectory (even if single-turn)

Member:

Risks other extensions (e.g. SandboxMonitorRubric) not being multi-turn compatible by default, which seems undesirable.

Member Author:

hm, i'm not sure i understand fully, but the idea is that each env may create an associated monitor rubric, e.g.:

  • MultiTurnEnv has MultiTurnMonitorRubric
  • ToolEnv has ToolMonitorRubric
  • SandboxEnv has SandboxMonitorRubric
  • etc.

in this sense, MultiTurnMonitorRubric is just the monitor rubric class associated with the base multi-turn env and will get registered with it by default. since all envs extend MultiTurnEnv, all envs will have this monitor metric registered. i agree that for single-turn envs it might make sense to remove the turn count, as it's a bit silly there

command_execution_times: list[float]


class SandboxMonitorRubric(vf.MonitorRubric):
Member:

Is this serving a different role than MultiTurnMonitorRubric? is the idea to have many monitor rubrics which track new things introduced at different hierarchies?

Member Author:

yea, the sandbox monitor rubric would only implement monitor metrics specific to the sandbox env -- i.e. currently sandbox_ready_wait_time and sandbox_command_execution_time. if every env in the env hierarchy (e.g. for math-python we have PythonEnv(SandboxEnv(StatefulToolEnv(ToolEnv(MultiTurnEnv(...)))))) registers env-specific monitor metrics, any env that inherits from any of these gets all the logging for free
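
For illustration, the additive registration described here could look roughly like this (a hedged sketch with assumed constructor shapes, not the actual PR code):

```python
import verifiers as vf


# Placeholders standing in for the rubrics added by this PR.
class SandboxMonitorRubric(vf.MonitorRubric): ...
class PythonMonitorRubric(vf.MonitorRubric): ...


# Hypothetical sketch: each env level registers its own monitor rubric in
# __init__, so a subclass inherits every parent level's metrics automatically.
class SandboxEnv(vf.StatefulToolEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # parents register their own monitor rubrics
        self.add_rubric(SandboxMonitorRubric())


class PythonEnv(SandboxEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # inherits sandbox, tool, and turn metrics for free
        self.add_rubric(PythonMonitorRubric())
```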

@willccbb (Member):

Semi-related -- #645 adds an option to bypass scoring, which currently would skip all Rubric classes. Do we want to treat MonitorRubric instances as fundamentally separate in the case of a RubricGroup?

@mikasenghaas (Member Author) commented Jan 5, 2026

> Semi-related -- #645 adds an option to bypass scoring, which currently would skip all Rubric classes. Do we want to treat MonitorRubric instances as fundamentally separate in the case of a RubricGroup?

ah yea, we probably always wanna run the monitor rubrics, even if scoring is disabled. any ideas on what would be a good way of distinguishing those?

i feel like this could be a separate pr tho, bc we would likely have to distinguish subrewards/metrics at the state and rubric level, so it would be a bigger change than what this pr proposes.
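
One possible way to make that distinction (purely illustrative; the function and flag names below are hypothetical and not part of #645 or this PR):

```python
from typing import Sequence

import verifiers as vf


def rubrics_to_run(rubrics: Sequence[vf.Rubric], skip_scoring: bool) -> list[vf.Rubric]:
    """Select which rubrics to evaluate for a rollout.

    When scoring is bypassed, weighted reward rubrics are dropped while
    MonitorRubric instances (weight-0.0 metrics) are still run and logged.
    """
    if not skip_scoring:
        return list(rubrics)
    return [r for r in rubrics if isinstance(r, vf.MonitorRubric)]
```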

@mikasenghaas marked this pull request as ready for review January 5, 2026 17:14
@mikasenghaas requested a review from willccbb January 5, 2026 17:14
}
super().__init__(oai_tools=self.oai_tools, max_turns=max_turns, **kwargs)

self.add_rubric(ToolMonitorRubric(tools=self.tools))

Per-tool metrics not tracked for dynamically added tools

ToolMonitorRubric computes tool_names at initialization time from the tools list passed to it. However, ToolEnv.__init__ creates the rubric before subclasses like SandboxEnv and PythonEnv call add_tool() to register their tools (bash, python). Since tool_names is computed once at init as an empty list, per-tool metrics like python_calls and bash_calls are never registered, defeating the purpose of the {tool_name}_calls feature described in the PR.


Member Author:

ah, yea this is fair, but i think this wasn't supported previously with ToolRubric either? handling newly added tools would be easy to add; handling removed tools would be a bit harder bc we would have to remove a reward func from a rubric, which we currently don't have support for

Member:

ah we should probably fix this (can be separate PR), shouldn't be too hard

previously we weren't using tool_rubric that heavily, and it already forced you to pass the list of tools, so it wasn't too big of a problem to just have the added tool in that list as well
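
A rough sketch of the fix being discussed (hypothetical helper and state-field names; the real add_tool / ToolMonitorRubric interfaces may differ): when a tool is registered dynamically, the monitor rubric could register a matching per-tool metric at that point.

```python
import verifiers as vf


# Hypothetical sketch, not the PR's code: grow a new {tool_name}_calls metric
# whenever a tool is added after the rubric was constructed.
class ToolMonitorRubric(vf.MonitorRubric):
    def add_tool_metric(self, tool_name: str) -> None:
        def tool_calls(state, **kwargs) -> float:
            # Count recorded calls to this tool; the exact rollout-state
            # layout ("tool_calls" with "name" entries) is an assumption.
            calls = state.get("tool_calls", [])
            return float(sum(1 for c in calls if c.get("name") == tool_name))

        tool_calls.__name__ = f"{tool_name}_calls"  # metric is logged under this name
        self.add_reward_func(tool_calls, weight=0.0)
```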

@willccbb merged commit 6f74cb0 into main Jan 5, 2026
6 checks passed