
Conversation

@mikasenghaas (Member) commented Dec 21, 2025

Description

This PR implements the idea of MonitorRubric, a simple extension of Rubric that only allows registering metrics, i.e. reward functions with weight 0.0, along with simple helpers to build up metric logging across environment hierarchies. The idea is that monitor rubrics are typically tied to specific environments (e.g. ToolEnv has ToolMonitorRubric, SandboxEnv has SandboxMonitorRubric), and these monitor rubrics are added in an additive fashion, so that metrics across the whole inheritance tree are respected and logged. For example, in the math_python environment, the custom MathRubric is registered as the base rubric to determine the main reward, but we may also want to emit metrics for observability. In the math-python example, the inheritance tree is PythonEnv(SandboxEnv(StatefulToolEnv(ToolEnv(MultiTurnEnv(...))))). At each level of this hierarchy, different metrics should be emitted, e.g.:

  • MultiTurnEnv: num_turns
  • ToolEnv: total_tool_calls, {tool_name}_calls (for each tool)
  • SandboxEnv: sandbox_ready_wait_time, sandbox_command_execution_time
  • PythonEnv: python_ready_wait_time

This is implemented with a monitor rubric for each env. As a result, any future env inheriting from envs with a monitor rubric gets the logging of the parent envs for free, making this a very general pattern to aid observability during env execution.
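
As a rough illustration of the pattern (a sketch, not code from this PR: the exact MonitorRubric helpers and reward-function signature are assumptions), an env-specific monitor rubric boils down to registering weight-0.0 reward functions:

```python
import verifiers as vf


# Hypothetical monitor rubric for illustration only. Metric functions are
# ordinary reward functions; registering them with weight 0.0 means they are
# logged alongside rewards but never contribute to the score.
class ExampleMonitorRubric(vf.MonitorRubric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_reward_func(self.num_turns, weight=0.0)

    def num_turns(self, state, **kwargs) -> float:
        # Assumes the rollout state keeps one entry per model response;
        # the actual state layout may differ.
        return float(len(state.get("responses", [])))
```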

Example

uv run vf-eval math-python -n1 -r1 -v

We see the following registered monitors from the respective environments:

[Screenshot: vf-eval output listing the registered monitor metrics per environment]

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Adds lightweight, additive monitoring metrics across the environment hierarchy and unifies rubric composition.

  • Introduces Environment.add_rubric to compose rubrics (wraps into RubricGroup when needed; a minimal sketch follows this list)
  • Adds monitor rubrics:
    • MultiTurnMonitorRubric → num_turns
    • ToolMonitorRubric (in ToolEnv) → total_tool_calls, {tool}_calls
    • SandboxMonitorRubric → sandbox_ready_wait_time, sandbox_command_execution_time (avg), with state tracking of ready_wait_time and per-command timings
    • PythonMonitorRubric → python_ready_wait_time, with recorded worker ready_wait_time
  • Updates SandboxEnv/PythonEnv to record and log readiness/command timings; propagates state fields accordingly
  • Simplifies math_python.load_environment to return PythonEnv directly (no explicit tool rubric wiring)
  • Removes deprecated ToolRubric and its exports; updates tests to include num_turns in EnvGroupRubric metrics
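
A minimal sketch of what the add_rubric composition could look like (the RubricGroup handling below is an assumption for illustration, not the exact implementation in this PR):

```python
import verifiers as vf


# Hypothetical composition helper on Environment: keep a single rubric as-is,
# and wrap any additional rubric into a RubricGroup together with the existing
# one so that rewards and monitor metrics from all of them are computed.
def add_rubric(self, rubric: vf.Rubric) -> None:
    if self.rubric is None:
        self.rubric = rubric
    elif isinstance(self.rubric, vf.RubricGroup):
        self.rubric.rubrics.append(rubric)
    else:
        self.rubric = vf.RubricGroup(rubrics=[self.rubric, rubric])
```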

Written by Cursor Bugbot for commit 573d68f.

logger = logging.getLogger(__name__)


class MultiTurnMonitorRubric(MonitorRubric):
Member:

small nit -- what's the need for a distinct MultiTurnMonitorRubric class? All env classes currently extend MultiTurnEnv already, and build up the trajectory (even if single-turn)

Member:

Risks other extensions (e.g. SandboxMonitorRubric) not being multi-turn compatible by default, which seems undesirable.

Member Author:

hm, i'm not sure i understand fully, but the idea is that each env may create an associated monitor rubric, e.g.:

  • MultiTurnEnv has MultiTurnMonitorRubric
  • ToolEnv has ToolMonitorRubric
  • SandboxEnv has SandboxMonitorRubric
  • etc.

in this sense, MultiTurnMonitorRubric is just the monitor rubric class associated with the base multi-turn env and will get registered with it by default. since all envs extend MultiTurnEnv, all envs will have this monitor metric registered. i agree that for single-turn envs it might make sense to remove the turn count, as it's a bit silly there

command_execution_times: list[float]


class SandboxMonitorRubric(vf.MonitorRubric):
Member:

Is this serving a different role than MultiTurnMonitorRubric? is the idea to have many monitor rubrics which track new things introduced at different hierarchies?

Member Author:

yea, the sandbox monitor rubric would only implement monitor metrics specific to the sandbox env -- i.e. currently sandbox_ready_wait_time and sandbox_command_execution_time. if every env in the env hierarchy (e.g. for math-python we have PythonEnv(SandboxEnv(StatefulToolEnv(ToolEnv(MultiTurnEnv(...)))))) registers env-specific monitor metrics, any env that inherits from any of these gets all the logging for free
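
For illustration, the additive registration described here could look roughly like this (a hedged sketch with assumed constructor shapes, not the actual PR code):

```python
import verifiers as vf


# Placeholders standing in for the rubrics added by this PR.
class SandboxMonitorRubric(vf.MonitorRubric): ...
class PythonMonitorRubric(vf.MonitorRubric): ...


# Hypothetical sketch: each env level registers its own monitor rubric in
# __init__, so a subclass inherits every parent level's metrics automatically.
class SandboxEnv(vf.StatefulToolEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # parents register their own monitor rubrics
        self.add_rubric(SandboxMonitorRubric())


class PythonEnv(SandboxEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # inherits sandbox, tool, and turn metrics for free
        self.add_rubric(PythonMonitorRubric())
```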

@willccbb (Member):

Semi-related -- #645 adds an option to bypass scoring, which currently would skip all Rubric classes. Do we want to treat MonitorRubric instances as fundamentally separate in the case of a RubricGroup?

@mikasenghaas (Member Author) commented Jan 5, 2026

> Semi-related -- #645 adds an option to bypass scoring, which currently would skip all Rubric classes. Do we want to treat MonitorRubric instances as fundamentally separate in the case of a RubricGroup?

ah yea, we probably always wanna run the monitor rubrics, even if scoring is disabled. any ideas on what would be a good way of distinguishing those?

i feel like this could be a separate pr tho, bc we would likely have to distinguish subrewards/metrics at the state and rubric level, so it would be a bigger change than what this pr proposes.
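
One possible way to make that distinction (purely illustrative; the function and flag names below are hypothetical and not part of #645 or this PR):

```python
from typing import Sequence

import verifiers as vf


def rubrics_to_run(rubrics: Sequence[vf.Rubric], skip_scoring: bool) -> list[vf.Rubric]:
    """Select which rubrics to evaluate for a rollout.

    When scoring is bypassed, weighted reward rubrics are dropped while
    MonitorRubric instances (weight-0.0 metrics) are still run and logged.
    """
    if not skip_scoring:
        return list(rubrics)
    return [r for r in rubrics if isinstance(r, vf.MonitorRubric)]
```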

@mikasenghaas marked this pull request as ready for review January 5, 2026 17:14
@mikasenghaas requested a review from willccbb January 5, 2026 17:14
}
super().__init__(oai_tools=self.oai_tools, max_turns=max_turns, **kwargs)

self.add_rubric(ToolMonitorRubric(tools=self.tools))

Per-tool metrics not tracked for dynamically added tools

ToolMonitorRubric computes tool_names at initialization time from the tools list passed to it. However, ToolEnv.__init__ creates the rubric before subclasses like SandboxEnv and PythonEnv call add_tool() to register their tools (bash, python). Since tool_names is computed once at init as an empty list, per-tool metrics like python_calls and bash_calls are never registered, defeating the purpose of the {tool_name}_calls feature described in the PR.


Member Author:

ah, yea this is fair, but i think this wasn't supported previously with ToolRubric either? handling newly added tools would be easy to add; handling removed tools would be a bit harder bc we would have to remove a reward func from a rubric, which we currently don't have support for

Member:

ah we should probably fix this (can be separate PR), shouldn't be too hard

previously we weren't using tool_rubric that heavily, and it already forced you to pass the list of tools, so it wasn't too big of a problem to just have the added tool in that list as well
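
A rough sketch of the fix being discussed (hypothetical helper and state-field names; the real add_tool / ToolMonitorRubric interfaces may differ): when a tool is registered dynamically, the monitor rubric could register a matching per-tool metric at that point.

```python
import verifiers as vf


# Hypothetical sketch, not the PR's code: grow a new {tool_name}_calls metric
# whenever a tool is added after the rubric was constructed.
class ToolMonitorRubric(vf.MonitorRubric):
    def add_tool_metric(self, tool_name: str) -> None:
        def tool_calls(state, **kwargs) -> float:
            # Count recorded calls to this tool; the exact rollout-state
            # layout ("tool_calls" with "name" entries) is an assumption.
            calls = state.get("tool_calls", [])
            return float(sum(1 for c in calls if c.get("name") == tool_name))

        tool_calls.__name__ = f"{tool_name}_calls"  # metric is logged under this name
        self.add_reward_func(tool_calls, weight=0.0)
```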

@willccbb merged commit 6f74cb0 into main Jan 5, 2026
6 checks passed