Implement monitor rubrics #653
Conversation
verifiers/envs/multiturn_env.py
Outdated
```python
logger = logging.getLogger(__name__)


class MultiTurnMonitorRubric(MonitorRubric):
```
small nit -- what's the need for a distinct MultiTurnMonitorRubric class? All env classes currently extend MultiTurnEnv already, and build up the trajectory (even if single-turn)
Risks other extensions (e.g. SandboxMonitorRubric) not being multi-turn compatible by default, which seems undesirable.
hm, i'm not sure i understand fully, but the idea is that each env may create an associated monitor rubric, e.g.:
- `MultiTurnEnv` has `MultiTurnMonitorRubric`
- `ToolEnv` has `ToolMonitorRubric`
- `SandboxEnv` has `SandboxMonitorRubric`
- etc.

in this sense, `MultiTurnMonitorRubric` is just the associated monitor rubric class for the base multi-turn env and will get registered by default with it. since all envs extend multi turn env, all envs will have this monitor metric registered. i agree that for single-turn envs it might make sense to remove the turn count, as it's a bit silly there.
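Roughly, the registration pattern being described would look like the sketch below. The classes here are simplified stand-ins for illustration (the rubric stubs, the `Environment` base, and the constructor signatures are assumptions), not the PR's actual code:

```python
# Illustrative sketch only: simplified stand-ins, not the verifiers implementation.
class MultiTurnMonitorRubric:
    """Stub: would register the num_turns metric (weight 0.0)."""


class ToolMonitorRubric:
    """Stub: would register total_tool_calls and per-tool call counts."""

    def __init__(self, tools=None):
        self.tools = tools or []


class Environment:
    def __init__(self, **kwargs):
        self.rubrics = []

    def add_rubric(self, rubric) -> None:
        self.rubrics.append(rubric)


class MultiTurnEnv(Environment):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Registered by default, so every env extending MultiTurnEnv
        # gets the num_turns metric without extra wiring.
        self.add_rubric(MultiTurnMonitorRubric())


class ToolEnv(MultiTurnEnv):
    def __init__(self, tools=None, **kwargs):
        self.tools = tools or []
        super().__init__(**kwargs)
        # Layers tool-call metrics on top of the inherited num_turns metric.
        self.add_rubric(ToolMonitorRubric(tools=self.tools))
```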
verifiers/envs/sandbox_env.py
Outdated
```python
command_execution_times: list[float]


class SandboxMonitorRubric(vf.MonitorRubric):
```
Is this serving a different role than `MultiTurnMonitorRubric`? Is the idea to have many monitor rubrics, each tracking new things introduced at a different level of the hierarchy?
yea, the sandbox monitor rubric would only implement monitor metrics specific to the sandbox env -- i.e. currently `sandbox_ready_wait_time` and `sandbox_command_execution_time`. if every env in the env hierarchy (e.g. for math-python we have `PythonEnv(SandboxEnv(StatefulToolEnv(ToolEnv(MultiTurnEnv(...)))))`) registers env-specific monitor metrics, any env that inherits from any of these gets all the logging for free
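A minimal sketch of what such a sandbox-level rubric could look like, assuming the state keys match the diff above (`ready_wait_time`, `command_execution_times`) and that metrics are registered as weight-0.0 reward functions; the `add_reward_func` hook and the metric signatures are assumptions, and the sketch subclasses `vf.Rubric` directly rather than the PR's `MonitorRubric` just to stay standalone:

```python
import verifiers as vf


class SandboxMonitorRubric(vf.Rubric):
    """Sketch: sandbox-specific metrics read from rollout state."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Assumes add_reward_func(func, weight=...) is the registration hook;
        # weight 0.0 means the value is logged but never affects the reward.
        self.add_reward_func(self.sandbox_ready_wait_time, weight=0.0)
        self.add_reward_func(self.sandbox_command_execution_time, weight=0.0)

    def sandbox_ready_wait_time(self, state, **kwargs) -> float:
        # Time spent waiting for the sandbox to become ready (recorded by the env).
        return float(state.get("ready_wait_time", 0.0))

    def sandbox_command_execution_time(self, state, **kwargs) -> float:
        # Average per-command execution time over the rollout.
        times = state.get("command_execution_times", [])
        return sum(times) / len(times) if times else 0.0
```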
Semi-related -- #645 adds an option to bypass scoring, which currently would skip all Rubric classes. Do we want to treat MonitorRubric instances as fundamentally separate in the case of a RubricGroup?
ah yea, we probably always wanna run the monitor rubrics, even if scoring is disabled. any ideas on what would be a good way of distinguishing those? i feel like this could be a separate pr tho, bc it would likely have to distinguish subrewards/metrics at the state and rubric level, so it would be a bigger change than what this pr proposes.
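One possible shape for that distinction, purely as a sketch (the `skip_scoring` flag and filtering by `isinstance` are assumptions, not what #645 or this PR implements):

```python
# Sketch only: pick which rubrics still run when scoring is bypassed.
# MonitorRubric is assumed to be importable from verifiers once this PR lands.
from verifiers import MonitorRubric, Rubric


def rubrics_to_run(rubrics: list[Rubric], skip_scoring: bool) -> list[Rubric]:
    if not skip_scoring:
        return rubrics
    # Monitor rubrics only emit zero-weight metrics, so running them can
    # never change the reward -- keep them even when scoring is disabled.
    return [r for r in rubrics if isinstance(r, MonitorRubric)]
```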
```python
}
super().__init__(oai_tools=self.oai_tools, max_turns=max_turns, **kwargs)

self.add_rubric(ToolMonitorRubric(tools=self.tools))
```
Per-tool metrics not tracked for dynamically added tools
ToolMonitorRubric computes tool_names at initialization time from the tools list passed to it. However, ToolEnv.__init__ creates the rubric before subclasses like SandboxEnv and PythonEnv call add_tool() to register their tools (bash, python). Since tool_names is computed once at init as an empty list, per-tool metrics like python_calls and bash_calls are never registered, defeating the purpose of the {tool_name}_calls feature described in the PR.
ah, yea this is fair, but i think this wasn't supported previously with the ToolRubric? adding would be easy to support; removing would be a bit harder bc we would have to remove a reward func from a rubric, which we currently don't have support for
ah we should probably fix this (can be separate PR), shouldn't be too hard
previously we weren't using tool_rubric that heavily, and it already forced you to pass the list of tools, so it wasn't too big of a problem to just have the added tool in that list as well
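A hypothetical sketch of the follow-up fix discussed above: have `ToolEnv.add_tool` also register the per-tool metric on the already-attached monitor rubric. Everything here (the `tool_monitor_rubric` attribute and the `add_tool_metric` helper) is illustrative, not the current `ToolEnv`/`ToolMonitorRubric` API:

```python
# Hypothetical follow-up, not what this PR implements: a ToolEnv.add_tool that
# keeps the monitor rubric in sync, assuming ToolMonitorRubric grows an
# add_tool_metric helper to register {tool_name}_calls after init.
def add_tool(self, tool) -> None:
    self.tools.append(tool)
    # Register the per-tool metric for tools added after __init__, so e.g.
    # SandboxEnv's bash and PythonEnv's python are also tracked.
    self.tool_monitor_rubric.add_tool_metric(tool.__name__)
```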
Description
This PR implements the idea of `MonitorRubric`, a simple extension of `Rubric` which only allows registering metrics, i.e. reward functions with weight `0.0`, with simple helpers to build up metric logging across environment hierarchies. The idea is that monitor rubrics are typically tied to specific environments (e.g. `ToolEnv` has `ToolMonitorRubric`, `SandboxEnv` has `SandboxMonitorRubric`), and such monitor rubrics are added in an additive fashion, so that metrics across the whole inheritance tree are respected and logged.

For example, in the `math_python` environment, we have the custom `MathRubric` registered as the base rubric to determine the main reward. However, we may also want to emit metrics for observability. In the `math-python` example, the inheritance tree is `PythonEnv(SandboxEnv(StatefulToolEnv(ToolEnv(MultiTurnEnv(...)))))`. At each level of this hierarchy, different metrics should be emitted, e.g.:

- `MultiTurnEnv`: `num_turns`
- `ToolEnv`: `total_tool_calls`, `{tool_name}_calls` (for each tool)
- `SandboxEnv`: `sandbox_ready_wait_time`, `sandbox_command_execution_time`
- `PythonEnv`: `python_ready_wait_time`

This is implemented with a monitor rubric for each env. This means that any future env inheriting from envs with a monitor rubric will get the logging of the parent envs for free, making it a very general pattern to help observability during env execution.
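As a rough illustration of the core abstraction, a minimal sketch is below. The `Rubric` class here is a simplified stand-in (the real `verifiers` Rubric API may differ), and the `add_metric` helper and `num_turns` signature are assumptions:

```python
from typing import Callable


class Rubric:  # simplified stand-in for verifiers.Rubric
    def __init__(self) -> None:
        self.funcs: list[Callable] = []
        self.weights: list[float] = []

    def add_reward_func(self, func: Callable, weight: float = 1.0) -> None:
        self.funcs.append(func)
        self.weights.append(weight)


class MonitorRubric(Rubric):
    """Only registers metrics: reward functions fixed at weight 0.0."""

    def add_metric(self, func: Callable) -> None:
        # Logged like any reward function, but never contributes to the
        # aggregate reward because the weight is pinned to 0.0.
        self.add_reward_func(func, weight=0.0)


class MultiTurnMonitorRubric(MonitorRubric):
    def __init__(self) -> None:
        super().__init__()
        self.add_metric(self.num_turns)

    def num_turns(self, completion: list[dict], **kwargs) -> float:
        # Count assistant messages as the number of model turns.
        return float(sum(1 for m in completion if m.get("role") == "assistant"))
```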
Example
We see the following registered monitors from the respective environments:
Type of Change
Testing
`uv run pytest` locally.

Checklist
Additional Notes
Note
Adds lightweight, additive monitoring metrics across the environment hierarchy and unifies rubric composition.
- `Environment.add_rubric` to compose rubrics (wraps into `RubricGroup` when needed)
- `MultiTurnMonitorRubric` → `num_turns`
- `ToolMonitorRubric` (in `ToolEnv`) → `total_tool_calls`, `{tool}_calls`
- `SandboxMonitorRubric` → `sandbox_ready_wait_time`, `sandbox_command_execution_time` (avg), with state tracking of `ready_wait_time` and per-command timings
- `PythonMonitorRubric` → `python_ready_wait_time`, with recorded worker `ready_wait_time`
- `SandboxEnv`/`PythonEnv` to record and log readiness/command timings; propagates state fields accordingly
- `math_python.load_environment` to return `PythonEnv` directly (no explicit tool rubric wiring)
- `ToolRubric` and its exports; updates tests to include `num_turns` in `EnvGroupRubric` metrics

Written by Cursor Bugbot for commit 573d68f.
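For reference, the `add_rubric` composition mentioned in the first bullet could look roughly like the sketch below. The classes are simplified stand-ins and the wrapping logic is an assumption; the actual `Environment`/`RubricGroup` code in the PR may differ:

```python
class RubricGroup:  # simplified stand-in for verifiers.RubricGroup
    def __init__(self, rubrics):
        self.rubrics = rubrics


class Environment:  # simplified stand-in for verifiers.Environment
    def __init__(self, rubric=None, **kwargs):
        self.rubric = rubric

    def add_rubric(self, rubric) -> None:
        if self.rubric is None:
            self.rubric = rubric
        elif isinstance(self.rubric, RubricGroup):
            # Already a group: just append the new rubric.
            self.rubric.rubrics.append(rubric)
        else:
            # Wrap the existing rubric and the new one into a group.
            self.rubric = RubricGroup(rubrics=[self.rubric, rubric])
```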