[OPIK-3211] [BE] [FE] Add span-level Python metrics scorer #4297

thiagohora · 2025-12-01T15:50:35Z

Details

This PR implements span-level Python metrics scorer functionality, allowing users to evaluate individual spans within traces using custom Python code. This extends the existing trace-level and thread-level Python scoring capabilities to spans.

Backend Changes

Core Implementation:

Added OnlineScoringSpanUserDefinedMetricPythonScorer service that:
- Consumes SpanToScoreUserDefinedMetricPython messages from Redis streams
- Prepares span data (input, output, metadata) for the Python evaluator
- Executes Python code via PythonEvaluatorService
- Stores span-level feedback scores using FeedbackScoreService.scoreBatchOfSpans()
- Includes comprehensive user-facing logging for debugging

Infrastructure Updates:

Extended OnlineScoringEngine.toReplacements() to support span context (input, output, metadata field bindings)
Fixed a bug in OnlineScoringSpanSampler: it previously fetched only SPAN_LLM_AS_JUDGE evaluators and now fetches both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON evaluators
Added Redis stream configuration for SPAN_USER_DEFINED_METRIC_PYTHON in config.yml and config-test.yml
Added service toggle spanUserDefinedMetricPythonEnabled to control feature availability
Updated ManualEvaluationService to respect the new toggle when enqueueing span Python evaluations

Testing:

Added comprehensive unit tests for OnlineScoringSpanUserDefinedMetricPythonScorer
Updated OnlineScoringSpanSamplerTest to verify both evaluator types are fetched
Added CRUD integration tests for AutomationRuleEvaluatorSpanUserDefinedMetricPython in AutomationRuleEvaluatorsResourceTest

Frontend Changes

Type System:

Added span_python_code = "span_user_defined_metric_python" to EVALUATORS_RULE_TYPE enum
Created PythonCodeDetailsSpanForm interface for span-specific Python code form data
Updated PythonCodeObject and PythonCodeDetails types to include span form

Schema Validation:

Added PythonCodeDetailsSpanFormSchema with argument validation (input/output/metadata prefixes)
Added PythonCodeSpanEvaluationRuleFormSchema for span Python code rules
Updated EvaluationRuleFormSchema discriminated union to include span Python code
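
As a rough illustration of the argument validation described above, the sketch below checks that every argument path starts from one of the span roots (input, output, metadata). It is plain TypeScript rather than the schema library used in the PR, and the names `SPAN_ARGUMENT_ROOTS`, `isValidSpanArgumentPath`, and `invalidSpanArguments` are hypothetical.

```typescript
// Hypothetical sketch: not the PR's PythonCodeDetailsSpanFormSchema.
// Roots a span argument path may start from, per the PR description.
const SPAN_ARGUMENT_ROOTS = ["input", "output", "metadata"] as const;

// A path is accepted when it is exactly a root, or a root followed by
// a "." (nested key) or "[" (array index).
function isValidSpanArgumentPath(path: string): boolean {
  const trimmed = path.trim();
  return SPAN_ARGUMENT_ROOTS.some(
    (root) =>
      trimmed === root ||
      trimmed.startsWith(`${root}.`) ||
      trimmed.startsWith(`${root}[`),
  );
}

// Returns the names of the arguments whose path is not rooted correctly.
function invalidSpanArguments(args: Record<string, string>): string[] {
  return Object.entries(args)
    .filter(([, path]) => !isValidSpanArgumentPath(path))
    .map(([name]) => name);
}
```

A form-level check could reject the rule whenever `invalidSpanArguments` returns a non-empty list and report the offending argument names to the user.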

UI Components:

Updated AddEditRuleDialog to:
- Show "Span" scope option when either span LLM or span Python code is enabled
- Display "Code metric" toggle for span scope when feature is enabled
- Handle span Python code rule creation/editing
- Use default Python code template for span scope
Updated RunEvaluationDialog to filter and display span Python code rules for span entity type
Updated PythonCodeRuleDetails to handle arguments for span scope (similar to trace scope)

Feature Toggle:

Added SPAN_USER_DEFINED_METRIC_PYTHON_ENABLED feature toggle key
Integrated toggle into FeatureTogglesProvider with default disabled state

Constants:

Added DEFAULT_PYTHON_CODE_SPAN_DATA with default Python code template for spans

Helper Functions:

Updated getUIRuleType() to map span_python_code to python_code UI type
Updated getUIRuleScope() to map span_python_code to span scope
Updated getBackendRuleType() to support span Python code (was previously disabled)
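
The mapping these helpers perform can be pictured with the sketch below. Only `span_python_code = "span_user_defined_metric_python"` and the `span_llm_as_judge` type name come from this PR; the other enum members, their string values, the `UIRuleType`/`UIRuleScope` unions, and the function signatures are assumptions made for illustration.

```typescript
// Sketch only: enum members other than span_python_code are placeholders.
enum EVALUATORS_RULE_TYPE {
  llm_judge = "llm_as_judge", // assumed value
  python_code = "user_defined_metric_python", // assumed value
  span_llm_judge = "span_llm_as_judge",
  span_python_code = "span_user_defined_metric_python",
}

type UIRuleType = "llm_judge" | "python_code"; // assumed union
type UIRuleScope = "trace" | "span"; // assumed union

const UI_TYPE_BY_RULE: Record<EVALUATORS_RULE_TYPE, UIRuleType> = {
  [EVALUATORS_RULE_TYPE.llm_judge]: "llm_judge",
  [EVALUATORS_RULE_TYPE.python_code]: "python_code",
  [EVALUATORS_RULE_TYPE.span_llm_judge]: "llm_judge",
  [EVALUATORS_RULE_TYPE.span_python_code]: "python_code",
};

const UI_SCOPE_BY_RULE: Record<EVALUATORS_RULE_TYPE, UIRuleScope> = {
  [EVALUATORS_RULE_TYPE.llm_judge]: "trace",
  [EVALUATORS_RULE_TYPE.python_code]: "trace",
  [EVALUATORS_RULE_TYPE.span_llm_judge]: "span",
  [EVALUATORS_RULE_TYPE.span_python_code]: "span",
};

function getUIRuleType(type: EVALUATORS_RULE_TYPE): UIRuleType {
  return UI_TYPE_BY_RULE[type];
}

function getUIRuleScope(type: EVALUATORS_RULE_TYPE): UIRuleScope {
  return UI_SCOPE_BY_RULE[type];
}

// Inverse lookup: find the backend type matching a UI scope and UI type.
function getBackendRuleType(
  scope: UIRuleScope,
  uiType: UIRuleType,
): EVALUATORS_RULE_TYPE {
  const match = (Object.values(EVALUATORS_RULE_TYPE) as EVALUATORS_RULE_TYPE[]).find(
    (t) => UI_SCOPE_BY_RULE[t] === scope && UI_TYPE_BY_RULE[t] === uiType,
  );
  if (match === undefined) {
    throw new Error(`No backend rule type for ${scope}/${uiType}`);
  }
  return match;
}
```

Keeping both lookup tables typed as `Record<EVALUATORS_RULE_TYPE, …>` means the compiler flags any newly added rule type that lacks a UI mapping.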

Change checklist

User facing
Documentation update

Issues

Resolves #
OPIK-3211

Testing

Backend Tests:

✅ Unit tests for OnlineScoringSpanUserDefinedMetricPythonScorer:
- Verifies span scoring and result storage
- Tests error handling for Python evaluation failures
- Tests error handling for score storage failures
- Validates proper log context propagation
✅ Integration tests for OnlineScoringSpanSampler:
- Verifies both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON evaluators are fetched
✅ CRUD integration tests for span Python code rules:
- Create, read, update, delete operations
- Rule filtering and search functionality

Frontend Tests:

✅ TypeScript compilation passes
✅ Build succeeds
✅ Linting passes (minor prettier warning on unrelated file, non-blocking)

Manual Testing Scenarios:

Create a span-level Python code rule via UI
Verify rule appears in automation rules list
Trigger manual evaluation on spans using the rule
Verify scores are stored at span level
Verify scores appear in span details view

Documentation

Feature toggle documentation: spanUserDefinedMetricPythonEnabled toggle controls availability
API: Span Python code rules follow same pattern as trace/thread Python code rules
UI: Span scope option appears when feature toggle is enabled

… view
- Add span_feedback_scores field to Trace API model
- Add span score aggregation logic to TraceDAO query
- Update TraceEnrichmentMapper to map aggregated span scores
- Add tests for span score aggregation
- Update testcontainers version to 2.0.2
- Fix getValue method to handle missing columns gracefully

[OPIK-3208] [FE] Display aggregated span feedback scores in trace detail view
- Add span_feedback_scores to Trace type definition
- Display aggregated span feedback scores in TraceDataViewer
- Add span feedback score chips in trace tree view
- Update TreeDetailsStore to support span_feedback_scores
- Remove individual span scores drill-down section


- Extract helper functions for value parsing and display logic in ValueCell
- Consolidate child row creation logic into reusable createChildRow function
- Export extractAuthorName helper and use it consistently in AuthorCell
- Simplify formatParentRowWithCounts by merging categorical/non-categorical logic
- Add spanId field to ValueEntry API model for proper span identification

This refactoring reduces code duplication by ~60 lines and improves maintainability without changing functionality.

- Create TypeCell component to display span type with icon and label
- Component hides type for parent/aggregated rows
- Uses BaseTraceDataTypeIcon for consistent iconography
- Part of span feedback scores feature implementation

Backend changes:
- Add span feedback scores aggregation in TraceDAO using CTEs
- Include span_id, span_type, and category_name in value_by_author map
- Use composite keys (author_spanId) to preserve individual span scores
- Filter out '<no reason>' from aggregated reasons
- Add spanId field to ValueEntry API model

Frontend changes:
- Display aggregated span feedback scores in trace detail view
- Add hierarchical view with parent/child rows for multiple spans
- Add Type column to show span type (LLM, Tool, General, Guardrail)
- Support deletion of individual span scores using span_id
- Update table titles to 'Trace scores' and 'Span scores' (Sentence case)
- Fix author name display to handle composite keys correctly
- Update value formatting to show counts and averages for parent rows
- Remove span feedback scores tags from trace header and tree view
- Update tooltips to include 'span' suffix (e.g., 'LLM span')
- Fix 'All scores' summary to display actual values
- Handle empty reasons gracefully (don't show '<no reason>')

This feature allows users to see aggregated feedback scores from multiple spans at the trace level while maintaining the ability to view and manage individual span scores.
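
To make the composite-key idea concrete, here is a minimal sketch of building an `author_spanId` key and recovering the author name for display. The function names and signatures are hypothetical and are not the `extractAuthorName` helper from the PR; the sketch assumes the span ID is available alongside the key, so author names that themselves contain underscores are not truncated.

```typescript
// Hypothetical helpers illustrating the author_spanId composite key.

// Build the key used to keep scores from different spans apart.
function compositeAuthorKey(author: string, spanId: string): string {
  return `${author}_${spanId}`;
}

// Recover the author name for display. Without a span ID the key is
// assumed to be a plain author name and is returned unchanged.
function authorFromCompositeKey(key: string, spanId?: string): string {
  if (!spanId) return key;
  const suffix = `_${spanId}`;
  return key.endsWith(suffix) ? key.slice(0, -suffix.length) : key;
}
```

Stripping a known suffix, rather than splitting on the last underscore, avoids guessing where the author name ends.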

- Change sidebar title to 'Trace feedback scores' / 'Span feedback scores'
- Change 'All scores' section to show 'Trace scores' / 'Span scores' (remove 'All scores' prefix)
- Filter scores correctly: only show trace scores when viewing trace, only span scores when viewing span
- Update 'Your scores' header to include entity type (Trace/Span/Thread scores)
- Update Thread annotations to match the same pattern

- Replace 'span-detail' entityType with isAggregatedSpanScores prop for cleaner logic
- Extract helper functions to reduce code duplication (getStorageKeyType, getConfigurableColumnsWithoutType)
- Fix all linting errors and remove unused imports/variables
- Improve code organization and maintainability


- Added THREAD_ANNOTATION_QUEUE_IDS_ANALYTICS_DB constant for thread annotation queue IDs (ttaqi alias)
- Updated TRACE_THREAD_FIELDS_MAP to use ttaqi.annotation_queue_ids for TraceThreadField.ANNOTATION_QUEUE_IDS
- Updated SELECT_COUNT_TRACES_THREADS_BY_PROJECT_IDS to use ttaqi alias to match SELECT_TRACES_THREADS_BY_PROJECT_IDS
- Fixes FindTraceThreads.whenFilterByAnnotationQueueId__thenReturnThreadsWithMatchingTags test failure

- Add AutomationRuleEvaluatorSpanLlmAsJudge and related models
- Add SpansCreated event and publish it from SpanService
- Add OnlineScoringSpanSampler to sample spans for scoring
- Add OnlineScoringSpanLlmAsJudgeScorer to score spans using LLM
- Add SpanFilterEvaluationService for evaluating span filters
- Refactor FilterEvaluationServiceBase to reduce duplication
- Update AutomationRuleEvaluator to use List<Filter> instead of List<TraceFilter>
- Add AutomationRuleEvaluatorFiltersDeserializer to handle polymorphic filter deserialization
- Add comprehensive tests for span filter evaluation and deserializer
- Add service toggle for span LLM as Judge feature
- Update OnlineScoringEngine to support span scoring
- Add migration to extend automation_rule_evaluators type enum

- Add span scope option to rule creation UI
- Implement span filter builder with all span fields including duration, usage, cost, errors
- Add span field-binding UI with autocomplete for input/output/metadata paths
- Support is_empty and is_not_empty operators for dictionary filters in rule context
- Add comprehensive tests for IS_EMPTY and IS_NOT_EMPTY operators on span filters
- Fix OUTPUT_JSON field extraction to properly handle nested keys
- Ensure only custom LLM-as-judge template is available for span scope
- Add feature toggle support for span LLM-as-judge functionality

- Restore migration 000037_add_scope_to_automation_rules.sql from main branch (it adds span_llm_as_judge evaluator type)
- Update migration 000038 to only add span_user_defined_metric_python type (span_llm_as_judge is already added by 000037)

…oringSpanSampler
- Check both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON toggles before fetching evaluators
- Prevents unnecessary database queries when features are disabled
- Update tests to explicitly verify toggle behavior for both evaluator types
- Add test to verify Python evaluators are not fetched when toggle is disabled
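
The toggle check described in this commit can be sketched as follows. The real sampler is a Java backend class; this TypeScript version only illustrates the decision logic. `spanUserDefinedMetricPythonEnabled` is the toggle named in this PR, while `spanLlmAsJudgeEnabled`, the `SpanToggles` shape, and the function name are assumptions.

```typescript
// Illustrative sketch of toggle-gated evaluator selection (not the Java code).
type SpanEvaluatorType =
  | "span_llm_as_judge"
  | "span_user_defined_metric_python";

interface SpanToggles {
  spanLlmAsJudgeEnabled: boolean; // assumed toggle name
  spanUserDefinedMetricPythonEnabled: boolean; // toggle added in this PR
}

// Collect only the evaluator types whose feature is switched on. An empty
// result means the caller can skip the evaluator query entirely.
function enabledSpanEvaluatorTypes(toggles: SpanToggles): SpanEvaluatorType[] {
  const types: SpanEvaluatorType[] = [];
  if (toggles.spanLlmAsJudgeEnabled) {
    types.push("span_llm_as_judge");
  }
  if (toggles.spanUserDefinedMetricPythonEnabled) {
    types.push("span_user_defined_metric_python");
  }
  return types;
}
```

Returning the list of enabled types, instead of a single boolean, lets the caller fetch exactly the evaluators it may run and avoid the query when the list is empty.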

- Remove input_json/output_json normalization logic from frontend
- Backend now stores filters in same format as sent by frontend
- Add CUSTOM field handling in filter evaluation services
- Convert CUSTOM fields only when evaluating filters, not when storing
- Add comprehensive tests for CUSTOM field evaluation

- Update extractNestedValue to handle JSON paths with array indices
- Support both bracket notation (messages[0].content) and dot notation (messages.0.content)
- Add tests for array index path navigation
- Fixes issue where filters like 'input.messages[0].content' were not working
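
One way to support both notations is to normalize bracket indices to dot segments before walking the object, as in the sketch below. This is an illustrative TypeScript version, not the PR's extractNestedValue implementation; the name `extractPath` and its behaviour on missing keys (returning `undefined`) are assumptions.

```typescript
// Illustrative path walker supporting messages[0].content and messages.0.content.
function extractPath(root: unknown, path: string): unknown {
  // Normalize bracket notation to dot notation: a[0].b -> a.0.b
  const segments = path
    .replace(/\[(\d+)\]/g, ".$1")
    .split(".")
    .filter((segment) => segment.length > 0);

  let current: unknown = root;
  for (const segment of segments) {
    // Stop as soon as there is nothing left to descend into.
    if (current === null || typeof current !== "object") return undefined;
    if (Array.isArray(current)) {
      // Arrays only accept numeric segments.
      if (!/^\d+$/.test(segment)) return undefined;
      current = current[Number(segment)];
    } else {
      current = (current as Record<string, unknown>)[segment];
    }
  }
  return current;
}
```

For example, given `{ messages: [{ content: "hi" }] }`, both `messages[0].content` and `messages.0.content` resolve to the same value, and an out-of-range index yields `undefined`.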


Copilot

Pull request overview

This PR implements span-level Python metrics scoring functionality, extending the existing trace and thread-level Python evaluator capabilities to spans. The implementation adds backend services for consuming span Python evaluation messages from Redis streams, executing Python code, and storing span-level feedback scores. The frontend adds UI support for creating and managing span Python code rules through the automation rules interface, controlled by a new feature toggle.

Key changes:

Backend: New scorer service, message types, and DAO/mapper support for span Python evaluators
Frontend: UI components, schemas, and type definitions for span Python code rules
Infrastructure: Redis stream configuration and feature toggle for span Python evaluators

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `OnlineScoringSpanUserDefinedMetricPythonScorer.java` | Core scorer service that consumes Redis messages and executes Python evaluations for spans |
| `OnlineScoringSpanSampler.java` | Updated to fetch and process both span LLM and span Python evaluators |
| `ManualEvaluationService.java` | Added span evaluation method and rule type validation for manual evaluations |
| `AutomationRuleEvaluatorService.java` | Added CRUD operations for span Python evaluator rules |
| `AddEditRuleDialog.tsx` | Updated UI to support creating/editing span Python code rules |
| `schema.ts` | Added validation schemas for span Python code rule forms |
| `config.yml` / `config-test.yml` | Added Redis stream configuration for span Python evaluator |
| `AutomationRuleEvaluatorSpanUserDefinedMetricPython.java` | New API model for span Python evaluator rules |
| `SpanToScoreUserDefinedMetricPython.java` | New event message type for span Python evaluations |
| Migration SQL | Database migration to add span_user_defined_metric_python enum value |

Comments suppressed due to low confidence (1)

apps/opik-frontend/src/components/pages-shared/automations/AddEditRuleDialog/AddEditRuleDialog.tsx:1

Duplicate key in object literal. Line 138 sets the span scope to DEFAULT_PYTHON_CODE_SPAN_DATA, and line 139 duplicates this assignment. The second assignment overwrites the first, so the assignment on line 138 has no effect.

import React, { useCallback, useEffect } from "react";

apps/opik-backend/src/main/java/com/comet/opik/domain/evaluators/ManualEvaluationService.java

...m/comet/opik/api/resources/v1/events/OnlineScoringSpanUserDefinedMetricPythonScorerTest.java

- Rename misleading variable spanLevelLlmAsJudgeRules to traceLevelLlmAsJudgeRules in ManualEvaluationService
- Fix incorrect getStream() method signature in OnlineScoringSpanUserDefinedMetricPythonScorerTest
- Remove duplicate key in AddEditRuleDialog DEFAULT_PYTHON_CODE_DATA object

github-actions · 2025-12-02T15:49:25Z

🔄 Test environment deployment started

Building images for PR #4297...

You can monitor the build progress here.

CometActions · 2025-12-02T16:01:04Z

✅ Test environment is now available!

Access Information

URL: https://pr-4297.dev.comet.com
Cluster: comet-ml-development
Namespace: pr-4297
Version: 1.9.37-4297-merge-621
Application logs: View in Grafana

The deployment has completed successfully and the version has been verified.

andriidudar

In general, the code looks good. I left one comment related to the feature flag: it looks like it may not work properly in some combinations. Once that’s fixed, we should be good to go.

andriidudar · 2025-12-04T09:28:07Z

...pik-frontend/src/components/pages-shared/automations/AddEditRuleDialog/AddEditRuleDialog.tsx

                              </ToggleGroupItem>
-                              {isCodeMetricEnabled && !isSpanScope ? (
+                              {(isCodeMetricEnabled && !isSpanScope) ||
+                              (isSpanScope && isSpanPythonCodeEnabled) ? (


It looks like this check isn’t quite accurate right now: when isSpanScope and isSpanPythonCodeEnabled are both true, the code block still renders even if isCodeMetricEnabled = false. Could we tighten the condition so the block only shows when the flag is enabled?
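
One possible reading of this request is sketched below: keep `isCodeMetricEnabled` as a hard prerequisite and apply the span-specific flag only inside span scope. The three flag names come from the diff; the `CodeMetricFlags` interface and both function names are hypothetical. Whether the span toggle should depend on the general code-metric flag is a product decision the sketch does not settle.

```typescript
// Sketch comparing the condition in the diff with a tightened variant.
interface CodeMetricFlags {
  isCodeMetricEnabled: boolean;
  isSpanScope: boolean;
  isSpanPythonCodeEnabled: boolean;
}

// Condition as written in the diff under review.
function showCodeMetricCurrent(f: CodeMetricFlags): boolean {
  return (
    (f.isCodeMetricEnabled && !f.isSpanScope) ||
    (f.isSpanScope && f.isSpanPythonCodeEnabled)
  );
}

// Tightened variant: never render when the general flag is off; in span
// scope, additionally require the span-specific flag.
function showCodeMetricTightened(f: CodeMetricFlags): boolean {
  return f.isCodeMetricEnabled && (!f.isSpanScope || f.isSpanPythonCodeEnabled);
}
```

The two functions differ only for the combination the reviewer points out: general flag off, span scope selected, span flag on.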

thiagohora added 30 commits November 24, 2025 12:59

[OPIK-3208] [FE] Fix feedback scores title for span view

9bda221

- Show 'Trace Feedback Scores' only when viewing a trace - Show 'Feedback Scores' when viewing a span

[OPIK-3208] [FE] Disable deletion for span feedback scores at trace l…

0b72157

…evel - Make onDeleteFeedbackScore optional in FeedbackScoreTable - Hide delete actions column when deletion is disabled - Prevent deletion of aggregated span scores shown at trace level

[OPIK-3209] Add spans feedback scores filters

63b5249

Fix remaining issues

cc3bb5d

Merge branch 'main' into thiaghora/OPIK-3209-adjust-trace-table-for-s…

09cb1a9

…pan-scores-and-filters

[OPIK-3209] Address PR review comments: add type guard, fix method sp…

fe2a8dd

…elling, add JavaDoc

Merge branch 'thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-…

be7e8fb

…and-filters' of https://github.com/comet-ml/opik into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters

[OPIK-3209] [BE] Add span feedback scores statistics to trace table h…

0844419

…eaders

[OPIK-3209] [BE] Fix StatsUtils to only add span feedback scores stat…

8b7ec98

…s when values exist

[OPIK-3209] Add tests for span feedback scores statistics

71ebb16

Merge branch 'main' into thiaghora/OPIK-3208-adjust-trace-detail-view…

6d47971

…-to-show-span-scores

Merge branch 'thiaghora/OPIK-3208-adjust-trace-detail-view-to-show-sp…

583d8fd

…an-scores' into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters

Merge branch 'main' into thiaghora/OPIK-3208-adjust-trace-detail-view…

0e54d3b

…-to-show-span-scores

Merge branch 'thiaghora/OPIK-3208-adjust-trace-detail-view-to-show-sp…

b2226e9

…an-scores' into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters

Fix issues

8b93bb3

[OPIK-3209] [FE] Fix linting errors

d4608f5

Fix tests

2b81df0

Fix tests

8d26deb

Fix tests

128f6e3

Merge branch 'thiaghora/OPIK-3210-be' into thiaghora/OPIK-3210-fe

6d48485

Revision: Fix prettier formatting issues

74dd74d

github-actions bot added the typescript *.ts *.tsx label Dec 2, 2025

thiagohora changed the title ~~Thiago/opik 3211 span level python scorer~~ [OPIK-3211] [BE] [FE] Add span-level Python metrics scorer Dec 2, 2025

comet-ml deleted a comment from github-actions bot Dec 2, 2025

thiagohora added 7 commits December 2, 2025 10:21

Merge branch 'thiaghora/OPIK-3210-fe' into thiago/OPIK-3211-span-leve…

d2ae482

…l-python-scorer

[OPIK-3210] Fix frontend linting errors in schema.ts

46de8bd

Merge branch 'thiago/OPIK-3211-span-level-python-scorer' of https://g…

f56fe8d

…ithub.com/comet-ml/opik into thiago/OPIK-3211-span-level-python-scorer

Base automatically changed from thiaghora/OPIK-3210-fe to main December 2, 2025 14:04

Merge branch 'main' into thiago/OPIK-3211-span-level-python-scorer

a67d331

thiagohora marked this pull request as ready for review December 2, 2025 14:10

thiagohora requested a review from a team as a code owner December 2, 2025 14:10

Copilot AI review requested due to automatic review settings December 2, 2025 14:10

Copilot AI reviewed Dec 2, 2025

apps/opik-backend/src/main/java/com/comet/opik/domain/evaluators/ManualEvaluationService.java (outdated, resolved)

...m/comet/opik/api/resources/v1/events/OnlineScoringSpanUserDefinedMetricPythonScorerTest.java (outdated, resolved)

thiagohora added 2 commits December 2, 2025 15:39

Fix it

9ee3755

thiagohora added the test-environment Deploy Opik adhoc environment label Dec 2, 2025

thiagohora and others added 3 commits December 2, 2025 17:21

Merge branch 'main' into thiago/OPIK-3211-span-level-python-scorer

c9a7ac1

Merge branch 'main' into thiago/OPIK-3211-span-level-python-scorer

cd4e5da

Merge branch 'main' into thiago/OPIK-3211-span-level-python-scorer

3dc8728

github-actions bot assigned BorisTkachenko Dec 3, 2025

andriidudar reviewed Dec 4, 2025

5 participants
