@thiagohora thiagohora commented Dec 1, 2025

Details

This PR implements span-level Python metrics scorer functionality, allowing users to evaluate individual spans within traces using custom Python code. This extends the existing trace-level and thread-level Python scoring capabilities to spans.

Backend Changes

Core Implementation:

  • Added OnlineScoringSpanUserDefinedMetricPythonScorer service that:
    • Consumes SpanToScoreUserDefinedMetricPython messages from Redis streams
    • Prepares span data (input, output, metadata) for Python evaluator
    • Executes Python code via PythonEvaluatorService
    • Stores span-level feedback scores using FeedbackScoreService.scoreBatchOfSpans()
    • Includes comprehensive user-facing logging for debugging

Infrastructure Updates:

  • Extended OnlineScoringEngine.toReplacements() to support span context (input, output, metadata field bindings)
  • Fixed a bug in OnlineScoringSpanSampler where it fetched only SPAN_LLM_AS_JUDGE evaluators; it now fetches both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON evaluators
  • Added Redis stream configuration for SPAN_USER_DEFINED_METRIC_PYTHON in config.yml and config-test.yml
  • Added service toggle spanUserDefinedMetricPythonEnabled to control feature availability
  • Updated ManualEvaluationService to respect the new toggle when enqueueing span Python evaluations

Testing:

  • Added comprehensive unit tests for OnlineScoringSpanUserDefinedMetricPythonScorer
  • Updated OnlineScoringSpanSamplerTest to verify both evaluator types are fetched
  • Added CRUD integration tests for AutomationRuleEvaluatorSpanUserDefinedMetricPython in AutomationRuleEvaluatorsResourceTest

Frontend Changes

Type System:

  • Added span_python_code = "span_user_defined_metric_python" to EVALUATORS_RULE_TYPE enum
  • Created PythonCodeDetailsSpanForm interface for span-specific Python code form data
  • Updated PythonCodeObject and PythonCodeDetails types to include span form

Schema Validation:

  • Added PythonCodeDetailsSpanFormSchema with argument validation (input/output/metadata prefixes)
  • Added PythonCodeSpanEvaluationRuleFormSchema for span Python code rules
  • Updated EvaluationRuleFormSchema discriminated union to include span Python code
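
As a rough illustration of the argument validation described above, the sketch below checks that every argument binding starts with input, output, or metadata. It uses plain TypeScript rather than the project's schema library, and the names SPAN_ARGUMENT_PREFIXES, hasValidSpanPrefix, and validateSpanArguments are hypothetical, not taken from this PR. Accepting a bare root such as "input" is also an assumption.

```typescript
// Hypothetical sketch: not the schema code from this PR.
const SPAN_ARGUMENT_PREFIXES = ["input", "output", "metadata"] as const;

// True when the path is a bare root ("input") or continues with "." or "[".
function hasValidSpanPrefix(path: string): boolean {
  return SPAN_ARGUMENT_PREFIXES.some(
    (prefix) =>
      path === prefix ||
      path.startsWith(`${prefix}.`) ||
      path.startsWith(`${prefix}[`),
  );
}

// Collects one error message per argument whose binding has an invalid prefix.
function validateSpanArguments(args: Record<string, string>): string[] {
  const errors: string[] = [];
  for (const [name, path] of Object.entries(args)) {
    if (!hasValidSpanPrefix(path.trim())) {
      errors.push(
        `Argument "${name}" must start with input, output, or metadata`,
      );
    }
  }
  return errors;
}
```

For example, a binding of "input.question" would pass, while "inputs.question" would be reported because "inputs" is not one of the three roots.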

UI Components:

  • Updated AddEditRuleDialog to:
    • Show "Span" scope option when either span LLM or span Python code is enabled
    • Display "Code metric" toggle for span scope when feature is enabled
    • Handle span Python code rule creation/editing
    • Use default Python code template for span scope
  • Updated RunEvaluationDialog to filter and display span Python code rules for span entity type
  • Updated PythonCodeRuleDetails to handle arguments for span scope (similar to trace scope)

Feature Toggle:

  • Added SPAN_USER_DEFINED_METRIC_PYTHON_ENABLED feature toggle key
  • Integrated toggle into FeatureTogglesProvider with default disabled state

Constants:

  • Added DEFAULT_PYTHON_CODE_SPAN_DATA with default Python code template for spans

Helper Functions:

  • Updated getUIRuleType() to map span_python_code to python_code UI type
  • Updated getUIRuleScope() to map span_python_code to span scope
  • Updated getBackendRuleType() to support span Python code (was previously disabled)
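
The mapping between backend rule types and the UI type/scope pair could look roughly like the sketch below. Only the value of span_python_code is quoted in this description; the python_code member, its string value, the UIRuleType and UIRuleScope types, and the function signatures are assumptions made for illustration.

```typescript
// Sketch only. The python_code value and both UI types are assumed.
enum EVALUATORS_RULE_TYPE {
  python_code = "user_defined_metric_python", // assumed value
  span_python_code = "span_user_defined_metric_python", // from the PR description
}

type UIRuleType = "python_code";
type UIRuleScope = "trace" | "span";

// Both Python rule types share one UI type; only the scope differs.
function getUIRuleType(type: EVALUATORS_RULE_TYPE): UIRuleType {
  switch (type) {
    case EVALUATORS_RULE_TYPE.python_code:
    case EVALUATORS_RULE_TYPE.span_python_code:
      return "python_code";
  }
}

function getUIRuleScope(type: EVALUATORS_RULE_TYPE): UIRuleScope {
  switch (type) {
    case EVALUATORS_RULE_TYPE.python_code:
      return "trace";
    case EVALUATORS_RULE_TYPE.span_python_code:
      return "span";
  }
}

// Inverse mapping: UI type plus scope back to the backend rule type.
function getBackendRuleType(
  uiType: UIRuleType,
  scope: UIRuleScope,
): EVALUATORS_RULE_TYPE {
  return scope === "span"
    ? EVALUATORS_RULE_TYPE.span_python_code
    : EVALUATORS_RULE_TYPE.python_code;
}
```

Keeping the two directions as inverses of each other means a rule loaded from the backend and saved again without edits keeps its original type.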

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-3211

Testing

Backend Tests:

  • ✅ Unit tests for OnlineScoringSpanUserDefinedMetricPythonScorer:
    • Verifies span scoring and result storage
    • Tests error handling for Python evaluation failures
    • Tests error handling for score storage failures
    • Validates proper log context propagation
  • ✅ Integration tests for OnlineScoringSpanSampler:
    • Verifies both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON evaluators are fetched
  • ✅ CRUD integration tests for span Python code rules:
    • Create, read, update, delete operations
    • Rule filtering and search functionality

Frontend Tests:

  • ✅ TypeScript compilation passes
  • ✅ Build succeeds
  • ✅ Linting passes (minor prettier warning on unrelated file, non-blocking)

Manual Testing Scenarios:

  1. Create a span-level Python code rule via UI
  2. Verify rule appears in automation rules list
  3. Trigger manual evaluation on spans using the rule
  4. Verify scores are stored at span level
  5. Verify scores appear in span details view

Documentation

  • Feature toggle documentation: spanUserDefinedMetricPythonEnabled toggle controls availability
  • API: Span Python code rules follow same pattern as trace/thread Python code rules
  • UI: Span scope option appears when feature toggle is enabled


- Add span_feedback_scores field to Trace API model
- Add span score aggregation logic to TraceDAO query
- Update TraceEnrichmentMapper to map aggregated span scores
- Add tests for span score aggregation
- Update testcontainers version to 2.0.2
- Fix getValue method to handle missing columns gracefully

[OPIK-3208] [FE] Display aggregated span feedback scores in trace detail view

- Add span_feedback_scores to Trace type definition
- Display aggregated span feedback scores in TraceDataViewer
- Add span feedback score chips in trace tree view
- Update TreeDetailsStore to support span_feedback_scores
- Remove individual span scores drill-down section
- Show 'Trace Feedback Scores' only when viewing a trace
- Show 'Feedback Scores' when viewing a span
…evel

- Make onDeleteFeedbackScore optional in FeedbackScoreTable
- Hide delete actions column when deletion is disabled
- Prevent deletion of aggregated span scores shown at trace level
…and-filters' of https://github.com/comet-ml/opik into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters
- Extract helper functions for value parsing and display logic in ValueCell
- Consolidate child row creation logic into reusable createChildRow function
- Export extractAuthorName helper and use it consistently in AuthorCell
- Simplify formatParentRowWithCounts by merging categorical/non-categorical logic
- Add spanId field to ValueEntry API model for proper span identification

This refactoring reduces code duplication by ~60 lines and improves maintainability
without changing functionality.
- Create TypeCell component to display span type with icon and label
- Component hides type for parent/aggregated rows
- Uses BaseTraceDataTypeIcon for consistent iconography
- Part of span feedback scores feature implementation
Backend changes:
- Add span feedback scores aggregation in TraceDAO using CTEs
- Include span_id, span_type, and category_name in value_by_author map
- Use composite keys (author_spanId) to preserve individual span scores
- Filter out '<no reason>' from aggregated reasons
- Add spanId field to ValueEntry API model
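
A minimal sketch of why the composite key matters: keyed by author alone, two scores from the same author on different spans would overwrite each other. The actual aggregation lives in the TraceDAO query; the TypeScript below only mirrors the idea, and SpanScoreEntry, compositeKey, buildValueByAuthor, and authorFromKey are hypothetical names.

```typescript
// Hypothetical shapes and helpers, for illustration only.
interface SpanScoreEntry {
  author: string;
  spanId: string;
  value: number;
}

// Composite key in the author_spanId form described above.
function compositeKey(author: string, spanId: string): string {
  return `${author}_${spanId}`;
}

// One map entry per (author, span) pair, so no span score is overwritten.
function buildValueByAuthor(
  entries: SpanScoreEntry[],
): Map<string, SpanScoreEntry> {
  const byAuthor = new Map<string, SpanScoreEntry>();
  for (const entry of entries) {
    byAuthor.set(compositeKey(entry.author, entry.spanId), entry);
  }
  return byAuthor;
}

// Recovers the author for display by stripping the known span id suffix.
// Using the span id avoids splitting on "_" inside the author name itself.
function authorFromKey(key: string, spanId?: string): string {
  const suffix = spanId ? `_${spanId}` : "";
  return suffix && key.endsWith(suffix)
    ? key.slice(0, key.length - suffix.length)
    : key;
}
```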

Frontend changes:
- Display aggregated span feedback scores in trace detail view
- Add hierarchical view with parent/child rows for multiple spans
- Add Type column to show span type (LLM, Tool, General, Guardrail)
- Support deletion of individual span scores using span_id
- Update table titles to 'Trace scores' and 'Span scores' (Sentence case)
- Fix author name display to handle composite keys correctly
- Update value formatting to show counts and averages for parent rows
- Remove span feedback scores tags from trace header and tree view
- Update tooltips to include 'span' suffix (e.g., 'LLM span')
- Fix 'All scores' summary to display actual values
- Handle empty reasons gracefully (don't show '<no reason>')

This feature allows users to see aggregated feedback scores from multiple spans
at the trace level while maintaining the ability to view and manage individual
span scores.
- Change sidebar title to 'Trace feedback scores' / 'Span feedback scores'
- Change 'All scores' section to show 'Trace scores' / 'Span scores' (remove 'All scores' prefix)
- Filter scores correctly: only show trace scores when viewing trace, only span scores when viewing span
- Update 'Your scores' header to include entity type (Trace/Span/Thread scores)
- Update Thread annotations to match the same pattern
- Replace 'span-detail' entityType with isAggregatedSpanScores prop for cleaner logic
- Extract helper functions to reduce code duplication (getStorageKeyType, getConfigurableColumnsWithoutType)
- Fix all linting errors and remove unused imports/variables
- Improve code organization and maintainability
…an-scores' into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters
- Added THREAD_ANNOTATION_QUEUE_IDS_ANALYTICS_DB constant for thread annotation queue IDs (ttaqi alias)
- Updated TRACE_THREAD_FIELDS_MAP to use ttaqi.annotation_queue_ids for TraceThreadField.ANNOTATION_QUEUE_IDS
- Updated SELECT_COUNT_TRACES_THREADS_BY_PROJECT_IDS to use ttaqi alias to match SELECT_TRACES_THREADS_BY_PROJECT_IDS
- Fixes FindTraceThreads.whenFilterByAnnotationQueueId__thenReturnThreadsWithMatchingTags test failure
- Add AutomationRuleEvaluatorSpanLlmAsJudge and related models
- Add SpansCreated event and publish it from SpanService
- Add OnlineScoringSpanSampler to sample spans for scoring
- Add OnlineScoringSpanLlmAsJudgeScorer to score spans using LLM
- Add SpanFilterEvaluationService for evaluating span filters
- Refactor FilterEvaluationServiceBase to reduce duplication
- Update AutomationRuleEvaluator to use List<Filter> instead of List<TraceFilter>
- Add AutomationRuleEvaluatorFiltersDeserializer to handle polymorphic filter deserialization
- Add comprehensive tests for span filter evaluation and deserializer
- Add service toggle for span LLM as Judge feature
- Update OnlineScoringEngine to support span scoring
- Add migration to extend automation_rule_evaluators type enum
- Add span scope option to rule creation UI
- Implement span filter builder with all span fields including duration, usage, cost, errors
- Add span field-binding UI with autocomplete for input/output/metadata paths
- Support is_empty and is_not_empty operators for dictionary filters in rule context
- Add comprehensive tests for IS_EMPTY and IS_NOT_EMPTY operators on span filters
- Fix OUTPUT_JSON field extraction to properly handle nested keys
- Ensure only custom LLM-as-judge template is available for span scope
- Add feature toggle support for span LLM-as-judge functionality
@github-actions github-actions bot added the typescript *.ts *.tsx label Dec 2, 2025
@thiagohora thiagohora changed the title from "Thiago/opik 3211 span level python scorer" to "[OPIK-3211] [BE] [FE] Add span-level Python metrics scorer" Dec 2, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Dec 2, 2025
- Restore migration 000037_add_scope_to_automation_rules.sql from main branch
  (it adds span_llm_as_judge evaluator type)
- Update migration 000038 to only add span_user_defined_metric_python type
  (span_llm_as_judge is already added by 000037)
…oringSpanSampler

- Check both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON toggles before fetching evaluators
- Prevents unnecessary database queries when features are disabled
- Update tests to explicitly verify toggle behavior for both evaluator types
- Add test to verify Python evaluators are not fetched when toggle is disabled
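
The toggle check described in this commit can be sketched as follows. The real sampler is Java; this TypeScript version only illustrates the control flow. spanUserDefinedMetricPythonEnabled is the toggle named in the PR description, while spanLlmAsJudgeEnabled, enabledSpanEvaluatorTypes, fetchSpanEvaluators, and the query callback are assumed names.

```typescript
type SpanEvaluatorType =
  | "SPAN_LLM_AS_JUDGE"
  | "SPAN_USER_DEFINED_METRIC_PYTHON";

interface SpanScoringToggles {
  spanLlmAsJudgeEnabled: boolean; // assumed toggle name
  spanUserDefinedMetricPythonEnabled: boolean; // named in the PR description
}

// Lists only the evaluator types whose feature toggle is on.
function enabledSpanEvaluatorTypes(
  toggles: SpanScoringToggles,
): SpanEvaluatorType[] {
  const types: SpanEvaluatorType[] = [];
  if (toggles.spanLlmAsJudgeEnabled) {
    types.push("SPAN_LLM_AS_JUDGE");
  }
  if (toggles.spanUserDefinedMetricPythonEnabled) {
    types.push("SPAN_USER_DEFINED_METRIC_PYTHON");
  }
  return types;
}

// Skips the (hypothetical) database query entirely when nothing is enabled.
async function fetchSpanEvaluators<T>(
  toggles: SpanScoringToggles,
  query: (types: SpanEvaluatorType[]) => Promise<T[]>,
): Promise<T[]> {
  const types = enabledSpanEvaluatorTypes(toggles);
  if (types.length === 0) {
    return [];
  }
  return query(types);
}
```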
- Remove input_json/output_json normalization logic from frontend
- Backend now stores filters in same format as sent by frontend
- Add CUSTOM field handling in filter evaluation services
- Convert CUSTOM fields only when evaluating filters, not when storing
- Add comprehensive tests for CUSTOM field evaluation
- Update extractNestedValue to handle JSON paths with array indices
- Support both bracket notation (messages[0].content) and dot notation (messages.0.content)
- Add tests for array index path navigation
- Fixes issue where filters like 'input.messages[0].content' were not working
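
One way to support both notations is to normalize brackets to dots before walking the value. The sketch below shows that approach in TypeScript; it is not the extractNestedValue implementation from this PR, and resolveJsonPath is a hypothetical name.

```typescript
// Illustrative sketch: resolves "messages[0].content" and "messages.0.content"
// against an already-parsed JSON value. Returns undefined when the path is missing.
function resolveJsonPath(root: unknown, path: string): unknown {
  // Rewrite bracket indices to dot segments: a[0].b -> a.0.b
  const segments = path
    .replace(/\[(\d+)\]/g, ".$1")
    .split(".")
    .filter((segment) => segment.length > 0);

  let current: unknown = root;
  for (const segment of segments) {
    if (Array.isArray(current)) {
      // Arrays accept only non-negative integer segments.
      const index = Number(segment);
      if (!Number.isInteger(index) || index < 0) {
        return undefined;
      }
      current = current[index];
    } else if (current !== null && typeof current === "object") {
      current = (current as Record<string, unknown>)[segment];
    } else {
      // Reached a primitive (or a missing value) before the path ended.
      return undefined;
    }
  }
  return current;
}
```

With an input of { messages: [{ content: "hi" }] }, both "messages[0].content" and "messages.0.content" resolve to the same string, and an out-of-range index yields undefined instead of throwing.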
Base automatically changed from thiaghora/OPIK-3210-fe to main December 2, 2025 14:04
@thiagohora thiagohora marked this pull request as ready for review December 2, 2025 14:10
@thiagohora thiagohora requested a review from a team as a code owner December 2, 2025 14:10
Copilot AI review requested due to automatic review settings December 2, 2025 14:10
Copilot AI left a comment

Pull request overview

This PR implements span-level Python metrics scoring functionality, extending the existing trace and thread-level Python evaluator capabilities to spans. The implementation adds backend services for consuming span Python evaluation messages from Redis streams, executing Python code, and storing span-level feedback scores. The frontend adds UI support for creating and managing span Python code rules through the automation rules interface, controlled by a new feature toggle.

Key changes:

  • Backend: New scorer service, message types, and DAO/mapper support for span Python evaluators
  • Frontend: UI components, schemas, and type definitions for span Python code rules
  • Infrastructure: Redis stream configuration and feature toggle for span Python evaluators

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| OnlineScoringSpanUserDefinedMetricPythonScorer.java | Core scorer service that consumes Redis messages and executes Python evaluations for spans |
| OnlineScoringSpanSampler.java | Updated to fetch and process both span LLM and span Python evaluators |
| ManualEvaluationService.java | Added span evaluation method and rule type validation for manual evaluations |
| AutomationRuleEvaluatorService.java | Added CRUD operations for span Python evaluator rules |
| AddEditRuleDialog.tsx | Updated UI to support creating/editing span Python code rules |
| schema.ts | Added validation schemas for span Python code rule forms |
| config.yml / config-test.yml | Added Redis stream configuration for span Python evaluator |
| AutomationRuleEvaluatorSpanUserDefinedMetricPython.java | New API model for span Python evaluator rules |
| SpanToScoreUserDefinedMetricPython.java | New event message type for span Python evaluations |
| Migration SQL | Database migration to add span_user_defined_metric_python enum value |

Comments suppressed due to low confidence (1)

apps/opik-frontend/src/components/pages-shared/automations/AddEditRuleDialog/AddEditRuleDialog.tsx:1

  • Duplicate key in object literal. Line 138 sets the span scope to DEFAULT_PYTHON_CODE_SPAN_DATA, and line 139 duplicates this assignment. The second assignment will overwrite the first, making line 138 unreachable.
import React, { useCallback, useEffect } from "react";

- Rename misleading variable spanLevelLlmAsJudgeRules to traceLevelLlmAsJudgeRules in ManualEvaluationService
- Fix incorrect getStream() method signature in OnlineScoringSpanUserDefinedMetricPythonScorerTest
- Remove duplicate key in AddEditRuleDialog DEFAULT_PYTHON_CODE_DATA object
@thiagohora thiagohora added the test-environment Deploy Opik adhoc environment label Dec 2, 2025
github-actions bot commented Dec 2, 2025

🔄 Test environment deployment started

Building images for PR #4297...

You can monitor the build progress here.

@CometActions

Test environment is now available!

Access Information

The deployment has completed successfully and the version has been verified.

@andriidudar andriidudar left a comment

In general, the code looks good. I left one comment related to the feature flag—looks like in some combinations it may not work properly. Once that’s fixed, we should be good to go.

  </ToggleGroupItem>
- {isCodeMetricEnabled && !isSpanScope ? (
+ {(isCodeMetricEnabled && !isSpanScope) ||
+ (isSpanScope && isSpanPythonCodeEnabled) ? (
It looks like this check isn’t very accurate right now—the code block still renders even when isCodeMetricEnabled = false. Could we tighten the condition so the block only shows when the flag is enabled?
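
One possible tightening, sketched under the assumption that isCodeMetricEnabled should gate the block in every scope and that isSpanPythonCodeEnabled is an additional requirement only for span scope. The helper name shouldShowCodeMetricToggle is hypothetical, and the intended semantics should be confirmed by the author.

```typescript
// Sketch of a stricter condition: the general flag must always be on,
// and span scope additionally requires the span Python flag.
function shouldShowCodeMetricToggle(
  isCodeMetricEnabled: boolean,
  isSpanScope: boolean,
  isSpanPythonCodeEnabled: boolean,
): boolean {
  return isCodeMetricEnabled && (!isSpanScope || isSpanPythonCodeEnabled);
}

// The condition from the diff, kept here only for comparison.
function currentCondition(
  isCodeMetricEnabled: boolean,
  isSpanScope: boolean,
  isSpanPythonCodeEnabled: boolean,
): boolean {
  return (
    (isCodeMetricEnabled && !isSpanScope) ||
    (isSpanScope && isSpanPythonCodeEnabled)
  );
}
```

The two differ in exactly one case: isCodeMetricEnabled = false with span scope selected and isSpanPythonCodeEnabled = true, where the current condition renders the block and the stricter one hides it.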

Labels

Backend Frontend java Pull requests that update Java code test-environment Deploy Opik adhoc environment tests Including test files, or tests related like configuration. typescript *.ts *.tsx

5 participants