[OPIK-3211] [BE] [FE] Add span-level Python metrics scorer #4297
base: main
… view
- Add span_feedback_scores field to Trace API model
- Add span score aggregation logic to TraceDAO query
- Update TraceEnrichmentMapper to map aggregated span scores
- Add tests for span score aggregation
- Update testcontainers version to 2.0.2
- Fix getValue method to handle missing columns gracefully

[OPIK-3208] [FE] Display aggregated span feedback scores in trace detail view
- Add span_feedback_scores to Trace type definition
- Display aggregated span feedback scores in TraceDataViewer
- Add span feedback score chips in trace tree view
- Update TreeDetailsStore to support span_feedback_scores
- Remove individual span scores drill-down section

- Show 'Trace Feedback Scores' only when viewing a trace
- Show 'Feedback Scores' when viewing a span

…evel
- Make onDeleteFeedbackScore optional in FeedbackScoreTable
- Hide delete actions column when deletion is disabled
- Prevent deletion of aggregated span scores shown at trace level
…pan-scores-and-filters
…elling, add JavaDoc
…and-filters' of https://github.com/comet-ml/opik into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters
…s when values exist
…-to-show-span-scores
- Extract helper functions for value parsing and display logic in ValueCell
- Consolidate child row creation logic into reusable createChildRow function
- Export extractAuthorName helper and use it consistently in AuthorCell
- Simplify formatParentRowWithCounts by merging categorical/non-categorical logic
- Add spanId field to ValueEntry API model for proper span identification
This refactoring reduces code duplication by ~60 lines and improves maintainability without changing functionality.

- Create TypeCell component to display span type with icon and label
- Component hides type for parent/aggregated rows
- Uses BaseTraceDataTypeIcon for consistent iconography
- Part of span feedback scores feature implementation

Backend changes:
- Add span feedback scores aggregation in TraceDAO using CTEs
- Include span_id, span_type, and category_name in value_by_author map
- Use composite keys (author_spanId) to preserve individual span scores
- Filter out '<no reason>' from aggregated reasons
- Add spanId field to ValueEntry API model
Frontend changes:
- Display aggregated span feedback scores in trace detail view
- Add hierarchical view with parent/child rows for multiple spans
- Add Type column to show span type (LLM, Tool, General, Guardrail)
- Support deletion of individual span scores using span_id
- Update table titles to 'Trace scores' and 'Span scores' (Sentence case)
- Fix author name display to handle composite keys correctly
- Update value formatting to show counts and averages for parent rows
- Remove span feedback scores tags from trace header and tree view
- Update tooltips to include 'span' suffix (e.g., 'LLM span')
- Fix 'All scores' summary to display actual values
- Handle empty reasons gracefully (don't show '<no reason>')
This feature allows users to see aggregated feedback scores from multiple spans at the trace level while maintaining the ability to view and manage individual span scores.
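The composite-key and parent-row ideas in the commit above can be illustrated with a minimal TypeScript sketch. All names here (SpanScoreEntry, toCompositeKey, extractAuthor, summarize) are hypothetical and are not the helpers in the PR. The sketch also assumes span IDs contain no underscore, which holds for UUIDs but is not confirmed by the source.

```typescript
// Hypothetical shape; the real ValueEntry model carries more fields.
interface SpanScoreEntry {
  value: number;
  spanId: string;
}

// Build an author_spanId key so that two spans scored by the same author
// do not overwrite each other in a map keyed by author.
function toCompositeKey(author: string, spanId: string): string {
  return `${author}_${spanId}`;
}

// Recover the author for display. Assumption: span IDs contain no underscore,
// so the last underscore separates the author from the span ID.
function extractAuthor(compositeKey: string): string {
  const idx = compositeKey.lastIndexOf("_");
  return idx === -1 ? compositeKey : compositeKey.slice(0, idx);
}

// Summarise the child rows into the count and average shown on a parent row.
function summarize(entries: Record<string, SpanScoreEntry>): { count: number; average: number } {
  const values = Object.keys(entries).map((key) => entries[key].value);
  const count = values.length;
  const average = count === 0 ? 0 : values.reduce((a, b) => a + b, 0) / count;
  return { count, average };
}
```

Splitting at the last underscore keeps author names that themselves contain underscores intact, which is one way the author display could break if the key were split at the first underscore instead.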
- Change sidebar title to 'Trace feedback scores' / 'Span feedback scores'
- Change 'All scores' section to show 'Trace scores' / 'Span scores' (remove 'All scores' prefix)
- Filter scores correctly: only show trace scores when viewing trace, only span scores when viewing span
- Update 'Your scores' header to include entity type (Trace/Span/Thread scores)
- Update Thread annotations to match the same pattern

- Replace 'span-detail' entityType with isAggregatedSpanScores prop for cleaner logic
- Extract helper functions to reduce code duplication (getStorageKeyType, getConfigurableColumnsWithoutType)
- Fix all linting errors and remove unused imports/variables
- Improve code organization and maintainability
…an-scores' into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters
…-to-show-span-scores
…an-scores' into thiaghora/OPIK-3209-adjust-trace-table-for-span-scores-and-filters
- Added THREAD_ANNOTATION_QUEUE_IDS_ANALYTICS_DB constant for thread annotation queue IDs (ttaqi alias)
- Updated TRACE_THREAD_FIELDS_MAP to use ttaqi.annotation_queue_ids for TraceThreadField.ANNOTATION_QUEUE_IDS
- Updated SELECT_COUNT_TRACES_THREADS_BY_PROJECT_IDS to use ttaqi alias to match SELECT_TRACES_THREADS_BY_PROJECT_IDS
- Fixes FindTraceThreads.whenFilterByAnnotationQueueId__thenReturnThreadsWithMatchingTags test failure

- Add AutomationRuleEvaluatorSpanLlmAsJudge and related models
- Add SpansCreated event and publish it from SpanService
- Add OnlineScoringSpanSampler to sample spans for scoring
- Add OnlineScoringSpanLlmAsJudgeScorer to score spans using LLM
- Add SpanFilterEvaluationService for evaluating span filters
- Refactor FilterEvaluationServiceBase to reduce duplication
- Update AutomationRuleEvaluator to use List<Filter> instead of List<TraceFilter>
- Add AutomationRuleEvaluatorFiltersDeserializer to handle polymorphic filter deserialization
- Add comprehensive tests for span filter evaluation and deserializer
- Add service toggle for span LLM as Judge feature
- Update OnlineScoringEngine to support span scoring
- Add migration to extend automation_rule_evaluators type enum

- Add span scope option to rule creation UI
- Implement span filter builder with all span fields including duration, usage, cost, errors
- Add span field-binding UI with autocomplete for input/output/metadata paths
- Support is_empty and is_not_empty operators for dictionary filters in rule context
- Add comprehensive tests for IS_EMPTY and IS_NOT_EMPTY operators on span filters
- Fix OUTPUT_JSON field extraction to properly handle nested keys
- Ensure only custom LLM-as-judge template is available for span scope
- Add feature toggle support for span LLM-as-judge functionality

- Restore migration 000037_add_scope_to_automation_rules.sql from main branch (it adds span_llm_as_judge evaluator type)
- Update migration 000038 to only add span_user_defined_metric_python type (span_llm_as_judge is already added by 000037)

…oringSpanSampler
- Check both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON toggles before fetching evaluators
- Prevents unnecessary database queries when features are disabled
- Update tests to explicitly verify toggle behavior for both evaluator types
- Add test to verify Python evaluators are not fetched when toggle is disabled
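The toggle check described in the commit above might look roughly like the following. The actual sampler is a Java class; this is only a TypeScript sketch of the control flow. The toggle name spanLlmAsJudgeEnabled, the function names, and the query callback are assumptions made for illustration.

```typescript
type SpanEvaluatorType = "span_llm_as_judge" | "span_user_defined_metric_python";

interface SpanToggles {
  spanLlmAsJudgeEnabled: boolean; // hypothetical toggle name
  spanUserDefinedMetricPythonEnabled: boolean;
}

// Collect only the evaluator types whose feature toggle is on.
function enabledSpanEvaluatorTypes(toggles: SpanToggles): SpanEvaluatorType[] {
  const types: SpanEvaluatorType[] = [];
  if (toggles.spanLlmAsJudgeEnabled) types.push("span_llm_as_judge");
  if (toggles.spanUserDefinedMetricPythonEnabled) types.push("span_user_defined_metric_python");
  return types;
}

// Skip the (hypothetical) database query entirely when no type is enabled.
function fetchSpanEvaluators<T>(
  toggles: SpanToggles,
  query: (types: SpanEvaluatorType[]) => T[],
): T[] {
  const types = enabledSpanEvaluatorTypes(toggles);
  return types.length === 0 ? [] : query(types);
}
```

Passing the query as a callback keeps the sketch testable: a test can count invocations to confirm that nothing is fetched while both toggles are off.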
- Remove input_json/output_json normalization logic from frontend
- Backend now stores filters in same format as sent by frontend
- Add CUSTOM field handling in filter evaluation services
- Convert CUSTOM fields only when evaluating filters, not when storing
- Add comprehensive tests for CUSTOM field evaluation

- Update extractNestedValue to handle JSON paths with array indices
- Support both bracket notation (messages[0].content) and dot notation (messages.0.content)
- Add tests for array index path navigation
- Fixes issue where filters like 'input.messages[0].content' were not working
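As an illustration of the path handling described in the commit above, here is a minimal TypeScript sketch that accepts both bracket and dot notation. It is not the PR's extractNestedValue; the function names and the choice to return undefined for a missing path are assumptions.

```typescript
// Turn "messages[0].content" or "messages.0.content" into ["messages", "0", "content"].
function toPathSegments(path: string): string[] {
  return path
    .replace(/\[(\d+)\]/g, ".$1") // rewrite bracket indices as dot segments
    .split(".")
    .filter((segment) => segment.length > 0);
}

// Walk the value one segment at a time; return undefined as soon as the path breaks.
function extractNestedValueSketch(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const segment of toPathSegments(path)) {
    if (current === null || typeof current !== "object") {
      return undefined;
    }
    if (Array.isArray(current)) {
      // Only digit-only segments may index into an array.
      if (!/^\d+$/.test(segment)) return undefined;
      current = current[Number(segment)];
    } else {
      current = (current as Record<string, unknown>)[segment];
    }
  }
  return current;
}
```

Normalising brackets to dots first means a single traversal loop serves both notations, so the two forms cannot drift apart in behaviour.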
…ithub.com/comet-ml/opik into thiago/OPIK-3211-span-level-python-scorer
Pull request overview
This PR implements span-level Python metrics scoring functionality, extending the existing trace and thread-level Python evaluator capabilities to spans. The implementation adds backend services for consuming span Python evaluation messages from Redis streams, executing Python code, and storing span-level feedback scores. The frontend adds UI support for creating and managing span Python code rules through the automation rules interface, controlled by a new feature toggle.
Key changes:
- Backend: New scorer service, message types, and DAO/mapper support for span Python evaluators
- Frontend: UI components, schemas, and type definitions for span Python code rules
- Infrastructure: Redis stream configuration and feature toggle for span Python evaluators
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| OnlineScoringSpanUserDefinedMetricPythonScorer.java | Core scorer service that consumes Redis messages and executes Python evaluations for spans |
| OnlineScoringSpanSampler.java | Updated to fetch and process both span LLM and span Python evaluators |
| ManualEvaluationService.java | Added span evaluation method and rule type validation for manual evaluations |
| AutomationRuleEvaluatorService.java | Added CRUD operations for span Python evaluator rules |
| AddEditRuleDialog.tsx | Updated UI to support creating/editing span Python code rules |
| schema.ts | Added validation schemas for span Python code rule forms |
| config.yml / config-test.yml | Added Redis stream configuration for span Python evaluator |
| AutomationRuleEvaluatorSpanUserDefinedMetricPython.java | New API model for span Python evaluator rules |
| SpanToScoreUserDefinedMetricPython.java | New event message type for span Python evaluations |
| Migration SQL | Database migration to add span_user_defined_metric_python enum value |
Comments suppressed due to low confidence (1)
apps/opik-frontend/src/components/pages-shared/automations/AddEditRuleDialog/AddEditRuleDialog.tsx:1
- Duplicate key in object literal. Line 138 sets the span scope to DEFAULT_PYTHON_CODE_SPAN_DATA, and line 139 duplicates this assignment. The second assignment will overwrite the first, making line 138 unreachable.
import React, { useCallback, useEffect } from "react";
apps/opik-backend/src/main/java/com/comet/opik/domain/evaluators/ManualEvaluationService.java
...m/comet/opik/api/resources/v1/events/OnlineScoringSpanUserDefinedMetricPythonScorerTest.java
- Rename misleading variable spanLevelLlmAsJudgeRules to traceLevelLlmAsJudgeRules in ManualEvaluationService
- Fix incorrect getStream() method signature in OnlineScoringSpanUserDefinedMetricPythonScorerTest
- Remove duplicate key in AddEditRuleDialog DEFAULT_PYTHON_CODE_DATA object
✅ Test environment is now available!
Access Information
The deployment has completed successfully and the version has been verified.
andriidudar left a comment
In general, the code looks good. I left one comment related to the feature flag—looks like in some combinations it may not work properly. Once that’s fixed, we should be good to go.
  </ToggleGroupItem>
- {isCodeMetricEnabled && !isSpanScope ? (
+ {(isCodeMetricEnabled && !isSpanScope) ||
+ (isSpanScope && isSpanPythonCodeEnabled) ? (
It looks like this check isn’t very accurate right now—the code block still renders even when isCodeMetricEnabled = false. Could we tighten the condition so the block only shows when the flag is enabled?
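To make the concern concrete, the sketch below puts the condition from the diff next to one possible tightening in which isCodeMetricEnabled gates both scopes. The tightened form is only one reading of the request, and the fix that was eventually applied may differ. The CodeMetricFlags interface and both function names are hypothetical.

```typescript
interface CodeMetricFlags {
  isCodeMetricEnabled: boolean;
  isSpanScope: boolean;
  isSpanPythonCodeEnabled: boolean;
}

// The condition as it appears in the diff under review.
function showsCodeOptionAsWritten(f: CodeMetricFlags): boolean {
  return (f.isCodeMetricEnabled && !f.isSpanScope) || (f.isSpanScope && f.isSpanPythonCodeEnabled);
}

// One possible tightening (an assumption): the base flag must always be on,
// and span scope additionally requires the span Python toggle.
function showsCodeOptionTightened(f: CodeMetricFlags): boolean {
  return f.isCodeMetricEnabled && (!f.isSpanScope || f.isSpanPythonCodeEnabled);
}
```

The two functions differ in exactly one combination: span scope with isSpanPythonCodeEnabled on and isCodeMetricEnabled off, which is the case the comment describes.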
Details
This PR implements span-level Python metrics scorer functionality, allowing users to evaluate individual spans within traces using custom Python code. This extends the existing trace-level and thread-level Python scoring capabilities to spans.
Backend Changes
Core Implementation:
- Added OnlineScoringSpanUserDefinedMetricPythonScorer service that:
  - Consumes SpanToScoreUserDefinedMetricPython messages from Redis streams
  - Executes the Python code via PythonEvaluatorService
  - Stores the resulting scores via FeedbackScoreService.scoreBatchOfSpans()
Infrastructure Updates:
- Updated OnlineScoringEngine.toReplacements() to support span context (input, output, metadata field bindings)
- Fixed OnlineScoringSpanSampler, which only fetched SPAN_LLM_AS_JUDGE evaluators; it now correctly fetches both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON evaluators
- Added Redis stream configuration for SPAN_USER_DEFINED_METRIC_PYTHON in config.yml and config-test.yml
- Added service toggle spanUserDefinedMetricPythonEnabled to control feature availability
- Updated ManualEvaluationService to respect the new toggle when enqueueing span Python evaluations
Testing:
- Added tests for OnlineScoringSpanUserDefinedMetricPythonScorer
- Updated OnlineScoringSpanSamplerTest to verify both evaluator types are fetched
- Added tests for AutomationRuleEvaluatorSpanUserDefinedMetricPython in AutomationRuleEvaluatorsResourceTest
Frontend Changes
Type System:
- Added span_python_code = "span_user_defined_metric_python" to EVALUATORS_RULE_TYPE enum
- Added PythonCodeDetailsSpanForm interface for span-specific Python code form data
- Updated PythonCodeObject and PythonCodeDetails types to include span form
Schema Validation:
- Added PythonCodeDetailsSpanFormSchema with argument validation (input/output/metadata prefixes)
- Added PythonCodeSpanEvaluationRuleFormSchema for span Python code rules
- Updated EvaluationRuleFormSchema discriminated union to include span Python code
UI Components:
- Updated AddEditRuleDialog to support creating and editing span Python code rules
- Updated RunEvaluationDialog to filter and display span Python code rules for span entity type
- Updated PythonCodeRuleDetails to handle arguments for span scope (similar to trace scope)
Feature Toggle:
- Added SPAN_USER_DEFINED_METRIC_PYTHON_ENABLED feature toggle key
- Updated FeatureTogglesProvider with default disabled state
Constants:
- Added DEFAULT_PYTHON_CODE_SPAN_DATA with default Python code template for spans
Helper Functions:
- Updated getUIRuleType() to map span_python_code to python_code UI type
- Updated getUIRuleScope() to map span_python_code to span scope
- Updated getBackendRuleType() to support span Python code (was previously disabled)
Change checklist
Issues
Testing
Backend Tests:
- OnlineScoringSpanUserDefinedMetricPythonScorer
- OnlineScoringSpanSampler: both SPAN_LLM_AS_JUDGE and SPAN_USER_DEFINED_METRIC_PYTHON evaluators are fetched
Frontend Tests:
Manual Testing Scenarios:
Documentation
- The spanUserDefinedMetricPythonEnabled toggle controls availability
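The rule-type mapping listed under Helper Functions in the Frontend Changes can be sketched as follows. Only the value "span_user_defined_metric_python" is quoted from the description; the trace-level value, the type names, and the function names are assumptions, and thread scope is left out for brevity.

```typescript
// The span value is quoted from the PR description; the trace value is an assumption.
const TRACE_PYTHON_TYPE = "user_defined_metric_python";
const SPAN_PYTHON_TYPE = "span_user_defined_metric_python";

type BackendPythonType = typeof TRACE_PYTHON_TYPE | typeof SPAN_PYTHON_TYPE;
type UiScope = "trace" | "span";

// Both backend types share one UI type; the scope carries the difference.
function toUiRule(type: BackendPythonType): { uiType: "python_code"; scope: UiScope } {
  return { uiType: "python_code", scope: type === SPAN_PYTHON_TYPE ? "span" : "trace" };
}

// Inverse mapping used when the form is submitted back to the backend.
function toBackendType(scope: UiScope): BackendPythonType {
  return scope === "span" ? SPAN_PYTHON_TYPE : TRACE_PYTHON_TYPE;
}
```

Keeping a single UI type with a separate scope lets the dialog reuse one Python code form for both levels while the backend still receives distinct evaluator types.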