feat(metrics): add comprehensive submission debugging metrics with env control #29
Conversation
Summary of Changes

Hello @auricom, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the monitoring capabilities for the data availability (DA) submission process by introducing a set of new Prometheus metrics. These metrics provide granular insight into the total number of submission attempts, distinguish between successful and failed submissions, and track the timestamps of these critical events. The changes aim to improve system observability, making it easier to monitor performance and troubleshoot issues related to DA submissions.
Code Review
This pull request introduces new Prometheus metrics to track DA submission attempts, including total attempts, failures, and timestamps for the last attempt and last successful submission. The changes are logical and the new metrics will provide valuable insights. My review focuses on improving logging consistency and test code quality.
Specifically, I've pointed out that the new debug logs use the standard log package, which is inconsistent with the project's use of zerolog. I've also suggested refactoring the new test function in pkg/metrics/metrics_test.go to improve its structure and reduce code duplication, which will make it more maintainable.
```go
if success {
    m.LastSuccessfulSubmissionTime.WithLabelValues(chainID, submissionType).Set(float64(now.Unix()))
    log.Printf("DEBUG: Successful submission - chain: %s, type: %s, timestamp: %d", chainID, submissionType, now.Unix())
} else {
    log.Printf("DEBUG: Failed submission attempt - chain: %s, type: %s, timestamp: %d", chainID, submissionType, now.Unix())
}
```
The use of log.Printf for debugging introduces an inconsistency with the project's logging approach, which appears to favor the structured logger zerolog (as seen in pkg/exporters/verifier/verifier.go). Using the standard log package can be problematic as it lacks levels and structured output.
For better consistency and maintainability, please consider one of the following:
- Replace these `log.Printf` calls with a proper `zerolog` logger instance. You might need to add a logger field to the `Metrics` struct or pass it into the methods.
- If these logs are for temporary debugging purposes, they should be removed before this PR is merged.
This comment applies to all new log.Printf calls added in this file.
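For illustration, a minimal sketch of the first option, assuming a `zerolog.Logger` is injected into the `Metrics` struct (the field name, the helper, and the log fields below are illustrative, not the project's actual API):

```go
package metrics

import (
    "time"

    "github.com/rs/zerolog"
)

// Metrics would carry an injected structured logger instead of calling the
// global standard-library logger (sketch; the existing Prometheus collectors are omitted).
type Metrics struct {
    logger zerolog.Logger
    // ... existing metric vectors ...
}

// logSubmission emits a leveled, structured debug entry for one submission attempt.
func (m *Metrics) logSubmission(chainID, submissionType string, success bool, now time.Time) {
    m.logger.Debug().
        Str("chain", chainID).
        Str("type", submissionType).
        Bool("success", success).
        Int64("timestamp", now.Unix()).
        Msg("submission attempt recorded")
}
```

Because zerolog entries already carry a level, the `EVMETRICS_DEBUG` gate could also be mapped to the logger's level instead of wrapping every call in a conditional.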
```go
func TestMetrics_RecordSubmissionAttempt(t *testing.T) {
    reg := prometheus.NewRegistry()
    m := NewWithRegistry("test", reg)

    // Test successful submission
    beforeAttempt := time.Now()
    m.RecordSubmissionAttempt("testchain", "header", true)
    afterAttempt := time.Now()

    // Verify counters
    attempts := getMetricValue(t, reg, "test_submission_attempts_total", map[string]string{
        "chain_id": "testchain",
        "type":     "header",
    })
    require.Equal(t, float64(1), attempts, "should have 1 attempt")

    // For successful submission, failures should be 0 (metric may not be exported if 0)
    // We'll check if the metric exists and has value 0, or doesn't exist (both are valid)
    failuresMetricFound := false
    failures := float64(0)
    metrics, err := reg.Gather()
    require.NoError(t, err)
    for _, mf := range metrics {
        if mf.GetName() == "test_submission_failures_total" {
            for _, m := range mf.GetMetric() {
                // check if labels match
                match := true
                for _, label := range m.GetLabel() {
                    if expectedVal, ok := map[string]string{"chain_id": "testchain", "type": "header"}[label.GetName()]; ok {
                        if label.GetValue() != expectedVal {
                            match = false
                            break
                        }
                    }
                }
                if match && len(m.GetLabel()) == 2 {
                    failuresMetricFound = true
                    if m.GetCounter() != nil {
                        failures = m.GetCounter().GetValue()
                    }
                    break
                }
            }
        }
    }
    if failuresMetricFound {
        require.Equal(t, float64(0), failures, "should have 0 failures")
    } // else: metric not exported because value is 0, which is expected behavior

    // Verify timestamps are within expected range
    lastAttemptTime := getMetricValue(t, reg, "test_last_submission_attempt_time", map[string]string{
        "chain_id": "testchain",
        "type":     "header",
    })
    require.GreaterOrEqual(t, lastAttemptTime, float64(beforeAttempt.Unix()))
    require.LessOrEqual(t, lastAttemptTime, float64(afterAttempt.Unix()))

    lastSuccessTime := getMetricValue(t, reg, "test_last_successful_submission_time", map[string]string{
        "chain_id": "testchain",
        "type":     "header",
    })
    require.GreaterOrEqual(t, lastSuccessTime, float64(beforeAttempt.Unix()))
    require.LessOrEqual(t, lastSuccessTime, float64(afterAttempt.Unix()))

    // Test failed submission
    beforeFailure := time.Now()
    m.RecordSubmissionAttempt("testchain", "data", false)
    afterFailure := time.Now()

    // Verify counters
    attempts = getMetricValue(t, reg, "test_submission_attempts_total", map[string]string{
        "chain_id": "testchain",
        "type":     "data",
    })
    require.Equal(t, float64(1), attempts, "should have 1 attempt for data")

    failures = getMetricValue(t, reg, "test_submission_failures_total", map[string]string{
        "chain_id": "testchain",
        "type":     "data",
    })
    require.Equal(t, float64(1), failures, "should have 1 failure for data")

    // Verify timestamps - should have attempt time but not success time for failed submission
    lastAttemptTime = getMetricValue(t, reg, "test_last_submission_attempt_time", map[string]string{
        "chain_id": "testchain",
        "type":     "data",
    })
    require.GreaterOrEqual(t, lastAttemptTime, float64(beforeFailure.Unix()))
    require.LessOrEqual(t, lastAttemptTime, float64(afterFailure.Unix()))

    // Last successful submission time should still be 0 for data type (never succeeded)
    // Gauge metrics with 0 values may not be exported, so we need to check if it exists
    var lastSuccessTimeData float64
    var successMetricFoundData bool
    metrics, err = reg.Gather()
    require.NoError(t, err)
    for _, mf := range metrics {
        if mf.GetName() == "test_last_successful_submission_time" {
            for _, m := range mf.GetMetric() {
                // check if labels match
                match := true
                for _, label := range m.GetLabel() {
                    if expectedVal, ok := map[string]string{"chain_id": "testchain", "type": "data"}[label.GetName()]; ok {
                        if label.GetValue() != expectedVal {
                            match = false
                            break
                        }
                    }
                }
                if match && len(m.GetLabel()) == 2 {
                    successMetricFoundData = true
                    if m.GetGauge() != nil {
                        lastSuccessTimeData = m.GetGauge().GetValue()
                    }
                    break
                }
            }
        }
    }
    if successMetricFoundData {
        require.Equal(t, float64(0), lastSuccessTimeData, "should have no successful submission time for data")
    } // else: metric not exported because value is 0, which is expected behavior
}
```
This test function is quite long and contains significant code duplication, especially for verifying metrics that might not be present (e.g., when their value is 0). This makes the test harder to read and maintain.
I suggest refactoring it to improve clarity and reusability:
- Use sub-tests: Split the logic for successful and failed submissions into separate sub-tests using `t.Run()`. This will improve test organization and make test failures easier to diagnose.
- Create a helper for optional metrics: The logic to check for metrics that might be zero (and thus not exported by Prometheus) is repeated. You could extract this into a helper function, for example `getMetricValueOptional`, that returns `(float64, bool)` to indicate whether the metric was found. This would greatly simplify the test body.
Here's a sketch of how the refactored test could look:
```go
func TestMetrics_RecordSubmissionAttempt(t *testing.T) {
    reg := prometheus.NewRegistry()
    m := NewWithRegistry("test", reg)

    t.Run("successful submission", func(t *testing.T) {
        beforeAttempt := time.Now()
        m.RecordSubmissionAttempt("testchain", "header", true)
        afterAttempt := time.Now()

        attempts := getMetricValue(t, reg, "test_submission_attempts_total", map[string]string{"chain_id": "testchain", "type": "header"})
        require.Equal(t, float64(1), attempts)

        // Using a new helper for optional metrics
        failures, found := getMetricValueOptional(t, reg, "test_submission_failures_total", map[string]string{"chain_id": "testchain", "type": "header"})
        if found {
            require.Equal(t, float64(0), failures)
        }

        // ... timestamp checks
    })

    t.Run("failed submission", func(t *testing.T) {
        // ... similar logic for failed case
    })
}
```
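To make that sketch compile, the suggested helper could look roughly like this (the name `getMetricValueOptional` comes from the review comment; the label-matching logic simply mirrors the inline checks in the original test and is otherwise an assumption):

```go
package metrics

import (
    "testing"

    "github.com/prometheus/client_golang/prometheus"
    dto "github.com/prometheus/client_model/go"
    "github.com/stretchr/testify/require"
)

// getMetricValueOptional returns the value of the metric with the given name
// and exact label set, plus a bool reporting whether the metric was exported
// at all (zero-valued children may simply be absent from the registry).
func getMetricValueOptional(t *testing.T, reg *prometheus.Registry, name string, labels map[string]string) (float64, bool) {
    t.Helper()
    families, err := reg.Gather()
    require.NoError(t, err)
    for _, mf := range families {
        if mf.GetName() != name {
            continue
        }
        for _, metric := range mf.GetMetric() {
            if !labelsMatch(metric.GetLabel(), labels) {
                continue
            }
            if c := metric.GetCounter(); c != nil {
                return c.GetValue(), true
            }
            if g := metric.GetGauge(); g != nil {
                return g.GetValue(), true
            }
        }
    }
    return 0, false
}

// labelsMatch reports whether the metric carries exactly the expected labels.
func labelsMatch(pairs []*dto.LabelPair, expected map[string]string) bool {
    if len(pairs) != len(expected) {
        return false
    }
    for _, p := range pairs {
        if v, ok := expected[p.GetName()]; !ok || v != p.GetValue() {
            return false
        }
    }
    return true
}
```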
Problem
The `ev_metrics_submission_*` metrics were sometimes not displayed even when Celestia RPC endpoints appeared to be configured properly. This made it difficult to debug submission issues.

Solution
This PR adds comprehensive debugging metrics and logging to provide complete visibility into the DA submission process, with environment variable control for debug logging.
Changes Made
New Metrics
- `submission_attempts_total` - Counter tracking all submission attempts (successful + failed)
- `submission_failures_total` - Counter tracking only failed submission attempts
- `last_submission_attempt_time` - Gauge recording the timestamp of the last attempt
- `last_successful_submission_time` - Gauge recording the timestamp of the last success
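As a rough illustration of how these four collectors could be declared and registered with the Prometheus Go client (the struct field names, `Help` strings, and the exact shape of `NewWithRegistry` are assumptions; the `chain_id`/`type` labels match the tests in this PR):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Metrics groups the new submission-debugging collectors (field names illustrative).
type Metrics struct {
    SubmissionAttempts           *prometheus.CounterVec
    SubmissionFailures           *prometheus.CounterVec
    LastSubmissionAttemptTime    *prometheus.GaugeVec
    LastSuccessfulSubmissionTime *prometheus.GaugeVec
}

// NewWithRegistry creates the collectors under the given namespace and registers
// them on reg (the real constructor presumably also sets up the exporter's
// pre-existing metrics, omitted here).
func NewWithRegistry(namespace string, reg *prometheus.Registry) *Metrics {
    labels := []string{"chain_id", "type"}
    m := &Metrics{
        SubmissionAttempts: prometheus.NewCounterVec(prometheus.CounterOpts{
            Namespace: namespace, Name: "submission_attempts_total",
            Help: "Total number of DA submission attempts (successful and failed).",
        }, labels),
        SubmissionFailures: prometheus.NewCounterVec(prometheus.CounterOpts{
            Namespace: namespace, Name: "submission_failures_total",
            Help: "Total number of failed DA submission attempts.",
        }, labels),
        LastSubmissionAttemptTime: prometheus.NewGaugeVec(prometheus.GaugeOpts{
            Namespace: namespace, Name: "last_submission_attempt_time",
            Help: "Unix timestamp of the last DA submission attempt.",
        }, labels),
        LastSuccessfulSubmissionTime: prometheus.NewGaugeVec(prometheus.GaugeOpts{
            Namespace: namespace, Name: "last_successful_submission_time",
            Help: "Unix timestamp of the last successful DA submission.",
        }, labels),
    }
    reg.MustRegister(m.SubmissionAttempts, m.SubmissionFailures,
        m.LastSubmissionAttemptTime, m.LastSuccessfulSubmissionTime)
    return m
}
```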
Environment Variable Control

- `EVMETRICS_DEBUG` environment variable to control debug logging
- `EVMETRICS_DEBUG=true` to enable verbose debug logging
- `EVMETRICS_DEBUG=false` or unset to disable debug logging
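A minimal sketch of the `isDebugEnabled()` helper listed under Code Changes below, assuming it simply reads the environment variable on each call (the exact accepted values are an assumption):

```go
package metrics

import (
    "os"
    "strings"
)

// isDebugEnabled reports whether verbose debug logging was requested via the
// EVMETRICS_DEBUG environment variable.
func isDebugEnabled() bool {
    return strings.EqualFold(os.Getenv("EVMETRICS_DEBUG"), "true")
}
```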
Enhanced Debug Logging

- Debug logging in the `RecordSubmissionAttempt` method
- Debug logging in the `RecordSubmissionDaHeight` method
- Gated by the `EVMETRICS_DEBUG` environment variable (see the sketch below)
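Putting the pieces together, a sketch of how `RecordSubmissionAttempt` could update the counters and gauges and gate its debug output on `isDebugEnabled()`. Only `LastSuccessfulSubmissionTime` and the method signature are confirmed by the diff and tests above; the other field names follow the earlier sketch, and the standard `log` and `time` packages are assumed to be imported:

```go
// RecordSubmissionAttempt records one DA submission attempt for the given chain
// and submission type, and emits a debug line only when EVMETRICS_DEBUG is enabled.
func (m *Metrics) RecordSubmissionAttempt(chainID, submissionType string, success bool) {
    now := time.Now()

    // Every call counts as an attempt and refreshes the last-attempt timestamp.
    m.SubmissionAttempts.WithLabelValues(chainID, submissionType).Inc()
    m.LastSubmissionAttemptTime.WithLabelValues(chainID, submissionType).Set(float64(now.Unix()))

    if success {
        m.LastSuccessfulSubmissionTime.WithLabelValues(chainID, submissionType).Set(float64(now.Unix()))
    } else {
        m.SubmissionFailures.WithLabelValues(chainID, submissionType).Inc()
    }

    if isDebugEnabled() {
        log.Printf("DEBUG: submission attempt - chain: %s, type: %s, success: %t, timestamp: %d",
            chainID, submissionType, success, now.Unix())
    }
}
```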
Code Changes

- `pkg/metrics/metrics.go`: new `isDebugEnabled()` helper function; new collectors on the `Metrics` struct, registered in `NewWithRegistry`; new `RecordSubmissionAttempt` method
- `pkg/exporters/verifier/verifier.go`: updated the `onVerified` method to call `RecordSubmissionAttempt`
- `pkg/metrics/metrics_test.go`: tests covering the `RecordSubmissionAttempt` functionality
Benefits

Before: when submissions failed, there was no visibility into whether attempts were happening, how often they failed, or when the last success occurred.

After: the new counters, timestamp gauges, and optional debug logs give complete visibility into every submission attempt and its outcome.
Usage
Enable debug logging:
```sh
export EVMETRICS_DEBUG=true
./ev-metrics
```

Disable debug logging (default):

```sh
unset EVMETRICS_DEBUG
./ev-metrics
```

Example Metrics
Working submissions:
Failing submissions:
Debug Log Examples (when EVMETRICS_DEBUG=true)
Testing
Breaking Changes
None - This is a purely additive change that enhances observability without modifying existing behavior.