
Conversation

@auricom (Collaborator) commented Dec 22, 2025

Problem

The ev_metrics_submission_* metrics were sometimes not exposed even when the Celestia RPC endpoints appeared to be configured correctly. This made submission issues difficult to debug.

Solution

This PR adds comprehensive debugging metrics and logging to provide complete visibility into the DA submission process, with environment variable control for debug logging.

Changes Made

New Metrics

  1. submission_attempts_total - Counter tracking all submission attempts (successful + failed)
  2. submission_failures_total - Counter tracking only failed submission attempts
  3. last_submission_attempt_time - Gauge recording timestamp of last attempt
  4. last_successful_submission_time - Gauge recording timestamp of last success
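
For illustration, the four metrics can be declared with prometheus/client_golang roughly as follows. This is a simplified sketch: the helper name registerSubmissionMetrics and the exact field layout are assumptions, and the real declarations live in pkg/metrics/metrics.go.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: declarations inferred from the PR description and the example
// metric output below; the actual code in pkg/metrics/metrics.go may differ.
func registerSubmissionMetrics(reg prometheus.Registerer, namespace string) {
	attempts := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: namespace, // "ev_metrics" in production, "test" in the unit tests
		Name:      "submission_attempts_total",
		Help:      "Total DA submission attempts (successful and failed).",
	}, []string{"chain_id", "type"})
	failures := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: namespace,
		Name:      "submission_failures_total",
		Help:      "Failed DA submission attempts.",
	}, []string{"chain_id", "type"})
	lastAttempt := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: namespace,
		Name:      "last_submission_attempt_time",
		Help:      "Unix timestamp of the last DA submission attempt.",
	}, []string{"chain_id", "type"})
	lastSuccess := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: namespace,
		Name:      "last_successful_submission_time",
		Help:      "Unix timestamp of the last successful DA submission.",
	}, []string{"chain_id", "type"})
	reg.MustRegister(attempts, failures, lastAttempt, lastSuccess)
}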

Environment Variable Control

  • Added EVMETRICS_DEBUG environment variable to control debug logging
  • Set EVMETRICS_DEBUG=true to enable verbose debug logging
  • Set EVMETRICS_DEBUG=false or unset to disable debug logging
  • Default behavior: debug logging is disabled unless explicitly enabled
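
For illustration, the gate can be as simple as the sketch below; the exact truthy values accepted by the real isDebugEnabled() helper in pkg/metrics/metrics.go may differ.

package metrics

import (
	"os"
	"strings"
)

// isDebugEnabled reports whether EVMETRICS_DEBUG is set to "true".
// Sketch only: the shipped helper may accept additional truthy values.
func isDebugEnabled() bool {
	return strings.EqualFold(os.Getenv("EVMETRICS_DEBUG"), "true")
}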

Enhanced Debug Logging

  • Added detailed debug logs in RecordSubmissionAttempt method
  • Added debug logs in RecordSubmissionDaHeight method
  • Logs include chain ID, submission type, heights, and timestamps
  • Separate logging for successful vs failed attempts
  • All debug logging respects the EVMETRICS_DEBUG environment variable

Code Changes

  1. pkg/metrics/metrics.go:

    • Added isDebugEnabled() helper function
    • Added new metric fields to Metrics struct
    • Initialized new metrics in NewWithRegistry
    • Added RecordSubmissionAttempt method
    • Enhanced existing debug logging with environment variable control
  2. pkg/exporters/verifier/verifier.go:

    • Updated onVerified method to call RecordSubmissionAttempt
    • Both successful and failed submissions now tracked
  3. pkg/metrics/metrics_test.go:

    • Added comprehensive tests for new RecordSubmissionAttempt functionality
    • Tests cover both successful and failed submission scenarios
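
For orientation, RecordSubmissionAttempt behaves roughly as sketched below. The field names (SubmissionAttemptsTotal, SubmissionFailuresTotal, LastSubmissionAttemptTime) are inferred from the metric names above and from the review snippet further down, so treat this as a sketch rather than the merged code.

// Sketch of the behavior described above; not copied verbatim from the PR.
func (m *Metrics) RecordSubmissionAttempt(chainID, submissionType string, success bool) {
	now := time.Now()
	m.SubmissionAttemptsTotal.WithLabelValues(chainID, submissionType).Inc()
	m.LastSubmissionAttemptTime.WithLabelValues(chainID, submissionType).Set(float64(now.Unix()))
	if success {
		m.LastSuccessfulSubmissionTime.WithLabelValues(chainID, submissionType).Set(float64(now.Unix()))
		if isDebugEnabled() {
			log.Printf("DEBUG: Successful submission - chain: %s, type: %s, timestamp: %d", chainID, submissionType, now.Unix())
		}
		return
	}
	m.SubmissionFailuresTotal.WithLabelValues(chainID, submissionType).Inc()
	if isDebugEnabled() {
		log.Printf("DEBUG: Failed submission attempt - chain: %s, type: %s, timestamp: %d", chainID, submissionType, now.Unix())
	}
}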

Benefits

Before: When submissions failed, there was no visibility into:

  • Whether submissions were being attempted at all
  • How many attempts were made
  • When the last attempt occurred
  • Why submissions might be failing

After: Now you get complete visibility:

  • See all submission attempts, even if they fail
  • Track failure rates and patterns
  • Monitor timestamps to identify stalls or delays
  • Debug logs provide immediate feedback (when enabled)
  • Environment variable control for production vs debugging modes

Usage

Enable debug logging:

export EVMETRICS_DEBUG=true
./ev-metrics

Disable debug logging (default):

unset EVMETRICS_DEBUG
./ev-metrics

Example Metrics

Working submissions:

ev_metrics_submission_attempts_total{chain_id="yourchain",type="header"} 10
ev_metrics_submission_failures_total{chain_id="yourchain",type="header"} 0
ev_metrics_last_submission_attempt_time{chain_id="yourchain",type="header"} 1766418175
ev_metrics_last_successful_submission_time{chain_id="yourchain",type="header"} 1766418175

Failing submissions:

ev_metrics_submission_attempts_total{chain_id="yourchain",type="data"} 5
ev_metrics_submission_failures_total{chain_id="yourchain",type="data"} 5
ev_metrics_last_submission_attempt_time{chain_id="yourchain",type="data"} 1766418176
ev_metrics_last_successful_submission_time{chain_id="yourchain",type="data"} 0

Debug Log Examples (when EVMETRICS_DEBUG=true)

DEBUG: Successful submission - chain: yourchain, type: header, timestamp: 1766418175
DEBUG: Failed submission attempt - chain: yourchain, type: data, timestamp: 1766418176
DEBUG: Recorded header submission DA height - chain: yourchain, height: 100
DEBUG: Data DA height 50 not higher than previous 75 for chain yourchain

Testing

  • All existing tests continue to pass
  • Added comprehensive tests for new functionality
  • Test coverage includes both success and failure scenarios
  • Edge cases handled (metrics with zero values, etc.)

Breaking Changes

None - This is a purely additive change that enhances observability without modifying existing behavior.

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @auricom, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the monitoring capabilities for the data availability (DA) submission process by introducing a set of new Prometheus metrics. These metrics provide granular insights into the total number of submission attempts, distinguish between successful and failed submissions, and track the timestamps of these critical events. The changes aim to improve system observability, making it easier to monitor performance and troubleshoot issues related to DA submissions.

Highlights

  • New Submission Attempt Metrics: Introduced new Prometheus metrics to track total submission attempts, failed attempts, and the timestamps of the last attempt and last successful attempt, enhancing observability for the data availability (DA) submission process.
  • Integration into Verifier: The onVerified function in the verifier (pkg/exporters/verifier/verifier.go) now records submission attempts (success or failure) using the newly added metrics.
  • Enhanced Debug Logging: Added debug logging for DA height recording and submission attempts within pkg/metrics/metrics.go to provide more detailed insights into the system's behavior.
  • Comprehensive Testing: Included a new test suite (TestMetrics_RecordSubmissionAttempt in pkg/metrics/metrics_test.go) to thoroughly validate the functionality of the new submission attempt metrics and their recording logic, covering both successful and failed scenarios.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces new Prometheus metrics to track DA submission attempts, including total attempts, failures, and timestamps for the last attempt and last successful submission. The changes are logical and the new metrics will provide valuable insights. My review focuses on improving logging consistency and test code quality.

Specifically, I've pointed out that the new debug logs use the standard log package, which is inconsistent with the project's use of zerolog. I've also suggested refactoring the new test function in pkg/metrics/metrics_test.go to improve its structure and reduce code duplication, which will make it more maintainable.

Comment on lines 357 to 362
if success {
	m.LastSuccessfulSubmissionTime.WithLabelValues(chainID, submissionType).Set(float64(now.Unix()))
	log.Printf("DEBUG: Successful submission - chain: %s, type: %s, timestamp: %d", chainID, submissionType, now.Unix())
} else {
	log.Printf("DEBUG: Failed submission attempt - chain: %s, type: %s, timestamp: %d", chainID, submissionType, now.Unix())
}

Severity: medium

The use of log.Printf for debugging introduces an inconsistency with the project's logging approach, which appears to favor the structured logger zerolog (as seen in pkg/exporters/verifier/verifier.go). Using the standard log package can be problematic as it lacks levels and structured output.

For better consistency and maintainability, please consider one of the following:

  • Replace these log.Printf calls with a proper zerolog logger instance. You might need to add a logger field to the Metrics struct or pass it into the methods.
  • If these logs are for temporary debugging purposes, they should be removed before this PR is merged.

This comment applies to all new log.Printf calls added in this file.
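
For example, the first option could look roughly like this, assuming a zerolog.Logger field (named logger here purely for illustration) is added to the Metrics struct:

// Sketch only: assumes `logger zerolog.Logger` exists on Metrics.
if success {
	m.LastSuccessfulSubmissionTime.WithLabelValues(chainID, submissionType).Set(float64(now.Unix()))
	m.logger.Debug().
		Str("chain_id", chainID).
		Str("type", submissionType).
		Int64("timestamp", now.Unix()).
		Msg("successful submission")
} else {
	m.logger.Debug().
		Str("chain_id", chainID).
		Str("type", submissionType).
		Int64("timestamp", now.Unix()).
		Msg("failed submission attempt")
}

With zerolog, debug events are dropped automatically when the logger level is above debug, so the EVMETRICS_DEBUG check could be reduced to setting the level once at startup.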

Comment on lines +566 to +688
func TestMetrics_RecordSubmissionAttempt(t *testing.T) {
	reg := prometheus.NewRegistry()
	m := NewWithRegistry("test", reg)

	// Test successful submission
	beforeAttempt := time.Now()
	m.RecordSubmissionAttempt("testchain", "header", true)
	afterAttempt := time.Now()

	// Verify counters
	attempts := getMetricValue(t, reg, "test_submission_attempts_total", map[string]string{
		"chain_id": "testchain",
		"type":     "header",
	})
	require.Equal(t, float64(1), attempts, "should have 1 attempt")

	// For successful submission, failures should be 0 (metric may not be exported if 0)
	// We'll check if the metric exists and has value 0, or doesn't exist (both are valid)
	failuresMetricFound := false
	failures := float64(0)
	metrics, err := reg.Gather()
	require.NoError(t, err)
	for _, mf := range metrics {
		if mf.GetName() == "test_submission_failures_total" {
			for _, m := range mf.GetMetric() {
				// check if labels match
				match := true
				for _, label := range m.GetLabel() {
					if expectedVal, ok := map[string]string{"chain_id": "testchain", "type": "header"}[label.GetName()]; ok {
						if label.GetValue() != expectedVal {
							match = false
							break
						}
					}
				}
				if match && len(m.GetLabel()) == 2 {
					failuresMetricFound = true
					if m.GetCounter() != nil {
						failures = m.GetCounter().GetValue()
					}
					break
				}
			}
		}
	}
	if failuresMetricFound {
		require.Equal(t, float64(0), failures, "should have 0 failures")
	} // else: metric not exported because value is 0, which is expected behavior

	// Verify timestamps are within expected range
	lastAttemptTime := getMetricValue(t, reg, "test_last_submission_attempt_time", map[string]string{
		"chain_id": "testchain",
		"type":     "header",
	})
	require.GreaterOrEqual(t, lastAttemptTime, float64(beforeAttempt.Unix()))
	require.LessOrEqual(t, lastAttemptTime, float64(afterAttempt.Unix()))

	lastSuccessTime := getMetricValue(t, reg, "test_last_successful_submission_time", map[string]string{
		"chain_id": "testchain",
		"type":     "header",
	})
	require.GreaterOrEqual(t, lastSuccessTime, float64(beforeAttempt.Unix()))
	require.LessOrEqual(t, lastSuccessTime, float64(afterAttempt.Unix()))

	// Test failed submission
	beforeFailure := time.Now()
	m.RecordSubmissionAttempt("testchain", "data", false)
	afterFailure := time.Now()

	// Verify counters
	attempts = getMetricValue(t, reg, "test_submission_attempts_total", map[string]string{
		"chain_id": "testchain",
		"type":     "data",
	})
	require.Equal(t, float64(1), attempts, "should have 1 attempt for data")

	failures = getMetricValue(t, reg, "test_submission_failures_total", map[string]string{
		"chain_id": "testchain",
		"type":     "data",
	})
	require.Equal(t, float64(1), failures, "should have 1 failure for data")

	// Verify timestamps - should have attempt time but not success time for failed submission
	lastAttemptTime = getMetricValue(t, reg, "test_last_submission_attempt_time", map[string]string{
		"chain_id": "testchain",
		"type":     "data",
	})
	require.GreaterOrEqual(t, lastAttemptTime, float64(beforeFailure.Unix()))
	require.LessOrEqual(t, lastAttemptTime, float64(afterFailure.Unix()))

	// Last successful submission time should still be 0 for data type (never succeeded)
	// Gauge metrics with 0 values may not be exported, so we need to check if it exists
	var lastSuccessTimeData float64
	var successMetricFoundData bool
	metrics, err = reg.Gather()
	require.NoError(t, err)
	for _, mf := range metrics {
		if mf.GetName() == "test_last_successful_submission_time" {
			for _, m := range mf.GetMetric() {
				// check if labels match
				match := true
				for _, label := range m.GetLabel() {
					if expectedVal, ok := map[string]string{"chain_id": "testchain", "type": "data"}[label.GetName()]; ok {
						if label.GetValue() != expectedVal {
							match = false
							break
						}
					}
				}
				if match && len(m.GetLabel()) == 2 {
					successMetricFoundData = true
					if m.GetGauge() != nil {
						lastSuccessTimeData = m.GetGauge().GetValue()
					}
					break
				}
			}
		}
	}
	if successMetricFoundData {
		require.Equal(t, float64(0), lastSuccessTimeData, "should have no successful submission time for data")
	} // else: metric not exported because value is 0, which is expected behavior
}

Severity: medium

This test function is quite long and contains significant code duplication, especially for verifying metrics that might not be present (e.g., when their value is 0). This makes the test harder to read and maintain.

I suggest refactoring it to improve clarity and reusability:

  1. Use Sub-tests: Split the logic for successful and failed submissions into separate sub-tests using t.Run(). This will improve test organization and make test failures easier to diagnose.

  2. Create a Helper for Optional Metrics: The logic to check for metrics that might be zero (and thus not exported by Prometheus) is repeated. You could extract this into a helper function, for example getMetricValueOptional, that returns (float64, bool) to indicate if the metric was found. This would greatly simplify the test body.

Here's a sketch of how the refactored test could look:

func TestMetrics_RecordSubmissionAttempt(t *testing.T) {
	reg := prometheus.NewRegistry()
	m := NewWithRegistry("test", reg)

	t.Run("successful submission", func(t *testing.T) {
		beforeAttempt := time.Now()
		m.RecordSubmissionAttempt("testchain", "header", true)
		afterAttempt := time.Now()

		attempts := getMetricValue(t, reg, "test_submission_attempts_total", map[string]string{"chain_id": "testchain", "type": "header"})
		require.Equal(t, float64(1), attempts)

		// Using a new helper for optional metrics
		failures, found := getMetricValueOptional(t, reg, "test_submission_failures_total", map[string]string{"chain_id": "testchain", "type": "header"})
		if found {
			require.Equal(t, float64(0), failures)
		}

		// ... timestamp checks
	})

	t.Run("failed submission", func(t *testing.T) {
		// ... similar logic for failed case
	})
}
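
And a getMetricValueOptional helper along those lines could be sketched as follows; the name and signature are only a suggestion, and it reuses the Gather-and-match logic already present in the original test:

// getMetricValueOptional returns the value of the metric with the given name
// and labels, plus a bool indicating whether the metric was exported at all.
func getMetricValueOptional(t *testing.T, reg *prometheus.Registry, name string, labels map[string]string) (float64, bool) {
	t.Helper()
	families, err := reg.Gather()
	require.NoError(t, err)
	for _, mf := range families {
		if mf.GetName() != name {
			continue
		}
		for _, m := range mf.GetMetric() {
			match := true
			for _, label := range m.GetLabel() {
				if expected, ok := labels[label.GetName()]; ok && label.GetValue() != expected {
					match = false
					break
				}
			}
			if !match || len(m.GetLabel()) != len(labels) {
				continue
			}
			if c := m.GetCounter(); c != nil {
				return c.GetValue(), true
			}
			if g := m.GetGauge(); g != nil {
				return g.GetValue(), true
			}
			return 0, true
		}
	}
	return 0, false
}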

@auricom force-pushed the claude/submission_fix branch from 7527aa3 to 5ecb0d0 on December 22, 2025 at 16:17
@auricom changed the title from "wip" to "feat(metrics): add comprehensive submission debugging metrics with env control" on Dec 22, 2025
@auricom merged commit d9c4ae2 into main on Dec 22, 2025
3 checks passed
auricom added a commit that referenced this pull request Dec 22, 2025
auricom added a commit that referenced this pull request Dec 22, 2025