
Add auto-disable for explorers with unreachable RPCs#437

Merged
antoinedc merged 3 commits into develop from feature/auto-disable-sync-on-rpc-failures on Jan 9, 2026

Conversation

antoinedc (Member) commented on Jan 9, 2026

User description

Summary

  • Adds automatic sync deactivation after 3 consecutive RPC failures or PM2 timeouts
  • Reduces unnecessary PM2 server load from explorers with unreachable RPCs
  • Implements exponential backoff recovery checks (5m → 15m → 1h → 6h)
  • Resets failure tracking when sync is manually re-enabled

Changes

New files:

  • run/migrations/20260109161142-add-sync-failure-tracking.js - Adds tracking columns to explorers table
  • run/jobs/syncRecoveryCheck.js - Periodic recovery check job with exponential backoff
  • run/tests/jobs/syncRecoveryCheck.test.js - Tests for recovery job
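As a rough illustration of the new migration listed above, here is a hedged sketch using Sequelize's queryInterface API; the column names come from the PR description, while types and defaults are assumptions (a recoveryAttempts column was added in a follow-up commit):

'use strict';

// Sketch only: adds the sync failure tracking columns described in this PR.
// Types and defaults are assumptions; the actual migration may differ.
module.exports = {
    async up(queryInterface, Sequelize) {
        await queryInterface.addColumn('explorers', 'syncFailedAttempts', {
            type: Sequelize.INTEGER,
            allowNull: false,
            defaultValue: 0
        });
        await queryInterface.addColumn('explorers', 'syncDisabledAt', { type: Sequelize.DATE });
        await queryInterface.addColumn('explorers', 'syncDisabledReason', { type: Sequelize.STRING });
        await queryInterface.addColumn('explorers', 'nextRecoveryCheckAt', { type: Sequelize.DATE });
    },

    async down(queryInterface) {
        await queryInterface.removeColumn('explorers', 'syncFailedAttempts');
        await queryInterface.removeColumn('explorers', 'syncDisabledAt');
        await queryInterface.removeColumn('explorers', 'syncDisabledReason');
        await queryInterface.removeColumn('explorers', 'nextRecoveryCheckAt');
    }
};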

Modified files:

  • run/models/explorer.js - New fields and helper methods for failure tracking
  • run/jobs/updateExplorerSyncingProcess.js - Track failures and auto-disable
  • run/jobs/index.js - Register new job
  • run/scheduler.js - Schedule recovery job (runs every 5 minutes)
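For the scheduler change, a heavily hedged sketch of how a 5-minute repeatable job could be registered, assuming a Bull-style queue (the actual run/scheduler.js wiring, queue names, and priorities list are not shown in this PR and may differ):

const Queue = require('bull');

// Sketch only: enqueue syncRecoveryCheck as a repeatable job every 5 minutes.
// The queue name and Redis connection handling here are assumptions.
const syncRecoveryCheckQueue = new Queue('syncRecoveryCheck', process.env.REDIS_URL);
syncRecoveryCheckQueue.add({}, { repeat: { cron: '*/5 * * * *' } });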

How it works

  1. Detection: When RPC is unreachable or PM2 times out, increment syncFailedAttempts
  2. Auto-disable: After 3 failures, set shouldSync=false with reason
  3. Recovery: syncRecoveryCheck tests disabled explorers and re-enables when RPC is back
  4. Backoff: Recovery checks use exponential backoff to avoid hammering dead RPCs
  5. Manual reset: Re-enabling sync via UI resets all failure tracking
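A minimal sketch of steps 1–2 above, written as instance helpers on the Explorer model; the method and field names follow the PR description and the review excerpts further down, but the actual run/models/explorer.js implementation may differ:

const { Model } = require('sequelize');

const SYNC_FAILURE_THRESHOLD = 3;

// Sketch only: model init/associations are omitted and assumed to be defined elsewhere in run/models/explorer.js.
class Explorer extends Model {
    // Called when the RPC is unreachable or PM2 times out.
    async incrementSyncFailures(reason = 'rpc_unreachable') {
        // DB-level atomic increment avoids the lost-update race flagged in the review comments below.
        await this.increment('syncFailedAttempts');
        await this.reload();

        if (this.syncFailedAttempts >= SYNC_FAILURE_THRESHOLD) {
            await this.autoDisableSync(reason);
            return { disabled: true, attempts: this.syncFailedAttempts };
        }
        return { disabled: false, attempts: this.syncFailedAttempts };
    }

    // Flips shouldSync off, records the reason, and schedules the first recovery check.
    async autoDisableSync(reason) {
        await this.update({
            shouldSync: false,
            syncDisabledAt: new Date(),
            syncDisabledReason: reason,
            nextRecoveryCheckAt: new Date(Date.now() + 5 * 60 * 1000) // first backoff step: 5 minutes
        });
    }
}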

Test plan

  • Run updateExplorerSyncingProcess tests - 16 passed
  • Run syncRecoveryCheck tests - 5 passed
  • Run all job tests - 228 passed
  • Run migration on staging
  • Monitor logs for auto-disable events

🤖 Generated with Claude Code


CodeAnt-AI Description

Auto-disable explorer sync after repeated RPC or PM2 failures and automatically recover with exponential backoff

What Changed

  • Explorers are automatically disabled for syncing after 3 consecutive RPC unreachable errors or PM2 timeouts; the system records failure counts and a disable reason.
  • When an explorer is auto-disabled, a recovery schedule is created and a new periodic recovery job checks disabled explorers and re-enables syncing when the explorer's RPC responds.
  • Recovery checks use exponential backoff intervals (5m → 15m → 1h → 6h) to avoid frequent retries; each recovery attempt updates the next scheduled check.
  • Manually enabling sync or successfully starting/resuming a sync process resets failure counters and clears the disabled state.
  • Database migration adds columns to track failure count, disable timestamp, disable reason, and next recovery check; the scheduler now runs the recovery job every 5 minutes.
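To make the recovery flow concrete, here is a hedged sketch of what the periodic recovery job could look like. ProviderConnector, the module paths, and the query shapes are assumptions drawn from the review comments; the real run/jobs/syncRecoveryCheck.js (which also gained batching, jitter, and a max-attempts cap in a follow-up commit) differs:

// Assumed imports; actual module paths in the repo may differ.
const { Op } = require('sequelize');
const { Explorer } = require('../models');
const { ProviderConnector } = require('../lib/rpc');

// Backoff schedule described in this PR: 5m → 15m → 1h → 6h (then stays at 6h).
const BACKOFF_MS = [5, 15, 60, 360].map(minutes => minutes * 60 * 1000);

module.exports = async () => {
    // Only consider explorers that were auto-disabled and are due for a recovery check.
    const explorers = await Explorer.findAll({
        where: {
            shouldSync: false,
            syncDisabledAt: { [Op.ne]: null },
            nextRecoveryCheckAt: { [Op.lte]: new Date() }
        },
        include: 'workspace'
    });

    for (const explorer of explorers) {
        try {
            // Probe the RPC; fetchLatestBlock is assumed to reject when the endpoint is unreachable.
            const provider = new ProviderConnector(explorer.workspace.rpcServer);
            await provider.fetchLatestBlock();

            // RPC responded: re-enable syncing and clear the failure tracking state.
            await explorer.update({
                shouldSync: true,
                syncFailedAttempts: 0,
                syncDisabledAt: null,
                syncDisabledReason: null,
                nextRecoveryCheckAt: null
            });
        } catch (_error) {
            // Still unreachable: push the next check out along the backoff schedule.
            const attempts = explorer.recoveryAttempts || 0;
            const delay = BACKOFF_MS[Math.min(attempts, BACKOFF_MS.length - 1)];
            await explorer.update({
                recoveryAttempts: attempts + 1,
                nextRecoveryCheckAt: new Date(Date.now() + delay)
            });
        }
    }
};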

Impact

✅ Fewer PM2 restarts and related server load
✅ Fewer repeated sync attempts against unreachable RPCs
✅ Automatic re-enable when an explorer's RPC recovers

When an explorer's RPC becomes unreachable, the explorerSyncCheck job
keeps trying to start PM2 processes that immediately fail. This creates
unnecessary load on the PM2 server.

This change adds automatic sync deactivation after 3 consecutive failures:

- Track sync failures on Explorer model (syncFailedAttempts, syncDisabledAt,
  syncDisabledReason, nextRecoveryCheckAt)
- Auto-disable shouldSync after 3 consecutive RPC failures or PM2 timeouts
- Add syncRecoveryCheck job that periodically tests disabled explorers
- Use exponential backoff for recovery checks (5m -> 15m -> 1h -> 6h)
- Reset failure tracking when sync is manually re-enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codeant-ai bot commented Jan 9, 2026

CodeAnt AI is reviewing your PR.


codeant-ai bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) on Jan 9, 2026

codeant-ai bot commented Jan 9, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Null access on reset
    The handler immediately uses explorer.slug when data.reset is true without verifying that explorer exists. If data.reset can be set for explorers that do not exist (or if Explorer.findOne returned null), accessing explorer.slug will throw. Ensure explorer is validated before using its properties or handle the reset case differently.

  • Race condition
    incrementSyncFailures performs a non-atomic read-modify-write on syncFailedAttempts: it computes the new count from the instance property and then calls update. Concurrent jobs or failure events can overwrite each other's increments, causing lost updates and delayed or missed auto-disable. Use a DB-level atomic increment (e.g., Model.increment) or a transaction to avoid lost updates.

  • Missing workspace/rpc check
    The code uses explorer.workspace.rpcServer without guarding for explorer.workspace or missing rpcServer. If the include returns no workspace (or rpcServer is null/empty) this will throw and go to the catch block. That may hide the real issue and schedule unnecessary retries. Validate presence of workspace and rpcServer and log/skip if missing.

  • Redundant Timeout
    The job calls withTimeout(provider.fetchLatestBlock()) but ProviderConnector.fetchLatestBlock() already wraps the provider call with withTimeout. Double-wrapping can lead to confusing timeout behavior (two nested timers, unexpected error messages or different timeout values). Consider calling provider.fetchLatestBlock() directly or passing a single explicit timeout.

codeant-ai bot commented Jan 9, 2026

PR Code Suggestions ✨

Category: Logic error
Suggestion: Only the failure counter is reset on recovery, leaving other sync failure metadata stale and inconsistent with the actual sync state

On successful (re)start of syncing you only reset syncFailedAttempts while leaving
other failure-tracking fields like syncDisabledAt, syncDisabledReason, and
nextRecoveryCheckAt untouched, which can result in explorers being marked as
"disabled" or "in backoff" in the database even though syncing has recovered; using
the existing resetSyncState helper keeps the sync failure state consistent.

run/jobs/updateExplorerSyncingProcess.js [85-101]

 else if (explorer.shouldSync && !existingProcess) {
     await pm2.start(explorer.slug, explorer.workspaceId);
-    // Reset failure counter on successful start
+    // Reset all sync failure state on successful start
     if (explorer.syncFailedAttempts > 0) {
-        await explorer.update({ syncFailedAttempts: 0 });
+        await explorer.resetSyncState();
     }
     return 'Process started.';
 }
 else if (explorer.shouldSync && existingProcess && existingProcess.pm2_env.status == 'stopped') {
     await pm2.resume(explorer.slug, explorer.workspaceId);
-    // Reset failure counter on successful resume
+    // Reset all sync failure state on successful resume
     if (explorer.syncFailedAttempts > 0) {
-        await explorer.update({ syncFailedAttempts: 0 });
+        await explorer.resetSyncState();
     }
     return 'Process resumed.';
 }
 
Suggestion importance[1-10]: 10

Why it matters? 🤔: The Explorer model includes a resetSyncState() helper that clears all failure-tracking fields (syncFailedAttempts, syncDisabledAt, syncDisabledReason, nextRecoveryCheckAt). The PR currently only resets syncFailedAttempts on successful recovery which can leave the explorer marked as disabled/backoff in the DB. Using resetSyncState resolves a real logic inconsistency between DB state and actual process status, not just a cosmetic tweak.

Category: Race condition
Suggestion: Non-atomic manual increment of the failure counter can lose updates under concurrent calls and delay or prevent auto-disabling

The sync failure counter is incremented by reading the current value on the instance
and then writing back an incremented value, which is not atomic; if multiple jobs
call this concurrently, increments can be lost and the failure count may never reach
the threshold, preventing or delaying automatic disabling of sync as intended. Using
a database-level atomic increment avoids this race condition.

run/models/explorer.js [333-347]

 /**
  * Increments the sync failure counter and auto-disables if threshold reached.
  * @param {string} [reason='rpc_unreachable'] - Reason for the failure
  * @returns {Promise<{disabled: boolean, attempts: number}>} Result with disable status
  */
 async incrementSyncFailures(reason = 'rpc_unreachable') {
-    const newCount = (this.syncFailedAttempts || 0) + 1;
-    await this.update({ syncFailedAttempts: newCount });
+    const updatedExplorer = await this.increment('syncFailedAttempts');
+    const newCount = updatedExplorer.syncFailedAttempts;
 
     if (newCount >= SYNC_FAILURE_THRESHOLD) {
         await this.autoDisableSync(reason);
         return { disabled: true, attempts: newCount };
     }
     return { disabled: false, attempts: newCount };
 }
 
Suggestion importance[1-10]: 10

Why it matters? 🤔: The suggestion correctly identifies a real race: the current code reads this.syncFailedAttempts in JS, increments, then writes it back — concurrent workers can overwrite each other's increments and cause lost updates so the threshold may never be reached. Sequelize provides an atomic increment (instance.increment / model.increment) that performs the update at the DB level, avoiding that race. This is a functional fix (not cosmetic) and directly tied to the PR changes that introduced sync failure tracking.


codeant-ai bot commented Jan 9, 2026

CodeAnt AI finished reviewing your PR.

Antoine de Chevigné and others added 2 commits January 9, 2026 16:38
- Fix race condition: use atomic increment for syncFailedAttempts
- Add null check for explorer.workspace in updateExplorerSyncingProcess
- Simplify backoff calculation using recoveryAttempts counter
- Add max recovery attempts (10) after which manual intervention required
- Add batching (50 explorers) to syncRecoveryCheck job
- Add random jitter (up to 2 min) to stagger recovery checks
- Remove dead code (resetSyncState method)
- Add recoveryAttempts column to migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
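A minimal sketch of the backoff-with-jitter calculation described in the commit message above; the 10-attempt cap, 2-minute jitter, and interval values come from the commit and PR description, while the function name and shape are assumptions:

const MAX_RECOVERY_ATTEMPTS = 10;
const BACKOFF_MS = [5, 15, 60, 360].map(minutes => minutes * 60 * 1000); // 5m → 15m → 1h → 6h
const MAX_JITTER_MS = 2 * 60 * 1000; // random jitter of up to 2 minutes to stagger recovery checks

// Returns the delay before the next recovery check, or null once manual intervention is required.
function nextRecoveryDelay(recoveryAttempts) {
    if (recoveryAttempts >= MAX_RECOVERY_ATTEMPTS)
        return null;
    const base = BACKOFF_MS[Math.min(recoveryAttempts, BACKOFF_MS.length - 1)];
    return base + Math.floor(Math.random() * MAX_JITTER_MS);
}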
The syncRecoveryCheck job was registered in jobs/index.js and scheduled in
scheduler.js, but was missing from the priorities list which caused a crash
on startup when trying to enqueue to a non-existent queue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@antoinedc antoinedc merged commit 98c9de1 into develop Jan 9, 2026
8 checks passed
@antoinedc antoinedc deleted the feature/auto-disable-sync-on-rpc-failures branch January 9, 2026 16:16

Labels

size:XL This PR changes 500-999 lines, ignoring generated files
