
Add auto-disable for explorers with unreachable RPCs#437

Merged
antoinedc merged 3 commits into develop from feature/auto-disable-sync-on-rpc-failures on Jan 9, 2026

Conversation

antoinedc (Member) commented on Jan 9, 2026

User description

Summary

  • Adds automatic sync deactivation after 3 consecutive RPC failures or PM2 timeouts
  • Reduces unnecessary PM2 server load from explorers with unreachable RPCs
  • Implements exponential backoff recovery checks (5m → 15m → 1h → 6h)
  • Resets failure tracking when sync is manually re-enabled

Changes

New files:

  • run/migrations/20260109161142-add-sync-failure-tracking.js - Adds tracking columns to explorers table
  • run/jobs/syncRecoveryCheck.js - Periodic recovery check job with exponential backoff
  • run/tests/jobs/syncRecoveryCheck.test.js - Tests for recovery job
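As a rough illustration of the new migration listed above, here is a hedged sketch using Sequelize's queryInterface API; the column names come from the PR description, while types and defaults are assumptions (a recoveryAttempts column was added in a follow-up commit):

'use strict';

// Sketch only: adds the sync failure tracking columns described in this PR.
// Types and defaults are assumptions; the actual migration may differ.
module.exports = {
    async up(queryInterface, Sequelize) {
        await queryInterface.addColumn('explorers', 'syncFailedAttempts', {
            type: Sequelize.INTEGER,
            allowNull: false,
            defaultValue: 0
        });
        await queryInterface.addColumn('explorers', 'syncDisabledAt', { type: Sequelize.DATE });
        await queryInterface.addColumn('explorers', 'syncDisabledReason', { type: Sequelize.STRING });
        await queryInterface.addColumn('explorers', 'nextRecoveryCheckAt', { type: Sequelize.DATE });
    },

    async down(queryInterface) {
        await queryInterface.removeColumn('explorers', 'syncFailedAttempts');
        await queryInterface.removeColumn('explorers', 'syncDisabledAt');
        await queryInterface.removeColumn('explorers', 'syncDisabledReason');
        await queryInterface.removeColumn('explorers', 'nextRecoveryCheckAt');
    }
};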

Modified files:

  • run/models/explorer.js - New fields and helper methods for failure tracking
  • run/jobs/updateExplorerSyncingProcess.js - Track failures and auto-disable
  • run/jobs/index.js - Register new job
  • run/scheduler.js - Schedule recovery job (runs every 5 minutes)
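For the scheduler change, a heavily hedged sketch of how a 5-minute repeatable job could be registered, assuming a Bull-style queue (the actual run/scheduler.js wiring, queue names, and priorities list are not shown in this PR and may differ):

const Queue = require('bull');

// Sketch only: enqueue syncRecoveryCheck as a repeatable job every 5 minutes.
// The queue name and Redis connection handling here are assumptions.
const syncRecoveryCheckQueue = new Queue('syncRecoveryCheck', process.env.REDIS_URL);
syncRecoveryCheckQueue.add({}, { repeat: { cron: '*/5 * * * *' } });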

How it works

  1. Detection: When RPC is unreachable or PM2 times out, increment syncFailedAttempts
  2. Auto-disable: After 3 failures, set shouldSync=false with reason
  3. Recovery: syncRecoveryCheck tests disabled explorers and re-enables when RPC is back
  4. Backoff: Recovery checks use exponential backoff to avoid hammering dead RPCs
  5. Manual reset: Re-enabling sync via UI resets all failure tracking
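A minimal sketch of steps 1–2 above, written as instance helpers on the Explorer model; the method and field names follow the PR description and the review excerpts further down, but the actual run/models/explorer.js implementation may differ:

const { Model } = require('sequelize');

const SYNC_FAILURE_THRESHOLD = 3;

// Sketch only: model init/associations are omitted and assumed to be defined elsewhere in run/models/explorer.js.
class Explorer extends Model {
    // Called when the RPC is unreachable or PM2 times out.
    async incrementSyncFailures(reason = 'rpc_unreachable') {
        // DB-level atomic increment avoids the lost-update race flagged in the review comments below.
        await this.increment('syncFailedAttempts');
        await this.reload();

        if (this.syncFailedAttempts >= SYNC_FAILURE_THRESHOLD) {
            await this.autoDisableSync(reason);
            return { disabled: true, attempts: this.syncFailedAttempts };
        }
        return { disabled: false, attempts: this.syncFailedAttempts };
    }

    // Flips shouldSync off, records the reason, and schedules the first recovery check.
    async autoDisableSync(reason) {
        await this.update({
            shouldSync: false,
            syncDisabledAt: new Date(),
            syncDisabledReason: reason,
            nextRecoveryCheckAt: new Date(Date.now() + 5 * 60 * 1000) // first backoff step: 5 minutes
        });
    }
}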

Test plan

  • Run updateExplorerSyncingProcess tests - 16 passed
  • Run syncRecoveryCheck tests - 5 passed
  • Run all job tests - 228 passed
  • Run migration on staging
  • Monitor logs for auto-disable events

🤖 Generated with Claude Code


CodeAnt-AI Description

Auto-disable explorer sync after repeated RPC or PM2 failures and automatically recover with exponential backoff

What Changed

  • Explorers are automatically disabled for syncing after 3 consecutive RPC unreachable errors or PM2 timeouts; the system records failure counts and a disable reason.
  • When an explorer is auto-disabled, a recovery schedule is created and a new periodic recovery job checks disabled explorers and re-enables syncing when the explorer's RPC responds.
  • Recovery checks use exponential backoff intervals (5m → 15m → 1h → 6h) to avoid frequent retries; each recovery attempt updates the next scheduled check.
  • Manually enabling sync or successfully starting/resuming a sync process resets failure counters and clears the disabled state.
  • Database migration adds columns to track failure count, disable timestamp, disable reason, and next recovery check; the scheduler now runs the recovery job every 5 minutes.
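To make the recovery flow concrete, here is a hedged sketch of what the periodic recovery job could look like. ProviderConnector, the module paths, and the query shapes are assumptions drawn from the review comments; the real run/jobs/syncRecoveryCheck.js (which also gained batching, jitter, and a max-attempts cap in a follow-up commit) differs:

// Assumed imports; actual module paths in the repo may differ.
const { Op } = require('sequelize');
const { Explorer } = require('../models');
const { ProviderConnector } = require('../lib/rpc');

// Backoff schedule described in this PR: 5m → 15m → 1h → 6h (then stays at 6h).
const BACKOFF_MS = [5, 15, 60, 360].map(minutes => minutes * 60 * 1000);

module.exports = async () => {
    // Only consider explorers that were auto-disabled and are due for a recovery check.
    const explorers = await Explorer.findAll({
        where: {
            shouldSync: false,
            syncDisabledAt: { [Op.ne]: null },
            nextRecoveryCheckAt: { [Op.lte]: new Date() }
        },
        include: 'workspace'
    });

    for (const explorer of explorers) {
        try {
            // Probe the RPC; fetchLatestBlock is assumed to reject when the endpoint is unreachable.
            const provider = new ProviderConnector(explorer.workspace.rpcServer);
            await provider.fetchLatestBlock();

            // RPC responded: re-enable syncing and clear the failure tracking state.
            await explorer.update({
                shouldSync: true,
                syncFailedAttempts: 0,
                syncDisabledAt: null,
                syncDisabledReason: null,
                nextRecoveryCheckAt: null
            });
        } catch (_error) {
            // Still unreachable: push the next check out along the backoff schedule.
            const attempts = explorer.recoveryAttempts || 0;
            const delay = BACKOFF_MS[Math.min(attempts, BACKOFF_MS.length - 1)];
            await explorer.update({
                recoveryAttempts: attempts + 1,
                nextRecoveryCheckAt: new Date(Date.now() + delay)
            });
        }
    }
};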

Impact

✅ Fewer PM2 restarts and related server load
✅ Fewer repeated sync attempts against unreachable RPCs
✅ Automatic re-enable when an explorer's RPC recovers

When an explorer's RPC becomes unreachable, the explorerSyncCheck job
keeps trying to start PM2 processes that immediately fail. This creates
unnecessary load on the PM2 server.

This change adds automatic sync deactivation after 3 consecutive failures:

- Track sync failures on Explorer model (syncFailedAttempts, syncDisabledAt,
  syncDisabledReason, nextRecoveryCheckAt)
- Auto-disable shouldSync after 3 consecutive RPC failures or PM2 timeouts
- Add syncRecoveryCheck job that periodically tests disabled explorers
- Use exponential backoff for recovery checks (5m -> 15m -> 1h -> 6h)
- Reset failure tracking when sync is manually re-enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codeant-ai bot commented Jan 9, 2026

CodeAnt AI is reviewing your PR.


codeant-ai bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) on Jan 9, 2026

codeant-ai bot commented Jan 9, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Null access on reset
    The handler immediately uses explorer.slug when data.reset is true without verifying that explorer exists. If data.reset can be set for explorers that do not exist (or if Explorer.findOne returned null), accessing explorer.slug will throw. Ensure explorer is validated before using its properties or handle the reset case differently.

  • Race condition
    incrementSyncFailures performs a non-atomic read-modify-write on syncFailedAttempts: it computes the new count from the instance property and then calls update. Concurrent jobs or failure events can overwrite each other's increments, causing lost updates and delayed or missed auto-disable. Use a DB-level atomic increment (e.g., Model.increment) or a transaction to avoid lost updates.

  • Missing workspace/rpc check
    The code uses explorer.workspace.rpcServer without guarding for explorer.workspace or missing rpcServer. If the include returns no workspace (or rpcServer is null/empty) this will throw and go to the catch block. That may hide the real issue and schedule unnecessary retries. Validate presence of workspace and rpcServer and log/skip if missing.

  • Redundant Timeout
    The job calls withTimeout(provider.fetchLatestBlock()) but ProviderConnector.fetchLatestBlock() already wraps the provider call with withTimeout. Double-wrapping can lead to confusing timeout behavior (two nested timers, unexpected error messages or different timeout values). Consider calling provider.fetchLatestBlock() directly or passing a single explicit timeout.

codeant-ai bot commented Jan 9, 2026

PR Code Suggestions ✨

Category: Logic error
Suggestion: Only the failure counter is reset on recovery, leaving other sync failure metadata stale and inconsistent with the actual sync state

On successful (re)start of syncing you only reset syncFailedAttempts while leaving
other failure-tracking fields like syncDisabledAt, syncDisabledReason, and
nextRecoveryCheckAt untouched, which can result in explorers being marked as
"disabled" or "in backoff" in the database even though syncing has recovered; using
the existing resetSyncState helper keeps the sync failure state consistent.

run/jobs/updateExplorerSyncingProcess.js [85-101]

 else if (explorer.shouldSync && !existingProcess) {
     await pm2.start(explorer.slug, explorer.workspaceId);
-    // Reset failure counter on successful start
+    // Reset all sync failure state on successful start
     if (explorer.syncFailedAttempts > 0) {
-        await explorer.update({ syncFailedAttempts: 0 });
+        await explorer.resetSyncState();
     }
     return 'Process started.';
 }
 else if (explorer.shouldSync && existingProcess && existingProcess.pm2_env.status == 'stopped') {
     await pm2.resume(explorer.slug, explorer.workspaceId);
-    // Reset failure counter on successful resume
+    // Reset all sync failure state on successful resume
     if (explorer.syncFailedAttempts > 0) {
-        await explorer.update({ syncFailedAttempts: 0 });
+        await explorer.resetSyncState();
     }
     return 'Process resumed.';
 }
 
Suggestion importance[1-10]: 10

Why it matters? 🤔: The Explorer model includes a resetSyncState() helper that clears all failure-tracking fields (syncFailedAttempts, syncDisabledAt, syncDisabledReason, nextRecoveryCheckAt). The PR currently only resets syncFailedAttempts on successful recovery which can leave the explorer marked as disabled/backoff in the DB. Using resetSyncState resolves a real logic inconsistency between DB state and actual process status, not just a cosmetic tweak.

Category: Race condition
Suggestion: Non-atomic manual increment of the failure counter can lose updates under concurrent calls and delay or prevent auto-disabling

The sync failure counter is incremented by reading the current value on the instance
and then writing back an incremented value, which is not atomic; if multiple jobs
call this concurrently, increments can be lost and the failure count may never reach
the threshold, preventing or delaying automatic disabling of sync as intended. Using
a database-level atomic increment avoids this race condition.

run/models/explorer.js [333-347]

 /**
  * Increments the sync failure counter and auto-disables if threshold reached.
  * @param {string} [reason='rpc_unreachable'] - Reason for the failure
  * @returns {Promise<{disabled: boolean, attempts: number}>} Result with disable status
  */
 async incrementSyncFailures(reason = 'rpc_unreachable') {
-    const newCount = (this.syncFailedAttempts || 0) + 1;
-    await this.update({ syncFailedAttempts: newCount });
+    const updatedExplorer = await this.increment('syncFailedAttempts');
+    const newCount = updatedExplorer.syncFailedAttempts;
 
     if (newCount >= SYNC_FAILURE_THRESHOLD) {
         await this.autoDisableSync(reason);
         return { disabled: true, attempts: newCount };
     }
     return { disabled: false, attempts: newCount };
 }
 
Suggestion importance[1-10]: 10

Why it matters? 🤔: The suggestion correctly identifies a real race: the current code reads this.syncFailedAttempts in JS, increments, then writes it back — concurrent workers can overwrite each other's increments and cause lost updates so the threshold may never be reached. Sequelize provides an atomic increment (instance.increment / model.increment) that performs the update at the DB level, avoiding that race. This is a functional fix (not cosmetic) and directly tied to the PR changes that introduced sync failure tracking.


codeant-ai bot commented Jan 9, 2026

CodeAnt AI finished reviewing your PR.

Antoine de Chevigné and others added 2 commits January 9, 2026 16:38
- Fix race condition: use atomic increment for syncFailedAttempts
- Add null check for explorer.workspace in updateExplorerSyncingProcess
- Simplify backoff calculation using recoveryAttempts counter
- Add max recovery attempts (10) after which manual intervention required
- Add batching (50 explorers) to syncRecoveryCheck job
- Add random jitter (up to 2 min) to stagger recovery checks
- Remove dead code (resetSyncState method)
- Add recoveryAttempts column to migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
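A minimal sketch of the backoff-with-jitter calculation described in the commit message above; the 10-attempt cap, 2-minute jitter, and interval values come from the commit and PR description, while the function name and shape are assumptions:

const MAX_RECOVERY_ATTEMPTS = 10;
const BACKOFF_MS = [5, 15, 60, 360].map(minutes => minutes * 60 * 1000); // 5m → 15m → 1h → 6h
const MAX_JITTER_MS = 2 * 60 * 1000; // random jitter of up to 2 minutes to stagger recovery checks

// Returns the delay before the next recovery check, or null once manual intervention is required.
function nextRecoveryDelay(recoveryAttempts) {
    if (recoveryAttempts >= MAX_RECOVERY_ATTEMPTS)
        return null;
    const base = BACKOFF_MS[Math.min(recoveryAttempts, BACKOFF_MS.length - 1)];
    return base + Math.floor(Math.random() * MAX_JITTER_MS);
}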
The syncRecoveryCheck job was registered in jobs/index.js and scheduled in
scheduler.js, but was missing from the priorities list which caused a crash
on startup when trying to enqueue to a non-existent queue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@antoinedc antoinedc merged commit 98c9de1 into develop Jan 9, 2026
8 checks passed
@antoinedc antoinedc deleted the feature/auto-disable-sync-on-rpc-failures branch January 9, 2026 16:16

Labels

size:XL This PR changes 500-999 lines, ignoring generated files
