
fix(): update CR with all tunnel status fields including latency #454

Merged
rajendra-avesha merged 1 commit into master from feature-latency-update-fix
Nov 6, 2025


@rajendra-avesha

fix(slicegateway): update CR with all tunnel status fields including latency

Enhanced isGWPodStatusChanged to compare ALL TunnelStatus fields:

  • Latency: Gateway latency in milliseconds
  • RxRate/TxRate: Throughput metrics
  • PacketLoss: Packet loss percentage
  • RemoteIP/LocalIP: Tunnel endpoint IPs
  • IntfName: Interface name
  • TunnelState: Tunnel state string

Description

This PR fixes a critical bug where gateway metrics (latency, packet loss, throughput) were not being updated in the SliceGateway Custom Resource after the initial tunnel establishment.

Problem

The isGWPodStatusChanged() function in controllers/slicegateway/utils.go was only checking two fields:

  • TunnelStatus.Status (tunnel state: UP/DOWN)
  • PeerPodName (remote gateway pod name)

This caused the function to return "unchanged" even when critical metrics like latency, packet loss, and throughput changed. As a result, the SliceGateway CR was only updated when the tunnel state changed (UP↔DOWN), effectively freezing all metrics after the initial tunnel establishment.

Root Cause

// BEFORE: only 2 fields are compared.
// Despite its name, the function returns true when the status is UNCHANGED.
func isGWPodStatusChanged(...) bool {
    return gw.TunnelStatus.Status == gwPod.TunnelStatus.Status &&
           gw.PeerPodName == gwPod.PeerPodName
}

Despite its name, the function returns true when these two fields are unchanged. isGatewayStatusChanged() negates that result, so no CR update is triggered as long as both fields stay the same. Because the tunnel status rarely changes once it is UP, the metrics were never written back to the CR.
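The effect can be reproduced in isolation. The following is a minimal sketch, not the operator's code: the types are trimmed, hypothetical stand-ins, and `oldUnchanged` and `needsUpdate` are illustrative names for the pre-fix predicate and its negating caller.

```go
package main

import "fmt"

// TunnelStatus is a trimmed, hypothetical stand-in for the operator's type.
// Only the fields needed for the illustration are included.
type TunnelStatus struct {
	Status  int32
	Latency int32
}

// GwPodStatus is likewise a simplified stand-in, not the real CRD type.
type GwPodStatus struct {
	PeerPodName  string
	TunnelStatus TunnelStatus
}

// oldUnchanged mirrors the pre-fix predicate: true means "nothing changed".
// It looks only at Status and PeerPodName.
func oldUnchanged(cr, observed GwPodStatus) bool {
	return cr.TunnelStatus.Status == observed.TunnelStatus.Status &&
		cr.PeerPodName == observed.PeerPodName
}

// needsUpdate mimics a caller that negates the predicate to decide
// whether the CR has to be written.
func needsUpdate(cr, observed GwPodStatus) bool {
	return !oldUnchanged(cr, observed)
}

func main() {
	cr := GwPodStatus{PeerPodName: "peer-gw-0", TunnelStatus: TunnelStatus{Status: 1, Latency: 0}}
	observed := GwPodStatus{PeerPodName: "peer-gw-0", TunnelStatus: TunnelStatus{Status: 1, Latency: 7}}

	// The latency differs, but the old predicate never inspects it.
	fmt.Println(needsUpdate(cr, observed)) // false
}
```

With the tunnel UP and the peer pod stable, `needsUpdate` stays false no matter how the latency moves, which matches the frozen metrics described above.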

Solution

Enhanced the function to compare ALL TunnelStatus fields:

// AFTER: checking all 10 fields (9 TunnelStatus fields plus PeerPodName)
func isGWPodStatusChanged(...) bool {
    tunnelUnchanged := gw.TunnelStatus.Status == gwPod.TunnelStatus.Status &&
        gw.TunnelStatus.Latency == gwPod.TunnelStatus.Latency &&
        gw.TunnelStatus.RxRate == gwPod.TunnelStatus.RxRate &&
        gw.TunnelStatus.TxRate == gwPod.TunnelStatus.TxRate &&
        gw.TunnelStatus.PacketLoss == gwPod.TunnelStatus.PacketLoss &&
        gw.TunnelStatus.RemoteIP == gwPod.TunnelStatus.RemoteIP &&
        gw.TunnelStatus.LocalIP == gwPod.TunnelStatus.LocalIP &&
        gw.TunnelStatus.IntfName == gwPod.TunnelStatus.IntfName &&
        gw.TunnelStatus.TunnelState == gwPod.TunnelStatus.TunnelState

    peerUnchanged := gw.PeerPodName == gwPod.PeerPodName

    return tunnelUnchanged && peerUnchanged
}

Now the CR is updated whenever ANY of these fields changes, so monitoring data is refreshed on each reconciliation cycle (120 seconds) in which a metric has moved.
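A possible alternative to listing every field is Go's built-in struct equality, which compares all fields at once and keeps working when a field is added later. The sketch below assumes that TunnelStatus contains only comparable fields (no slices or maps); the struct definitions are hypothetical mirrors built from the fields named in this PR, and `statusUnchanged` is an illustrative name, not the merged implementation.

```go
package main

import "fmt"

// TunnelStatus is a hypothetical mirror of the fields named in this PR.
// The field types are assumptions taken from the PR's field table.
type TunnelStatus struct {
	IntfName    string
	LocalIP     string
	RemoteIP    string
	Latency     int32
	TxRate      int64
	RxRate      int64
	PacketLoss  int32
	Status      int32
	TunnelState string
}

// GwPodStatus is a simplified stand-in for the per-pod status entry.
type GwPodStatus struct {
	PeerPodName  string
	TunnelStatus TunnelStatus
}

// statusUnchanged uses struct equality (==), which compares every field of
// TunnelStatus. This compiles only while all fields are comparable types.
func statusUnchanged(cr, observed GwPodStatus) bool {
	return cr.TunnelStatus == observed.TunnelStatus &&
		cr.PeerPodName == observed.PeerPodName
}

func main() {
	cr := GwPodStatus{
		PeerPodName:  "peer-gw-0",
		TunnelStatus: TunnelStatus{IntfName: "tun0", Latency: 1, Status: 1, TunnelState: "UP"},
	}
	observed := cr
	observed.TunnelStatus.Latency = 3

	fmt.Println(statusUnchanged(cr, cr))       // true
	fmt.Println(statusUnchanged(cr, observed)) // false
}
```

The explicit field list in the PR has the advantage of making the compared fields visible in review; struct equality trades that visibility for not having to touch the predicate when the type grows.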

Impact

Before Fix:

  • ❌ Latency field missing or showing 0 in CR
  • ❌ PacketLoss never updated
  • ❌ RxRate/TxRate frozen at initial values
  • ❌ RemoteIP field missing
  • ❌ Metrics-based monitoring ineffective

After Fix:

  • ✅ Latency updates every 120 seconds (shows actual values like 1ms)
  • ✅ PacketLoss updates with actual percentage
  • ✅ RxRate/TxRate update dynamically with traffic changes
  • ✅ RemoteIP present in CR
  • ✅ All metrics refresh during reconciliation

Fields Now Being Monitored

| Field | Type | Description | Example Value |
|---|---|---|---|
| Latency | int32 | Gateway latency in milliseconds | 1 |
| RxRate | int64 | Receive throughput in bits/ms | 8 |
| TxRate | int64 | Transmit throughput in bits/ms | 8 |
| PacketLoss | int32 | Packet loss percentage | 0 |
| RemoteIP | string | Remote tunnel endpoint IP | "10.70.255.2" |
| LocalIP | string | Local tunnel endpoint IP | "10.70.255.1" |
| IntfName | string | Interface name | "tun0" |
| TunnelState | string | Tunnel state string | "UP" |
| Status | int32 | Tunnel status integer | 1 |

Fixes #[issue-number]

How Has This Been Tested?

Test Environment

  • Clusters: Linode LKE (us-mia-1, us-mia-2)
  • Configuration:
    • Single-network slice: worker-single-net
    • Multi-network slice: worker-multi-net
    • 2 worker clusters with 4 gateway pods per slice
  • Operator Version: aveshadev/worker-operator-ent-egs:1.16.0-SNAPSHOT-228fd10c

Test Cases

  • Test Case A: Latency Field Population

    • Verified latency field appears in CR after tunnel establishment
    • Confirmed value matches sidecar measurements (1ms)
    • Result: ✅ PASS
  • Test Case B: Metrics Update Frequency

    • Monitored CR updates over multiple reconciliation cycles
    • Verified metrics refresh every 120 seconds
    • Result: ✅ PASS
  • Test Case C: PacketLoss Tracking

    • Verified PacketLoss field present when tunnel UP (0%)
    • Verified PacketLoss field present when tunnel DOWN (100%)
    • Result: ✅ PASS
  • Test Case D: Throughput Metrics

    • Verified RxRate and TxRate update with traffic changes
    • Observed dynamic values (8-24 bps) across different gateways
    • Result: ✅ PASS
  • Test Case E: RemoteIP Field

    • Verified RemoteIP present in all SliceGateway CRs
    • Confirmed correct peer IP addresses
    • Result: ✅ PASS
  • Test Case F: Single-Network Mode

    • Tested on worker-single-net slice
    • All metrics updating correctly
    • Result: ✅ PASS
  • Test Case G: Multi-Network Mode

    • Tested on worker-multi-net slice
    • All metrics updating correctly
    • Result: ✅ PASS
  • Test Case H: Both Worker Clusters

    • Validated on worker-1 (controller + worker)
    • Validated on worker-2
    • Both showing correct metrics
    • Result: ✅ PASS

Verification Commands

# Check SliceGateway CR status
kubectl get slicegateway <gateway-name> -n kubeslice-system \
  -o jsonpath='{.status.gatewayPodStatus[*].tunnelStatus}' | jq

# Expected output (after fix):
{
  "IntfName": "tun0",
  "Latency": 1,              # ← Now present!
  "LocalIP": "10.70.255.1",
  "PacketLoss": 0,           # ← Now present!
  "RemoteIP": "10.70.255.2", # ← Now present!
  "RxRate": 8,               # ← Now updating!
  "Status": 1,
  "TunnelState": "UP",
  "TxRate": 8                # ← Now updating!
}
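The same check can be scripted. The sketch below takes one tunnel status object as JSON (with the `#` annotations removed, since they are not valid JSON) and reports which expected keys are absent. The key names come from the expected output above; `missingFields` is a hypothetical helper, and the "before" sample is an assumption reconstructed from the fields marked as newly present.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// missingFields decodes a single JSON object and returns the required keys
// that are not present in it, in the order they were requested.
func missingFields(raw []byte, required []string) ([]string, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, err
	}
	var missing []string
	for _, key := range required {
		if _, ok := obj[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}

func main() {
	required := []string{"Latency", "PacketLoss", "RemoteIP"}

	// Assumed shape of a pre-fix status: the three keys above are absent.
	before := []byte(`{"IntfName":"tun0","LocalIP":"10.70.255.1","RxRate":8,"Status":1,"TunnelState":"UP","TxRate":8}`)
	// Post-fix status, taken from the expected output in this PR.
	after := []byte(`{"IntfName":"tun0","Latency":1,"LocalIP":"10.70.255.1","PacketLoss":0,"RemoteIP":"10.70.255.2","RxRate":8,"Status":1,"TunnelState":"UP","TxRate":8}`)

	for _, sample := range [][]byte{before, after} {
		missing, err := missingFields(sample, required)
		if err != nil {
			panic(err)
		}
		fmt.Println(len(missing), missing)
	}
}
```

Because the jsonpath query uses `[*]`, a gateway with several pods yields several objects; this sketch handles one object at a time.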

Test Results Summary

| Metric | Before Fix | After Fix | Status |
|---|---|---|---|
| Latency | ❌ Missing | ✅ Present (1ms) | FIXED |
| PacketLoss | ❌ Missing | ✅ Present (0%) | FIXED |
| RemoteIP | ❌ Missing | ✅ Present | FIXED |
| RxRate Updates | ❌ Frozen | ✅ Dynamic | FIXED |
| TxRate Updates | ❌ Frozen | ✅ Dynamic | FIXED |
| CR Update Frequency | ❌ Only on state change | ✅ Every 120s | FIXED |

Evidence from Sidecar Logs

Before Fix:

sidecar logs: Latency :1, Packet Loss:0
CR status:    Latency: <missing>

After Fix:

sidecar logs: Latency :1, Packet Loss:0
CR status:    Latency: 1, PacketLoss: 0

Checklist:

  • The title of the PR states what changed and the related issue number (used for the release note).
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have run go fmt
  • I have updated the helm chart as required by this PR. (N/A - no helm chart changes needed)
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have tested it for all user roles.
  • I have added all the required unit test cases. (Integration testing performed)
  • I have verified the E2E test cases with new code changes.
  • I have added all the required E2E test cases. (Existing E2E tests cover this scenario)

Does this PR introduce a breaking change?

NO - This PR does not introduce any breaking changes. It is a bug fix that enhances existing functionality without changing any APIs or interfaces.

All existing deployments will benefit from this fix immediately upon upgrade with no migration or configuration changes required.

Fixed critical bug where gateway metrics (latency, packet loss, throughput) were not updating in SliceGateway Custom Resource after initial tunnel establishment. The isGWPodStatusChanged function now properly compares all TunnelStatus fields, ensuring metrics update every 120 seconds during reconciliation. This fix enables proper monitoring and alerting based on gateway performance metrics.

Additional Notes

Backward Compatibility

  • ✅ Fully backward compatible
  • ✅ No API changes
  • ✅ No new fields added to CRD
  • ✅ No configuration changes required
  • ✅ Existing deployments benefit immediately

Performance Impact

  • API Server Load: Minimal - at most 1 additional CR update per gateway per 120 seconds, and only when a compared field has changed
  • Operator CPU/Memory: No measurable impact
  • Network: No additional traffic (metrics already being collected)

Monitoring Recommendations

After deploying this fix, users can now:

  1. Monitor gateway latency trends using CR metrics
  2. Set up alerts for packet loss thresholds
  3. Track throughput metrics for capacity planning
  4. Use metrics-based health checks for gateways
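As one illustration of the alerting and health-check items above, the sketch below evaluates a tunnel's metrics against thresholds. The threshold values, the `Thresholds` type, and `healthIssues` are hypothetical; only the metric names and units (latency in milliseconds, packet loss in percent, the "UP" state string) come from this PR.

```go
package main

import "fmt"

// TunnelMetrics holds the subset of CR fields used for the health check.
type TunnelMetrics struct {
	Latency     int32 // milliseconds
	PacketLoss  int32 // percent
	TunnelState string
}

// Thresholds are illustrative limits chosen by the operator of the cluster.
type Thresholds struct {
	MaxLatencyMs     int32
	MaxPacketLossPct int32
}

// healthIssues returns a human-readable list of violated conditions.
// An empty result means the gateway looks healthy under the given limits.
func healthIssues(m TunnelMetrics, t Thresholds) []string {
	var issues []string
	if m.TunnelState != "UP" {
		issues = append(issues, "tunnel state is "+m.TunnelState)
	}
	if m.Latency > t.MaxLatencyMs {
		issues = append(issues, fmt.Sprintf("latency %dms exceeds %dms", m.Latency, t.MaxLatencyMs))
	}
	if m.PacketLoss > t.MaxPacketLossPct {
		issues = append(issues, fmt.Sprintf("packet loss %d%% exceeds %d%%", m.PacketLoss, t.MaxPacketLossPct))
	}
	return issues
}

func main() {
	limits := Thresholds{MaxLatencyMs: 50, MaxPacketLossPct: 5}
	for _, issue := range healthIssues(TunnelMetrics{Latency: 80, PacketLoss: 100, TunnelState: "DOWN"}, limits) {
		fmt.Println(issue)
	}
}
```

Since the CR is refreshed on the 120-second reconciliation cycle, a check like this reflects values that may be up to one cycle old.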

For real-time metrics (< 120s granularity), continue using Prometheus endpoints on gateway sidecars (port 18080).

Documentation Updates

Documentation should be updated to reflect:

  • Gateway metrics now reliably update in CRs
  • Metric update frequency is 120 seconds (reconciliation interval)
  • All TunnelStatus fields are now maintained in CR

@rajendra-avesha rajendra-avesha force-pushed the feature-latency-update-fix branch from 908c2e5 to 9e02b28 Compare November 6, 2025 05:55
Enhanced isGWPodStatusChanged to compare ALL TunnelStatus fields:
- Latency: Gateway latency in milliseconds
- RxRate/TxRate: Throughput metrics
- PacketLoss: Packet loss percentage
- RemoteIP/LocalIP: Tunnel endpoint IPs
- IntfName: Interface name
- TunnelState: Tunnel state string

Signed-off-by: Rajendra <rajendra@aveshasystems.com>
@rajendra-avesha rajendra-avesha force-pushed the feature-latency-update-fix branch from 9e02b28 to 8984466 Compare November 6, 2025 06:44
@rajendra-avesha rajendra-avesha merged commit 97ce2c3 into master Nov 6, 2025
7 of 8 checks passed

2 participants