Conversation

@win5923 (Collaborator) commented Nov 10, 2025

Why are these changes needed?

Currently, when users update a RayCluster spec (e.g., change the image), they must either re-create the cluster or manually set spec.suspend to true, wait for all Pods to be deleted, and then set it back to false. This is inconvenient for users deploying with GitOps systems like ArgoCD.

Ref:

Design doc: https://docs.google.com/document/d/1xQLm0-WQWD-FkufxBJYklOJGvVn4RLk0_vPjLD5ax7o/edit?usp=sharing

Changes

  • Add spec.upgradeStrategy field to the RayCluster CRD (see the API sketch after this list)
  • Supports two values:
    • Recreate: during an upgrade, delete all existing Pods before creating new ones.
    • None: no new Pods are created while the strategy is set to None.
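
For orientation, here is a rough sketch of the new API surface. The RayClusterUpgradeType name and the Recreate constant appear in the review discussion below; the struct layout, the None constant name, and the json tag are assumptions about the shape of the field, not a copy of the PR's code.

package v1 // illustrative placement; the real types live in KubeRay's ray.io/v1 API package

type RayClusterUpgradeType string

const (
	// During an upgrade, Recreate deletes all existing Pods before creating new ones.
	RayClusterUpgradeRecreate RayClusterUpgradeType = "Recreate"
	// None leaves existing Pods untouched when the template changes (assumed constant name).
	RayClusterUpgradeNone RayClusterUpgradeType = "None"
)

// RayClusterUpgradeStrategy is the value of spec.upgradeStrategy (assumed struct name).
type RayClusterUpgradeStrategy struct {
	// Type selects the upgrade behavior: Recreate or None.
	Type *RayClusterUpgradeType `json:"type,omitempty"`
}

The YAML example further below shows how the field is set on a RayCluster.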

Implementation

  • Store a hash of HeadGroupSpec.Template on the head Pod and of each workerGroup.Template on its worker Pods at creation time, using the ray.io/pod-template-hash annotation
  • Compare the stored hash with the hash of the current head or worker Pod template to detect changes and recreate all Pods

I only compare HeadGroupSpec.Template and workerGroup.Template because these define the Pod-related configuration. The rest of RayCluster.Spec contains many dynamic and component-specific settings (e.g., autoscaler configs, rayStartParams) that should not trigger a recreate.
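
To make the hash-and-annotate idea concrete, here is a minimal sketch. The helper names, the JSON-plus-FNV hashing, and the package name are illustrative assumptions, not the PR's actual functions.

package controllers // illustrative package name

import (
	"encoding/json"
	"fmt"
	"hash/fnv"

	corev1 "k8s.io/api/core/v1"
)

const podTemplateHashAnnotation = "ray.io/pod-template-hash"

// computePodTemplateHash returns a deterministic hash of a pod template
// (illustrative helper; the PR may hash the template differently).
func computePodTemplateHash(template corev1.PodTemplateSpec) (string, error) {
	data, err := json.Marshal(template)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	_, _ = h.Write(data)
	return fmt.Sprintf("%x", h.Sum32()), nil
}

// annotatePodTemplateHash stores the hash on a Pod at creation time.
func annotatePodTemplateHash(pod *corev1.Pod, template corev1.PodTemplateSpec) error {
	hash, err := computePodTemplateHash(template)
	if err != nil {
		return err
	}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[podTemplateHashAnnotation] = hash
	return nil
}

// podTemplateChanged reports whether the hash stored on an existing Pod
// differs from the hash of the template currently in the RayCluster spec.
func podTemplateChanged(pod *corev1.Pod, template corev1.PodTemplateSpec) (bool, error) {
	want, err := computePodTemplateHash(template)
	if err != nil {
		return false, err
	}
	stored := pod.Annotations[podTemplateHashAnnotation]
	return stored != "" && stored != want, nil
}

With Recreate set, once the stored hash differs from the current template hash for any Pod, the controller deletes all existing Pods and lets the normal reconcile loop create new ones from the updated template.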

Example:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
spec:
  upgradeStrategy:
    type: Recreate
  rayVersion: '2.48.0'

Related issue number

Closes #2534 #3905

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 marked this pull request as draft November 10, 2025 16:24
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch 6 times, most recently from 710166a to d261b0b Compare November 10, 2025 17:11
@win5923 win5923 changed the title [draft] Support recreate pods for RayCluster using RayClusterSpec [draft] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy Nov 10, 2025
@win5923 (Collaborator, Author) commented Nov 10, 2025

Hi @andrewsykim, I followed your previous comments and added a spec.upgradeStrategy API to RayCluster. However, I'm concerned this approach may introduce some issues:

  1. Confusion with existing API: We already have upgradeStrategy for RayService. Adding another upgradeStrategy to RayCluster could be confusing for users and creates unclear separation of concerns.
  2. Breaking RayJob workflows: For RayJob, setting upgradeStrategy=Recreate on the RayCluster would cause pod recreation during job execution, leading to job interruption and loss of running jobs.

Maybe we can just add a feature gate instead of using spec.upgradeStrategy.type field in RayCluster to enable the recreate behavior. WDYT?

@andrewsykim (Member) commented:

Maybe we can just add a feature gate instead of using spec.upgradeStrategy.type field in RayCluster to enable the recreate behavior. WDYT?

Feature gates are used to gate features that are in early development and not ready for wider adoption; they shouldn't be used to change the behavior of RayCluster, because a gate will eventually be on by default (and then forced on).

@andrewsykim (Member) commented:

I think both of those concerns are valid, but I don't think this is a problem with separation of concerns, as RayCluster is a building block for both RayService and RayJob. For the cases you mentioned, we should have validation to ensure the RayCluster upgrade strategy cannot be set when the cluster is used with RayJob or RayService (roughly as sketched below).
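
For illustration, a sketch of the kind of validation this implies, assuming the check can key off the RayCluster's owner references; the function name and signature are hypothetical, and a real check would hook into KubeRay's existing validation path.

package controllers // illustrative package name

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validateUpgradeStrategy rejects an upgrade strategy on RayClusters that are
// owned by a RayJob or RayService, so job runs and serve upgrades are not
// disrupted by pod recreation.
func validateUpgradeStrategy(ownerRefs []metav1.OwnerReference, upgradeStrategySet bool) error {
	if !upgradeStrategySet {
		return nil
	}
	for _, ref := range ownerRefs {
		if ref.Kind == "RayJob" || ref.Kind == "RayService" {
			return fmt.Errorf("spec.upgradeStrategy cannot be set on a RayCluster managed by a %s", ref.Kind)
		}
	}
	return nil
}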

@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch 7 times, most recently from 05b8108 to 7109cf1 Compare November 19, 2025 17:27
@win5923 win5923 changed the title [draft] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy [Feature] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy Nov 19, 2025
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch 2 times, most recently from 3d448e6 to 8bcce91 Compare November 19, 2025 18:26
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from 8bcce91 to bf87764 Compare November 19, 2025 18:28
@win5923 win5923 marked this pull request as ready for review November 19, 2025 18:30
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch 2 times, most recently from c9d35b2 to 8d4c813 Compare November 20, 2025 17:03
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from 99d114b to acd7739 Compare November 22, 2025 10:58
Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch 2 times, most recently from 0b50555 to 73332af Compare December 2, 2025 18:12
Signed-off-by: win5923 <ken89@kimo.com>
@Future-Outlier (Member) left a comment:

need some review help, cc @seanlaii @JiangJiaWei1103 @CheyuWu

@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from dcbfe9a to 67152e9 Compare December 6, 2025 13:58
Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from 67152e9 to 8c82e7b Compare December 6, 2025 13:59
@andrewsykim (Member) commented:

Store a hash of HeadGroupSpec.Template on the head Pod and of each workerGroup.Template on its worker Pods at creation time, using the ray.io/pod-template-hash annotation

Something to be mindful of with this strategy is that adding new fields or slightly changing some property of the template (e.g. changing the default entrypoint) could trigger an unintended recreate when a user upgrades KubeRay. I think we ran into something similar when dealing with upgrade behavior in RayService; see #2320 for reference.

NewCluster RayServiceUpgradeType = "NewCluster"
// No new cluster will be created while the strategy is set to None
None RayServiceUpgradeType = "None"
RayServiceUpgradeNone RayServiceUpgradeType = "None"
Member commented:

also rename two constants above to start with RayService*

@win5923 (Collaborator, Author) replied:

Thanks, updated in f3a9b81.


const (
	// During upgrade, Recreate strategy will delete all existing pods before creating new ones
	Recreate RayClusterUpgradeType = "Recreate"
Contributor commented:

nit:

Suggested change:
- Recreate RayClusterUpgradeType = "Recreate"
+ RayClusterUpgradeRecreate RayClusterUpgradeType = "Recreate"

@win5923 (Collaborator, Author) replied:

Thanks, updated in f3a9b81

@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from 08fe6c5 to f3a9b81 Compare December 7, 2025 06:55
@win5923 (Collaborator, Author) commented Dec 7, 2025

Something to be mindful of with this strategy is that adding new fields or slightly changing some property of the template (e.g. changing the default entrypoint) could trigger an unintended recreate when a user upgrades KubeRay. I think we ran into something similar when dealing with upgrade behavior in RayService; see #2320 for reference.

Thanks for pointing this out. I’ll add the ray.io/kuberay-version annotation to both the head and worker pods as well, so we can detect KubeRay version changes and avoid unintended recreates. I really appreciate your comment.

Updated: I think we don't need to check whether the KubeRay version is the same or not, because we only compare HeadGroupSpec.Template and WorkerGroupSpecs.Template, which are just corev1.PodTemplateSpec. This comparison is independent of the KubeRay operator version.

@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from a59d6f9 to 5f3afb3 Compare December 7, 2025 15:10
@CheyuWu CheyuWu self-requested a review December 7, 2025 16:42
Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com>
Signed-off-by: Jun-Hao Wan <ken89@kimo.com>
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from f2e53cf to 9c06b5f Compare December 8, 2025 13:14
Comment on lines 1141 to 1153
for _, pod := range allPods.Items {
	podVersion := pod.Annotations[utils.KubeRayVersion]
	if podVersion != "" && podVersion != utils.KUBERAY_VERSION {
		logger.Info("Pods have different KubeRay version, updating pod annotations",
			"pod", pod.Name,
			"podVersion", podVersion,
			"currentVersion", utils.KUBERAY_VERSION)
		if err := r.updatePodsAnnotations(ctx, instance, &allPods); err != nil {
			logger.Error(err, "Failed to update pod annotations for KubeRay version change")
		}
		return false
	}
}
Collaborator commented:

I don't quite understand this part. When the KubeRay version is updated, we update the pods' KubeRay version annotation, and then what happens? It seems we do not have any extra handling for a KubeRay version update. If so, should we just remove this? Then I think we would not need updatePodsAnnotations and calculatePodTemplateHash.

@win5923 (Collaborator, Author) replied Dec 10, 2025:

My initial thought was to follow the approach used in RayService: when the KubeRay operator is upgraded, the reconciliation logic entering shouldRecreatePodsForUpgrade would first check whether the KubeRay version had changed. If it differed, we would compute a new pod template hash and update both ray.io/pod-template-hash and ray.io/kuberay-version.

After revisiting it, I realized that what we’re actually comparing is HeadGroupSpec.Template and WorkerGroupSpecs.Template, which are just corev1.PodTemplateSpec. This comparison is independent of the KubeRay Operator version, so I’ll remove it.

cc @andrewsykim

@win5923 (Collaborator, Author) replied Dec 10, 2025:

Reverted the changes: e4cb491

Thanks for the review!!

Signed-off-by: win5923 <ken89@kimo.com>
Signed-off-by: win5923 <ken89@kimo.com>

Development

Successfully merging this pull request may close these issues.

[Feature] Identify and apply changes on ray-cluster

7 participants