[Feature] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy #4185
Conversation
Hi @andrewsykim, I followed your previous comments and added a spec.upgradeStrategy API to RayCluster. However, I'm concerned this approach may introduce some issues. Maybe we can just add a feature gate instead of using a spec.upgradeStrategy.type field in RayCluster to enable the recreate behavior. WDYT?
Feature gates are used to gate features that are in early development and not ready for wider adoption. They shouldn't be used to change the behavior of RayCluster, because a feature gate will eventually be on by default (and forced on).
I think both of those concerns are valid, but I don't think this is a problem with separation of concerns, as RayCluster is a building block for both RayService and RayJob. For the cases you mentioned, we should have validation to ensure the RayCluster upgrade strategy cannot be set when used with RayJob and RayService.
Future-Outlier left a comment:
need some review help, cc @seanlaii @JiangJiaWei1103 @CheyuWu
Something to be mindful of with this strategy is that adding new fields or slightly changing some property of the template (e.g. changing the default entrypoint) could trigger an unintended recreate when a user upgrades KubeRay. I think we ran into something similar when dealing with upgrade behavior in RayService; see #2320 for reference.
Suggested change:
  NewCluster RayServiceUpgradeType = "NewCluster"
  // No new cluster will be created while the strategy is set to None
- None RayServiceUpgradeType = "None"
+ RayServiceUpgradeNone RayServiceUpgradeType = "None"
also rename two constants above to start with RayService*
Thanks, updated in f3a9b81.
const (
	// During upgrade, Recreate strategy will delete all existing pods before creating new ones
	Recreate RayClusterUpgradeType = "Recreate"
nit:
Suggested change:
- Recreate RayClusterUpgradeType = "Recreate"
+ RayClusterUpgradeRecreate RayClusterUpgradeType = "Recreate"
Thanks, updated in f3a9b81
Updated: I think we don't need to check whether the KubeRayVersion is the same or not, because we only compare the Pod templates (HeadGroupSpec.Template and WorkerGroupSpecs.Template).
for _, pod := range allPods.Items {
	podVersion := pod.Annotations[utils.KubeRayVersion]
	if podVersion != "" && podVersion != utils.KUBERAY_VERSION {
		logger.Info("Pods have different KubeRay version, updating pod annotations",
			"pod", pod.Name,
			"podVersion", podVersion,
			"currentVersion", utils.KUBERAY_VERSION)
		if err := r.updatePodsAnnotations(ctx, instance, &allPods); err != nil {
			logger.Error(err, "Failed to update pod annotations for KubeRay version change")
		}
		return false
	}
}
I don't quite understand this part. When the KubeRayVersion is updated, we update the pods' KubeRay version annotation, but then what happens? It seems like we don't have any extra handling when the KubeRay version updates. If so, should we just remove it? Then I think we also don't need updatePodsAnnotations and calculatePodTemplateHash.
My initial thought was to follow the approach used in RayService:
when the KubeRay Operator is upgraded, the reconciliation logic entering shouldRecreatePodsForUpgrade would first check whether the KubeRayVersion has changed. If it differed, we would compute a new pod template hash and update both ray.io/pod-template-hash and ray.io/kuberay-version.
After revisiting it, I realized that what we’re actually comparing is HeadGroupSpec.Template and WorkerGroupSpecs.Template, which are just corev1.PodTemplateSpec. This comparison is independent of the KubeRay Operator version, so I’ll remove it.
cc @andrewsykim
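For reference, here is a minimal sketch (not the PR's actual code) of the template-hash comparison described above: serialize the Pod template, store the hash in the ray.io/pod-template-hash annotation when the pod is created, and compare it on later reconciles. The helper names hashPodTemplate and templateChanged are hypothetical; only the annotation key comes from this PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// hashPodTemplate serializes a pod template and returns a short hash of it,
// suitable for storing in the ray.io/pod-template-hash annotation.
func hashPodTemplate(tmpl corev1.PodTemplateSpec) (string, error) {
	raw, err := json.Marshal(tmpl)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum[:8]), nil
}

// templateChanged reports whether an existing pod was created from a template
// that differs from the one currently in the RayCluster spec.
func templateChanged(pod corev1.Pod, desired corev1.PodTemplateSpec) (bool, error) {
	want, err := hashPodTemplate(desired)
	if err != nil {
		return false, err
	}
	got := pod.Annotations["ray.io/pod-template-hash"]
	return got != "" && got != want, nil
}

func main() {
	// Example: hash a head group pod template (image tag is illustrative).
	tmpl := corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "ray-head", Image: "rayproject/ray:latest"}},
		},
	}
	hash, err := hashPodTemplate(tmpl)
	if err != nil {
		panic(err)
	}
	fmt.Println("pod template hash:", hash)
}
```

Because only the corev1.PodTemplateSpec is hashed, the comparison stays independent of operator-level settings such as the KubeRay version, which is the point made above.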
Reverted the changes: e4cb491
Thanks for the review!!
Why are these changes needed?
Currently, when users update a RayCluster spec (e.g., update the image), they must re-create the cluster, or manually set spec.suspend to true, wait until all Pods are deleted, and then set it back to false. This is not convenient for users deploying with GitOps systems like ArgoCD.
Ref:
Design doc: https://docs.google.com/document/d/1xQLm0-WQWD-FkufxBJYklOJGvVn4RLk0_vPjLD5ax7o/edit?usp=sharing
Changes
Add a spec.upgradeStrategy field to the RayCluster CRD:
- Recreate: During upgrade, the Recreate strategy will delete all existing pods before creating new ones.
- None: No new pod will be created while the strategy is set to None.
Implementation
Add a hash of HeadGroupSpec.Template to the head pod annotations and of workerGroup.Template to the worker pod annotations during creation, using the ray.io/pod-template-hash annotation.

I only compare HeadGroupSpec.Template and workerGroup.Template because these define the pod-related configurations. The RayCluster.Spec contains many dynamic and component-specific settings (e.g., autoscaler configs, rayStartParams).

Example:
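A minimal manifest sketch of how the new field could be used once this PR lands. Only spec.upgradeStrategy.type with the Recreate/None values comes from this PR; the cluster name, image tag, and the abbreviated headGroupSpec are illustrative.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
spec:
  # New field proposed in this PR: Recreate deletes all existing pods before
  # creating new ones; None creates no new pod while the strategy is set to None.
  upgradeStrategy:
    type: Recreate
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:latest  # changing the image triggers a recreate under the Recreate strategy
```

With type: Recreate, bumping the image (or any other change to the Pod templates) would let the operator delete and recreate the pods, instead of requiring the manual suspend/resume workaround described above.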
Related issue number
Closes #2534 #3905
Checks