-
Notifications
You must be signed in to change notification settings - Fork 664
[Autoscaler] Add validation to require RayCluster v2 when using idleTimeoutSeconds #4162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[Autoscaler] Add validation to require RayCluster v2 when using idleTimeoutSeconds #4162
Conversation
|
LGTM |
|
@ryanaoleary can you review since you added the feature? |
a4bc057 to
ae88086
Compare
ryanaoleary
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I tested some of these cases manually and it looks correct.
|
@alimaazamat the e2e buildkite failure seems related? Can you take a look https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/11598#019a7efb-eeee-48fd-8614-6bd5d5311f4e Triggered a retry incase it's a flake |
@andrewsykim Yeah I reran it already previously and it did fail again. Because this are just unit tests I thought it wouldn't impact e2e? |
|
Actually I think I found the issue: |
|
@andrewsykim |
| WithAutoscalerOptions(rayv1ac.AutoscalerOptions(). | ||
| WithVersion(rayv1.AutoscalerVersionV2)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WithAutoscalerOptions(rayv1ac.AutoscalerOptions(). | |
| WithVersion(rayv1.AutoscalerVersionV2)). |
Are these required? The tests[1] used here already has autoscaler v2 enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even thought tests[1] has v2 enabled as an env variable, but doesn't actually change the spec without
WithAutoscalerOptions(rayv1ac.AutoscalerOptions().
WithVersion(rayv1.AutoscalerVersionV2)).
I wanted it updated in the spec as well for the validation tests I wrote, since in validation.go the other tests reference the spec not env vars e.g. workerGroup.Template.Spec.RestartPolicy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/alimaazamat/kuberay/blob/idleTimeoutSecondsTests/ray-operator/controllers/ray/utils/validation.go#L180 "Please only use the former" spec version not env var
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also get
The RayCluster spec is invalid test-ns-k5rcx/ray-cluster: worker group short-idle-timeout-group has idleTimeoutSeconds set, but autoscaler version is not v2. Please set .spec.autoscalerOptions.version to v2 |
4475ce5 to
3fada64
Compare
jackfrancis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
marosset
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
c9677d9 to
7be6c82
Compare
Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
7be6c82 to
a608d94
Compare
|
@andrewsykim @ryanaoleary @rueian Changed validation to use env var for autoscaler v2 but according to this: #3578 and this comment #3578 (comment) I think we prefer using spec version not the env var. I am opening up a separate PR to change |
win5923
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, validating env var looks better for me. This also supports users of KubeRay versions < 1.4.0, which do not allow using the spec version. Thank you!
ded1e68 to
db7a2c6
Compare
Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com> Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
db7a2c6 to
d76a861
Compare
|
First check if spec version is set with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! cc @rueian for final review.
| if err := validateWorkerGroupIdleTimeout(workerGroup, spec); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Can we put the logic to here?
kuberay/ray-operator/controllers/ray/utils/validation.go
Lines 184 to 195 in 5282a64
if IsAutoscalingV2Enabled(spec) { if spec.HeadGroupSpec.Template.Spec.RestartPolicy != "" && spec.HeadGroupSpec.Template.Spec.RestartPolicy != corev1.RestartPolicyNever { return fmt.Errorf("restartPolicy for head Pod should be Never or unset when using autoscaler V2") } for _, workerGroup := range spec.WorkerGroupSpecs { if workerGroup.Template.Spec.RestartPolicy != "" && workerGroup.Template.Spec.RestartPolicy != corev1.RestartPolicyNever { return fmt.Errorf("restartPolicy for worker group %s should be Never or unset when using autoscaler V2", workerGroup.GroupName) } } } } -
Can we also add validation logic for
AutoscalerOptions.IdleTimeoutSeconds?
kuberay/ray-operator/apis/ray/v1/raycluster_types.go
Lines 189 to 192 in 58c2aad
// IdleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources. // Defaults to 60 (one minute). It is not read by the KubeRay operator but by the Ray autoscaler. // +optional IdleTimeoutSeconds *int32 `json:"idleTimeoutSeconds,omitempty"`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Future-Outlier Why would we put the validation logic in validateWorkerGroupIdleTimeout? Above it we have other validation functions like validateRayGroupLabels and validateRayGroupResources that are structured the same way: where we define those functions seperate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For validation logic for AutoscalerOptions.IdleTimeoutSeconds I agree, let's do that in a seperate PR. This one is scoped out for worker groups.
Why are these changes needed?
V2 Autoscaling allows for configuring idleTimeoutSeconds per worker group, which needs added unit + e2e testing
EDIT: e2e testing already done in #2725 issue should've been closed
Related issue number
#2561
Checks