
feat(RHOAIENG-26487): Cluster lifecycling via RayJob #873

Open · wants to merge 1 commit into base: ray-jobs-feature

Conversation

@chipspeak (Contributor) commented Jul 31, 2025

Issue link

Jira

What changes have been made

Support has been added for submitting a RayJob that creates its own RayCluster and manages that cluster's lifecycle.

Verification steps

Prerequisites

  • Build a CodeFlare SDK whl file based on this branch.
  • Disable Kueue in your RHOAI cluster.
  • Log in to OpenShift on your local machine via oc login.
  • Create a local Jupyter notebook in advance, to paste the cells below into.

Steps

  1. Run poetry build.
  2. Copy the below as a cell into the Jupyter notebook and execute it:
# This obviously presumes the location of your whl file. Adjust as needed.
%pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl --force-reinstall
  3. Once the install is complete, restart your notebook's kernel.
  4. Next, copy the below into a cell and execute it (ensuring the names are within the 63-character limit):
from codeflare_sdk import RayJob, ClusterConfiguration

# Create cluster configuration for auto-creation
cluster_config = ClusterConfiguration(
    name="",  # Will auto-generate as "{job_name}-cluster"
    namespace="rhods-notebooks",
    head_cpu_requests='1',
    head_cpu_limits='2',
    head_memory_requests=4,
    head_memory_limits=5,
    head_extended_resource_requests={'nvidia.com/gpu':0},
    worker_extended_resource_requests={'nvidia.com/gpu':0},
    num_workers=1,
    worker_cpu_requests='1',
    worker_cpu_limits='2',
    worker_memory_requests=3,
    worker_memory_limits=4,
)

# Create RayJob with embedded cluster - will auto-create and manage cluster lifecycle
ray_job = RayJob(
    job_name="test-lifecycle",
    cluster_config=cluster_config,  # This triggers auto-cluster creation
    namespace="rhods-notebooks",
    entrypoint="python -c 'import time; print(\"Job starting...\"); time.sleep(15); print(\"Job completed!\")'",
    shutdown_after_job_finishes=True,  # Auto-cleanup cluster after job finishes
    ttl_seconds_after_finished=30,     # Wait 30s after job completion before cleanup
)

ray_job.submit()

print(f"RayJob '{ray_job.name}' configured to create cluster '{ray_job.cluster_name}'")
  5. If you open your namespace and check the pods, you should observe both a job and a cluster being created. You can verify the job status by running:
ray_job.status()
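As a hands-off alternative, the status check can be polled in a loop. A minimal sketch (the exact return value of status() may vary by SDK build, so it is simply printed here):

import time

# Poll the job a few times while it runs; ray_job comes from the cell above.
for _ in range(10):
    print(ray_job.status())
    time.sleep(10)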
  6. Once the job has completed, you should see the cluster pods terminate after the 30-second TTL set on the RayJob.
  7. Check the logs of the RayJob; you should see the success messages printed by the entrypoint.
  8. Delete the RayJob.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

openshift-ci bot requested review from dimakis and pawelpaszki on July 31, 2025 16:26
openshift-ci bot commented Jul 31, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kryanbeane for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@chipspeak (Contributor, Author) commented:

Supporting screenshots below:
[Screenshots: 2025-07-31 at 17:13:42, 17:14:04, and 17:15:03]

codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.27%. Comparing base (e2fc98b) to head (9a01141).

Additional details and impacted files
@@                 Coverage Diff                  @@
##           ray-jobs-feature     #873      +/-   ##
====================================================
+ Coverage             93.06%   93.27%   +0.21%     
====================================================
  Files                    28       28              
  Lines                  1513     1561      +48     
====================================================
+ Hits                   1408     1456      +48     
  Misses                  105      105              


Signed-off-by: Pat O'Connor <paoconno@redhat.com>
@chipspeak (Contributor, Author) commented:

/hold

openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Jul 31, 2025
@chipspeak (Contributor, Author) commented:

/retest

@laurafitzgerald (Contributor) commented Aug 1, 2025

I've verified this change works as described. I still need to do a code review, but if someone else has time to do that, please work away.

@kryanbeane (Contributor) left a comment

Some requested changes for reference, @chipspeak. I also left some reminders for myself for when I finish this PR off next week.

cc: @LilyLinh

if cluster_config is not None:
    # Ensure cluster config has the same namespace as the job
    if cluster_config.namespace is None:
        cluster_config.namespace = namespace

If we are going to use ClusterConfiguration for creating the rayClusterSpec, we should either:

  • not take a namespace in the ClusterConfig, as KubeRay will create the cluster in the same namespace as the RayJob,

or

  • remove the requirement for k8s_namespace on the new RayJob object.

Let's put the namespace in just one place and let the RayCluster inherit it from the job. I will add this next week when updating this PR.
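A minimal sketch of that single-source-of-truth approach, extending the snippet above (illustrative only, not the merged implementation):

if cluster_config is not None:
    if cluster_config.namespace is None:
        # Single source of truth: the cluster inherits the job's namespace.
        cluster_config.namespace = namespace
    elif cluster_config.namespace != namespace:
        # Fail fast on a mismatch rather than silently creating the cluster
        # somewhere the RayJob cannot manage it.
        raise ValueError(
            f"cluster_config.namespace '{cluster_config.namespace}' does not "
            f"match the RayJob namespace '{namespace}'"
        )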

"""
# Validate required parameters
if not self.entrypoint:

todo: add validation for runtime_env

nit: add helper functions to validate the shape etc. of runtime_env and entrypoint
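A sketch of what such helpers might look like (hypothetical names; runtime_env is assumed to be a dict-like mapping, as in Ray runtime environments):

def _validate_entrypoint(entrypoint) -> None:
    # Hypothetical helper: the entrypoint must be a non-empty command string.
    if not isinstance(entrypoint, str) or not entrypoint.strip():
        raise ValueError("entrypoint must be a non-empty string")


def _validate_runtime_env(runtime_env) -> None:
    # Hypothetical helper: Ray runtime environments are dict-like mappings
    # (e.g. {"pip": ["requests"]}), so reject anything else up front.
    if runtime_env is not None and not isinstance(runtime_env, dict):
        raise TypeError("runtime_env must be a dict if provided")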

temp_config.write_to_file = False

# Create a minimal Cluster object for the build process
from ..cluster.cluster import Cluster

Should we move in-function imports to the top of the file? Unsure of the best Python practice there.

# Create a minimal Cluster object for the build process
from ..cluster.cluster import Cluster

temp_cluster = Cluster.__new__(Cluster) # Create without calling __init__

Weird syntax here, although it might just be me. I'll look into this.
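For reference, Cluster.__new__(Cluster) allocates a bare instance without running __init__, so no constructor validation or side effects fire, and any attribute the build step touches must then be assigned by hand (the attribute name below is illustrative):

# Allocate the instance without running __init__, then attach only what
# the build process needs.
temp_cluster = Cluster.__new__(Cluster)
temp_cluster.config = temp_config  # illustrative manual attribute assignment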


# Extract just the RayCluster spec - RayJob CRD doesn't support metadata in rayClusterSpec
# Note: CodeFlare Operator should still create dashboard routes for the RayCluster
ray_cluster_spec = ray_cluster_dict["spec"]

Propagate the metadata from the RayJob here. Anything we need for the cluster, add to the RayJob.
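One hedged way to do that propagation, assuming the RayJob's metadata dict is in scope as rayjob_metadata (illustrative, not the merged code; the pod-template field names follow the RayCluster CRD):

# rayClusterSpec carries no metadata of its own, so push the job's labels
# down into the head and worker pod templates instead.
ray_cluster_spec = ray_cluster_dict["spec"]
job_labels = rayjob_metadata.get("labels", {})  # hypothetical source dict

pod_templates = [ray_cluster_spec["headGroupSpec"]["template"]]
pod_templates += [
    group["template"] for group in ray_cluster_spec.get("workerGroupSpecs", [])
]
for template in pod_templates:
    template.setdefault("metadata", {}).setdefault("labels", {}).update(job_labels)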
