Ray worker environments
This page is aimed primarily at the platform team operating Ray clusters for checkmaite job submission.
The short version is:
- checkmaite does not build worker environments for you,
- the platform owns the worker image and cluster spec,
- and the job backend can only supply a Ray runtime_env overlay at connection time.
Both "ray" and "ray-simple" use Ray workers and the same runtime_env mechanics. The default "ray" job backend also runs registry/controller actors, while "ray-simple" submits direct Ray tasks from one driver.
Two practical modes
Local development mode
For local development, Ray workers run from the developer's current Python environment. Use "ray" when you want registry-backed reattach behavior, or "ray-simple" when a process-local backend that submits direct Ray tasks is enough.
configure_job_backend(
    "ray",
    address="local",
    idempotency_scope="local-dev",
    analytics_store={"backend": "parquet", "uri": "./analytics_store"},
)
This is the lightest-weight setup and is what the walkthrough notebook demonstrates.
Platform / cluster mode
For shared infrastructure, workers should start from a platform-managed base image.
That image should already contain the heavy, stable parts of the environment:
- Python
- Ray
- checkmaite
- CUDA / PyTorch when GPUs are involved
- system libraries needed by capabilities and model code
- storage connectors needed by your deployment (for example s3fs, gcsfs, adlfs)
Recommended Docker model
The current code is best thought of as a platform image + Ray overlay model.
Base image responsibilities
The base image should provide everything needed for workers to import and execute capability code reliably.
A practical layering strategy is:
- base OS and security hardening,
- Python and Ray,
- CUDA / PyTorch stack if needed,
- pinned checkmaite version and its heavy dependencies,
- storage and platform integration libraries.
That keeps the worker startup path predictable and avoids reinstalling the expensive parts of the environment on every task.
Ray runtime_env responsibilities
Use runtime_env for smaller, faster-changing overlays such as:
- environment variables,
- a working_dir or py_modules bundle for iterative code updates,
- small supplemental Python packages.
Example:
configure_job_backend(
    "ray",
    address="ray://cluster-head:10001",
    idempotency_scope="team-a-prod-evals",
    runtime_env={
        "working_dir": ".",
        "pip": ["my-small-lib==0.3.1"],
        "env_vars": {
            "MODEL_REGISTRY_URL": "https://registry.internal",
        },
    },
    analytics_store={
        "backend": "parquet",
        "uri": "s3://team-checkmaite/analytics-store",
        "storage_options": {"anon": False},
    },
)
How this maps to the current code
Both Ray job backends accept:
- Ray connection and environment options through configure_job_backend(..., **kwargs)
- analytics-store configuration through the explicit analytics_store=... argument
Those concerns are separate on purpose:
- runtime_env controls how Ray workers are prepared,
- analytics_store tells workers where durable run data should be written.
Platform-team checklist
For a production cluster, make sure workers can:
- Import the same code the client expects
  - capability classes must be importable on workers,
  - run models must deserialize correctly,
  - version skew between client and workers should be avoided.
- Access input data and models
  - dataset URIs must resolve from worker nodes,
  - model artifacts must be reachable from worker nodes,
  - credentials must be present in the worker environment.
- Access the analytics store
  - workers must be able to write to the configured store URI,
  - the client must also be able to read from that same durable location later.
- Expose the right compute resources
  - CPU and GPU resources must be visible to Ray,
  - and the cluster should be sized for the expected capability mix.
The default "ray" job backend also needs the worker image to import the registry/controller code. The "ray-simple" job backend only needs the worker task code and submitted capability dependencies.
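The import and credential items in the checklist can be turned into a small preflight check. The sketch below is not part of checkmaite: the function name preflight and its arguments are hypothetical, and the lists of modules and environment variables to check are something the platform team would supply for its own deployment.

```python
import importlib.util
import os


def preflight(required_modules, required_env_vars, environ=None):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: the module and variable names are supplied by the
    caller and are not defined by checkmaite or Ray.
    """
    environ = os.environ if environ is None else environ
    problems = []
    for name in required_modules:
        try:
            # find_spec returns None when the module cannot be located.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is missing or broken.
            spec = None
        if spec is None:
            problems.append(f"module not importable: {name}")
    for var in required_env_vars:
        # Treat an unset or empty variable as missing.
        if not environ.get(var):
            problems.append(f"environment variable not set: {var}")
    return problems


if __name__ == "__main__":
    for problem in preflight(["ray", "checkmaite"], ["AWS_REGION"]):
        print(problem)
```

Because the check only reports on the process it runs in, it is most useful when executed on a worker node, for example inside a Ray task or as part of the image build, rather than on the client.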
Example: object-store analytics store
The current jobs analytics-store configuration supports the Parquet backend with a URI and optional storage options.
configure_job_backend(
    "ray",
    address="ray://cluster-head:10001",
    idempotency_scope="team-a-prod-evals",
    analytics_store={
        "backend": "parquet",
        "uri": "s3://team-checkmaite/results",
        "storage_options": {
            "anon": False,
        },
    },
    runtime_env={
        "env_vars": {
            "AWS_REGION": "us-east-1",
        }
    },
)
For this to work in practice:
- workers need credentials that can write to that bucket,
- the client needs credentials that can later read the same bucket,
- and the worker image must include the storage dependencies required by the deployment.
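The last requirement can be checked mechanically from the store URI. The following is a minimal sketch that assumes the deployment resolves store URIs through the fsspec-style connectors named earlier (s3fs, gcsfs, adlfs). The scheme-to-package table and both function names are illustrative assumptions, not part of checkmaite.

```python
import importlib.util
from urllib.parse import urlparse

# Assumed mapping from URI scheme to the connector package that serves it.
CONNECTOR_FOR_SCHEME = {
    "s3": "s3fs",
    "gs": "gcsfs",
    "gcs": "gcsfs",
    "abfs": "adlfs",
    "az": "adlfs",
}


def required_connector(uri):
    """Return the connector package a store URI needs, or None for local paths."""
    scheme = urlparse(uri).scheme
    if scheme in ("", "file"):
        return None
    if scheme not in CONNECTOR_FOR_SCHEME:
        raise ValueError(f"no known connector for scheme {scheme!r}")
    return CONNECTOR_FOR_SCHEME[scheme]


def connector_missing(uri):
    """Return True when the URI needs a connector that is not installed here."""
    package = required_connector(uri)
    return package is not None and importlib.util.find_spec(package) is None
```

Here required_connector("s3://team-checkmaite/results") evaluates to "s3fs", while a local path such as "./analytics_store" needs no connector. An installed connector says nothing about credentials, so write access to the bucket still has to be verified separately.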
Practical guidance
Prefer heavy dependencies in the image
Put large and slow-moving dependencies in the image:
- checkmaite
- PyTorch / CUDA
- large model-serving dependencies
- storage connectors used everywhere
Use runtime_env for deltas, not full environments
Ray can install packages via runtime_env["pip"], but using that for entire heavyweight environments increases cold-start time and operational variability.
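One way to enforce this is to lint the overlay before it is passed to the job backend. The sketch below is an illustration under stated assumptions: the HEAVY_PACKAGES set, the max_pip_packages threshold, and the function name overlay_problems are hypothetical platform policy, and only the list form of runtime_env["pip"] is handled.

```python
import re

# Hypothetical policy: packages the platform expects to live in the base image.
HEAVY_PACKAGES = {"torch", "torchvision", "tensorflow", "checkmaite", "ray"}


def overlay_problems(runtime_env, max_pip_packages=5):
    """Return reasons why a runtime_env looks like a full environment, not a delta.

    Assumes runtime_env["pip"] is a list of requirement strings.
    """
    problems = []
    pip = runtime_env.get("pip", [])
    if len(pip) > max_pip_packages:
        problems.append(f"too many pip packages: {len(pip)} > {max_pip_packages}")
    for requirement in pip:
        # Keep the distribution name in front of any specifier, extras, or marker.
        name = re.split(r"[<>=!~\[; ]", requirement.strip(), maxsplit=1)[0].lower()
        if name in HEAVY_PACKAGES:
            problems.append(f"heavy package belongs in the image: {name}")
        if "==" not in requirement:
            problems.append(f"unpinned requirement: {requirement}")
    return problems
```

An overlay such as {"pip": ["my-small-lib==0.3.1"]} produces no findings, while a requirement naming a package from the heavy set, or one without an exact pin, is reported.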
Pin versions across client and worker
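The client serializes capability objects and expects workers to import compatible code. Loose versioning can create subtle failures, for example objects that fail to deserialize on a worker or capability classes that resolve to a different implementation than the one the client used. Treat the worker image, the client environment, and any runtime_env overlay as one versioned deployment unit.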
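Version skew can be detected by collecting installed versions on both sides and comparing them. This is a minimal sketch using the standard library's importlib.metadata. The function names are hypothetical, and how the worker-side mapping reaches the client (a Ray task, an image label, a release manifest) is left to the platform.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None when absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_skew(client, worker):
    """Return {package: (client_version, worker_version)} for every mismatch."""
    return {
        name: (client.get(name), worker.get(name))
        for name in sorted(set(client) | set(worker))
        if client.get(name) != worker.get(name)
    }
```

Calling installed_versions with the same package list on the client and on a worker, then passing both results to version_skew, yields an empty dict when the two environments agree on every listed package.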
Current limitations
The current job backend deliberately stops short of becoming a packaging system.
It does not:
- build Docker images,
- publish environments,
- manage lockfiles for the platform,
- or guarantee cross-cluster compatibility automatically.
That work belongs in platform tooling, cluster configuration, and release discipline.