Kubernetes and KubeRay deployment notes
This page collects Kubernetes-specific guidance for running checkmaite job
submission on KubeRay. General worker image and Ray runtime_env guidance lives
in Worker environments.
The exact cluster YAML belongs to the platform repository, not to checkmaite:
the platform image and RayCluster definition live outside checkmaite, and the
job backend connects to that cluster and submits work into it.
KubeRay-style deployment model
A typical RayCluster separates the head pod from one or more worker groups:
apiVersion: ray.io/v1
kind: RayCluster
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: registry.example.com/checkmaite-ray:2026-04-14
  workerGroupSpecs:
    - groupName: cpu-workers
      replicas: 2
      template:
        spec:
          containers:
            - name: ray-worker
              image: registry.example.com/checkmaite-ray:2026-04-14
              resources:
                limits:
                  cpu: "8"
                  memory: "32Gi"
    - groupName: gpu-workers
      replicas: 1
      template:
        spec:
          containers:
            - name: ray-worker
              image: registry.example.com/checkmaite-ray-gpu:2026-04-14
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "8"
                  memory: "64Gi"
Use deployment-specific images, resources, node selectors, tolerations, secrets, and autoscaling settings in your platform configuration.
Detached actors and scale-down
The default ray backend uses a detached registry actor and detached per-job
controller actors. Detached actors can remain alive after the submitting notebook
or driver exits, so they affect scheduling, autoscaling, pod placement, and
cleanup.
A pod that hosts a retained terminal controller actor generally cannot scale down until that actor is killed or forgotten. The registry actor is intentionally long-lived; if it lands on an autoscaled worker pod, that pod may stay alive for as long as the registry exists.
For Kubernetes deployments, consider aggressive terminal-controller cleanup, for
example controller_retention_s=0.0 and max_retained_terminal_controllers=0, when
clients do not need to reattach through the controller after the terminal state
is committed. Submit-triggered sweeps alone do not give reliable idle
scale-down: if the cluster goes quiet after jobs finish, no further submit
arrives to trigger a sweep.
Head node placement
Ray head pods carry extra memory and control-plane pressure from GCS, the
dashboard, and cluster services. Unless you intentionally use the head for
lightweight control-plane actors, set num-cpus: "0" in the head's rayStartParams
so that nonzero-CPU user tasks and actors do not land there.
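To show where the head setting described above lives in a RayCluster manifest, here is a minimal sketch that patches a manifest held as a Python dictionary. The helper name with_zero_cpu_head and the sample manifest are illustrative and not part of checkmaite or KubeRay; the only assumption about KubeRay is that num-cpus is set under the head group's rayStartParams.

```python
import copy
from typing import Any, Dict


def with_zero_cpu_head(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a RayCluster manifest whose head advertises zero CPUs.

    Hypothetical helper for illustration; it is not a checkmaite or KubeRay API.
    """
    patched = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    head = patched.setdefault("spec", {}).setdefault("headGroupSpec", {})
    # rayStartParams values are strings because they are passed to `ray start`.
    head.setdefault("rayStartParams", {})["num-cpus"] = "0"
    return patched


# Illustrative manifest fragment; a real one also carries the pod template.
cluster = {
    "apiVersion": "ray.io/v1",
    "kind": "RayCluster",
    "spec": {"headGroupSpec": {"rayStartParams": {"dashboard-host": "0.0.0.0"}}},
}
patched = with_zero_cpu_head(cluster)
```

The same effect is normally achieved by writing the key directly in the platform repository's YAML; the helper only makes the target location explicit.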
Controller actors should normally reserve a small nonzero CPU amount, such as
the default controller_num_cpus=0.01, or use a custom placement resource. Avoid
controller_num_cpus=0.0 in production Kubernetes unless placement is otherwise
controlled: a zero-CPU actor needs no CPU capacity, so Ray can still schedule it
on a head configured with num-cpus: "0".
A clean production layout often dedicates a small control-plane worker group to the registry actor, while normal worker groups run controller actors and capability tasks.
Registry actor placement and resources
The registry actor should have an explicit small resource reservation or custom
resource placement in production KubeRay deployments. Use registry_num_cpus,
registry_memory, and registry_resources to make placement explicit.
For example, a custom resource such as {"checkmaite-control-plane": 1} can
force the registry onto a dedicated control-plane worker group. Avoid allowing
the registry to land on arbitrary autoscaled worker groups if that would prevent
scale-down.
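The placement rule above can be checked before deployment. The sketch below assumes you can list, per worker group, the custom resources each pod advertises; the group names and the helper groups_able_to_host are hypothetical. It reports which groups could host an actor that requests the given custom resources.

```python
from typing import Dict, List

ResourceMap = Dict[str, float]


def groups_able_to_host(requested: ResourceMap, groups: Dict[str, ResourceMap]) -> List[str]:
    """Return the worker groups whose per-pod custom resources cover the request."""
    return [
        name
        for name, advertised in groups.items()
        if all(advertised.get(key, 0.0) >= amount for key, amount in requested.items())
    ]


# Resource name taken from the example above; the group layout is an assumption.
registry_resources = {"checkmaite-control-plane": 1}
worker_groups = {
    "control-plane": {"checkmaite-control-plane": 1},  # hypothetical dedicated group
    "cpu-workers": {},
    "gpu-workers": {},
}

eligible = groups_able_to_host(registry_resources, worker_groups)
```

If the result is empty, no pod advertises the resource and Ray would leave the registry actor pending. If it lists an autoscaled worker group, the registry could pin that group's pod and block scale-down.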
The registry remains a single serialized coordination point, so keep records
small and list operations bounded. registry_max_pending_calls defaults to
1024 to cap queued registry calls so many notebooks or clients do not build an
unbounded actor-call queue. controller_max_pending_calls defaults to 64 and
similarly caps queued calls on per-job controller actors. If those queues fill,
client calls raise BackpressureError; retry with exponential backoff and jitter
or tune the limits for your expected burst size. Passing None opts back into
Ray's unbounded pending-call behavior.
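The retry advice above can be sketched as a small client-side wrapper. This is a minimal illustration, not checkmaite's own retry logic: the BackpressureError class defined here is a stand-in for the real exception (its import path is not given on this page), and call_with_backoff, its parameters, and its default values are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BackpressureError(Exception):
    """Stand-in for the error named above; import the real class in practice."""


def call_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 6,
    base_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call fn, retrying on BackpressureError with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except BackpressureError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential ceiling: base * 2^attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Full jitter: wait a uniform random time in [0, ceiling] so that
            # many clients do not retry in lockstep.
            sleep(rng.uniform(0.0, ceiling))
    raise AssertionError("unreachable")
```

A caller would wrap a single registry or controller call, for example call_with_backoff(lambda: client.list_jobs()), where client.list_jobs is a placeholder for whatever call raised the error. Injecting sleep and rng keeps the wrapper easy to test without real delays.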
Durability boundary and workload fit
Detached actors survive notebook or driver termination, but they do not survive RayCluster deletion, full cluster recreation, or loss of actor memory. The in-memory registry is sufficient for reattach across client restarts while the RayCluster is alive. It is not durable job history across RayCluster replacement.
A production Kubernetes architecture that needs durable history should treat detached actors as the live control plane, an external database or object store as the durable source of truth, and the Ray object store as the transient data plane.
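As a rough illustration of that split, the sketch below writes terminal job records to a store that does not depend on actor memory. Everything here is an assumption made for the example: the class DurableJobHistory, its schema, and the status strings are hypothetical, and SQLite stands in for an external database so the code stays self-contained. A pod-local SQLite file would not outlive the cluster either, so a real deployment would point this at a networked database or object store.

```python
import json
import sqlite3
import time
from typing import Any, Dict, Optional


class DurableJobHistory:
    """Hypothetical durable record of terminal job states, kept outside Ray."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS job_history ("
            "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, "
            "finished_at REAL NOT NULL, summary_json TEXT NOT NULL)"
        )
        self._db.commit()

    def record_terminal(self, job_id: str, status: str, summary: Dict[str, Any]) -> None:
        # Keyed on job_id, so a retried write replaces the row instead of duplicating it.
        self._db.execute(
            "INSERT OR REPLACE INTO job_history VALUES (?, ?, ?, ?)",
            (job_id, status, time.time(), json.dumps(summary)),
        )
        self._db.commit()

    def get(self, job_id: str) -> Optional[Dict[str, Any]]:
        row = self._db.execute(
            "SELECT status, finished_at, summary_json FROM job_history WHERE job_id = ?",
            (job_id,),
        ).fetchone()
        if row is None:
            return None
        status, finished_at, summary_json = row
        return {
            "job_id": job_id,
            "status": status,
            "finished_at": finished_at,
            "summary": json.loads(summary_json),
        }


history = DurableJobHistory()
history.record_terminal("job-1", "SUCCEEDED", {"artifacts": 2})
record = history.get("job-1")
```

The record stays small and holds references and summaries only; large results belong in the object store or artifact storage, in line with keeping the Ray object store as the transient data plane.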
The per-job detached-controller design is intended for long-running capability jobs. Workloads with thousands of concurrent jobs or very short tasks may need a future design with a supervisor actor, controller pool, sharded registries, batched task tracking, or DB-backed coordination.