Job submission and cluster execution

checkmaite traditionally executes capabilities through capability.run(...), which blocks until the run finishes. For small local workloads that is fine. For long-running evaluations, it creates two problems:

  1. Interactivity — notebook users cannot keep working smoothly while a run is executing.
  2. Compute scaling — one local Python process is a poor fit for capabilities that need more CPU/GPU resources or cluster execution.

The job-submission subsystem addresses both problems:

  • it gives users a non-blocking job handle,
  • it lets the same API target local or distributed job-submission backends.
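
The difference between a blocking call and a non-blocking handle can be sketched with the Python standard library alone. This is an illustration of the idea, not checkmaite's actual API: `run_capability`, `JobHandle`, `submit`, and the state names are hypothetical, and a `concurrent.futures` executor stands in for a job-submission backend.

```python
import time
from concurrent.futures import Executor, Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


def run_capability(n: int) -> int:
    """Hypothetical stand-in for a blocking capability.run(...) call."""
    time.sleep(0.1)  # simulate a long-running evaluation
    return n * n


class JobHandle:
    """Illustrative non-blocking handle (assumed shape, not the real class)."""

    def __init__(self, future: Future) -> None:
        self._future = future

    def status(self) -> str:
        # State names here are assumptions chosen for the sketch.
        if self._future.cancelled():
            return "CANCELLED"
        if self._future.running():
            return "RUNNING"
        if not self._future.done():
            return "PENDING"
        return "FAILED" if self._future.exception() else "SUCCEEDED"

    def result(self, timeout: Optional[float] = None) -> Any:
        # Blocks only when the caller explicitly asks for the result.
        return self._future.result(timeout=timeout)


def submit(backend: Executor, fn: Callable[..., Any], *args: Any) -> JobHandle:
    """Submit work to any Executor-like backend and return immediately."""
    return JobHandle(backend.submit(fn, *args))


# The executor plays the role of the backend in this sketch.
with ThreadPoolExecutor(max_workers=1) as backend:
    handle = submit(backend, run_capability, 7)
    print(handle.status())   # the caller keeps control while the job runs
    print(handle.result())   # waits for completion only at this point
```

Because `submit` depends only on the `Executor` interface, swapping `ThreadPoolExecutor` for another executor leaves the calling code unchanged. That mirrors, in miniature, the goal of letting one API target either local or distributed backends.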

The pages in this section cover:

  • the shared job handle contract, lifecycle states, reference-first results, and error semantics,
  • the backend-level settings that must be configured before submission, including execution target, worker environment, storage, and shared job identity,
  • the default registry/controller-backed Ray job backend for reattachable jobs,
  • the direct process-local Ray task-based job backend for simple single-driver workflows,
  • guidance for platform teams on container images, Ray worker setup, and runtime_env overlays,
  • Kubernetes-specific guidance for KubeRay placement, detached actors, autoscaling, and durability boundaries,
  • why durable result writes are more subtle in distributed execution and what job submission expects from the configured store.