Job submission and cluster execution

checkmaite traditionally executes capabilities through capability.run(...), which blocks until the run finishes. For small local workloads that is fine. For long-running evaluations, it creates two problems:

Interactivity: notebook users cannot keep working while a run is executing, because the call does not return until the run finishes.
Compute scaling: a single local Python process is a poor fit for capabilities that need more CPU/GPU resources or cluster execution.

The job-submission subsystem addresses both problems:

it gives users a non-blocking job handle,
it lets the same API target local or distributed job-submission backends.
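
The difference between a blocking call and a non-blocking handle can be sketched with Python's standard library. This is an analogy only: it uses concurrent.futures rather than checkmaite's job-submission API, and run_capability is a hypothetical stand-in for capability.run(...).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_capability(n_items: int) -> dict:
    """Hypothetical stand-in for a long-running, blocking capability run."""
    time.sleep(0.2)  # simulate evaluation work
    return {"items_evaluated": n_items}


# Blocking style: the caller waits on this line until the run finishes.
blocking_result = run_capability(10)

# Non-blocking style: submit() returns a handle (a Future) right away.
executor = ThreadPoolExecutor(max_workers=1)
handle = executor.submit(run_capability, 10)

# The caller can do other work here and poll the handle if it wants to.
still_running = not handle.done()

# Collect the result only when it is needed; this call waits until it is ready.
handle_result = handle.result(timeout=5)
executor.shutdown()
```

A thread pool only addresses the interactivity problem. Moving the work to more CPU/GPU resources or to a cluster is what the configurable job backends described in the pages below are for.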

What to read next

Protocol and lifecycle

The shared job handle contract, lifecycle states, reference-first results, and error semantics.

Job backend configuration (configure_job_backend)

The backend-level settings that must be configured before submission, including execution target, worker environment, storage, and shared job identity.

Ray job backend

The default registry/controller-backed Ray job backend for reattachable jobs.

Ray simple job backend

The direct, process-local job backend built on Ray tasks, for simple single-driver workflows.

Worker environments

Guidance for platform teams on container images, Ray worker setup, and runtime_env overlays.

Kubernetes and KubeRay

Kubernetes-specific guidance for KubeRay placement, detached actors, autoscaling, and durability boundaries.

Distributed analytics store

Why durable result writes are more subtle in distributed execution and what job submission expects from the configured store.