Job submission and cluster execution
checkmaite traditionally executes capabilities through capability.run(...), which blocks until the run finishes. For small local workloads that is fine. For long-running evaluations, the blocking call creates two problems:
- Interactivity: notebook users cannot keep working while a run is executing, because the blocking call occupies the session until it returns.
- Compute scaling: one local Python process is a poor fit for capabilities that need more CPU/GPU resources or cluster execution.
The job-submission subsystem addresses both problems:
- it gives users a non-blocking job handle,
- it lets the same API target local or distributed job-submission backends.
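The difference between a blocking call and a non-blocking handle can be sketched with Python's standard library alone. This is only an illustration of the pattern, not the checkmaite API: `run_capability` is a hypothetical stand-in for a blocking `capability.run(...)` call, and the `concurrent.futures` future plays the role of a job handle. The real job handle contract is described in the pages listed below.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_capability(n_items: int) -> dict:
    """Hypothetical stand-in for a blocking capability.run(...) call."""
    time.sleep(0.2)  # simulate a long-running evaluation
    return {"n_items": n_items, "status": "ok"}


# Blocking style: the caller waits here until the run finishes.
blocking_result = run_capability(10)

# Non-blocking style: submit() returns a handle immediately.
executor = ThreadPoolExecutor(max_workers=1)
handle = executor.submit(run_capability, 10)

# The caller is free to do other work while the run executes.
other_work = sum(range(1000))

# Collect the result only when it is needed; result() blocks until it is ready.
job_result = handle.result(timeout=5)
executor.shutdown()
```

A thread pool only addresses the interactivity problem. Moving the work to more CPU/GPU resources or to a cluster is what the distributed backends are for, which is why the subsystem exposes the same handle-style API over different backends.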
What to read next
The shared job handle contract, lifecycle states, reference-first results, and error semantics.
What backend-level settings must be configured before submission, including execution target, worker environment, storage, and shared job identity.
The default registry/controller-backed Ray job backend for reattachable jobs.
The direct process-local Ray task-based job backend for simple single-driver workflows.
Guidance for platform teams on container images, Ray worker setup, and runtime_env overlays.
Kubernetes-specific guidance for KubeRay placement, detached actors, autoscaling, and durability boundaries.
Why durable result writes are more subtle in distributed execution and what job submission expects from the configured store.