Ray simple job backend (`kind="ray-simple"`)

ray-simple is the lightweight Ray job backend for checkmaite jobs:

from checkmaite.jobs import configure_job_backend

configure_job_backend(
    "ray-simple",
    analytics_store={"backend": "parquet", "uri": "./job-results"},
)

It submits one Ray task for each capability run and returns a local RaySimpleJob handle. It is intentionally much simpler than the default ray job backend.

Use it when you want something easy to understand, easy to debug, and good enough for local development, demos, notebooks, or simple single-driver workflows. Do not use it as a durable shared job service.

Important

ray-simple trades durability for simplicity. It is often the easiest job backend to use for demos and local notebooks, but the submitting client is responsible for duplicate-submission policy, crash recovery, and keeping job handles alive. If those responsibilities are not acceptable, use the default ray job backend.

Practical guidance

ray-simple has fewer moving parts than the default ray job backend because it does not create a shared registry actor or per-job controller actors. That makes it easier to debug basic worker execution, analytics-store writes, and job result handling.

The tradeoff is that operational responsibilities move to the user:

you decide whether duplicate submissions are safe;
you decide how to recover after a client crash;
you keep the submitting process alive while using lifecycle APIs;
you choose durable storage for completed run data;
you avoid sending very large Python objects through Ray serialization when a URI, object-store reference, or other external storage reference would be more appropriate.

If those tradeoffs are unacceptable, use the default ray job backend instead.
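As an illustration of the first responsibility in the list above, a client can keep its own record of what it has already submitted. The sketch below is not part of checkmaite: `SubmissionGuard`, `submit_fn`, `fake_submit`, and the key format are hypothetical names chosen for this example. The guard lives only in the submitting process, so it does nothing for crash recovery.

```python
from typing import Any, Callable, Dict, Hashable


class SubmissionGuard:
    """Hypothetical client-side duplicate-submission guard (not a checkmaite API)."""

    def __init__(self, submit_fn: Callable[..., Any]) -> None:
        # submit_fn stands in for a call such as submit_capability.
        self._submit_fn = submit_fn
        self._handles: Dict[Hashable, Any] = {}

    def submit_once(self, key: Hashable, *args: Any, **kwargs: Any) -> Any:
        """Submit only if `key` has not been seen; otherwise reuse the handle."""
        if key not in self._handles:
            self._handles[key] = self._submit_fn(*args, **kwargs)
        return self._handles[key]

    def forget(self, key: Hashable) -> None:
        """Drop a key so that the next submit_once call resubmits."""
        self._handles.pop(key, None)


# Stand-in submit function that records each real submission.
calls = []


def fake_submit(name: str) -> str:
    calls.append(name)
    return f"job-{len(calls)}"


guard = SubmissionGuard(fake_submit)
key = ("eval", "model-a", "dataset-1")
first = guard.submit_once(key, "model-a")
second = guard.submit_once(key, "model-a")  # reuses the first handle
```

Choosing the key is the actual policy decision: it should cover everything that makes two submissions equivalent for your workflow.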

When to use `ray-simple`

ray-simple is a good fit when:

one Python process or notebook submits and watches the jobs;
losing job handles after the client exits is acceptable;
duplicate submissions are acceptable or are handled by your own code;
you want less job backend machinery while developing or debugging;
you are running demos or small experiments where operational simplicity matters more than durability.

Prefer the default ray job backend when:

jobs must survive notebook or driver restarts;
another client must list, cancel, or reconnect to existing jobs;
duplicate submissions must be suppressed by the job backend;
multiple users or processes share the same Ray cluster;
you need production-style job tracking on KubeRay or a long-running Ray cluster.

End-to-end flow

sequenceDiagram
    participant Client
    participant JobBackend as RaySimpleJobBackend
    participant Ray as Ray runtime
    participant Worker as Ray worker
    participant Store as AnalyticsStore

    Client->>JobBackend: submit_capability(...)
    JobBackend->>Ray: ray.remote(_execute_capability_ref).remote(...)
    JobBackend-->>Client: RaySimpleJob (local handle)
    Ray->>Worker: execute _execute_capability_ref
    Worker->>Worker: capability.run(..., use_cache=False)
    Worker->>Store: write_with_receipt([run])
    Store-->>Worker: payload URI
    Worker-->>Ray: CapabilityRunRef or exception
    Client->>JobBackend: job.status / wait() / result() / cancel()
    JobBackend->>Ray: ray.wait / ray.get / ray.cancel

The important difference from the default ray backend is that the local RaySimpleJobBackend object owns the remembered job handles. There is no shared registry and no detached per-job controller actor.

Public usage

1. Configure the job backend

from checkmaite.jobs import configure_job_backend

configure_job_backend(
    "ray-simple",
    address="local",
    analytics_store={"backend": "parquet", "uri": "./analytics_store"},
)

2. Submit work

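from checkmaite.jobs import submit_capability

# capability, dataset, model, metric, and config are assumed to be defined earlier.
job = submit_capability(
    capability,
    datasets=[dataset],
    models=[model],
    metrics=[metric],
    config=config,
    use_cache=False,  # ray-simple rejects use_cache=True
)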

ray-simple rejects use_cache=True. Workers that run submitted jobs are ephemeral and do not share the client's capability-local cache, so worker execution always uses use_cache=False. For durable reuse, rely on your analytics store or on the registry-backed ray backend's idempotent submission semantics.

3. Inspect lifecycle and retrieve the result reference

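print(job.job_id)
print(job.status)
print(job.wait(timeout=0.1))  # a wait timeout does not cancel the task

# result() returns a small CapabilityRunRef, not the run payload.
ref = job.result(timeout=300)
print(ref.run_uid)
print(ref.store_uri)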

4. List remembered jobs

from checkmaite.jobs import JobStatus, list_jobs

recent = list_jobs(limit=100)
completed = list_jobs(limit=50, status_filter=JobStatus.COMPLETED)

list_jobs(...) only lists jobs remembered by the current RaySimpleJobBackend object. Jobs submitted from another process, or before the client restarted, do not appear.
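To make the "remembered by the current backend object" behavior concrete, here is a minimal sketch of an in-process job registry. It is not the checkmaite implementation: `LocalBackend`, `Handle`, `remember`, the string statuses, and the newest-first ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Handle:
    job_id: str
    status: str  # plain strings such as "COMPLETED" keep the sketch small


class LocalBackend:
    """Hypothetical backend that remembers handles only inside this object."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Handle] = {}

    def remember(self, handle: Handle) -> None:
        self._jobs[handle.job_id] = handle

    def list_jobs(self, limit: int, status_filter: Optional[str] = None) -> List[Handle]:
        # Newest-first ordering is an assumption made for this sketch.
        newest_first = reversed(list(self._jobs.values()))
        matches = [h for h in newest_first if status_filter in (None, h.status)]
        return matches[:limit]


backend = LocalBackend()
backend.remember(Handle("a", "COMPLETED"))
backend.remember(Handle("b", "RUNNING"))
backend.remember(Handle("c", "COMPLETED"))

completed_ids = [h.job_id for h in backend.list_jobs(limit=50, status_filter="COMPLETED")]

# A new backend object (for example after a restart) starts with no history.
fresh_backend_ids = [h.job_id for h in LocalBackend().list_jobs(limit=100)]
```

Because the registry is an ordinary attribute of the object, nothing survives once that object or its process is gone.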

Status mapping

RaySimpleJob maps the state of one Ray task onto the shared JobStatus protocol:

first non-ready observation -> PENDING
later non-ready observations -> RUNNING
successful task result -> COMPLETED
Ray task cancellation -> CANCELLED
task exception -> FAILED

Ray task readiness does not distinguish queued work from executing work, so the PENDING / RUNNING split is a local polling heuristic.
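The polling heuristic can be sketched without Ray. The example below uses `concurrent.futures.Future` as a stand-in for a Ray task reference; `Status` and `PolledTask` are hypothetical names, and the real RaySimpleJob may be implemented differently.

```python
import enum
from concurrent.futures import Future


class Status(enum.Enum):
    # Local stand-in for the shared JobStatus values named above.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    CANCELLED = "cancelled"
    FAILED = "failed"


class PolledTask:
    """Hypothetical handle that derives a status from readiness checks."""

    def __init__(self, future: Future) -> None:
        self._future = future
        self._seen_not_ready = False

    @property
    def status(self) -> Status:
        if not self._future.done():
            # Readiness alone cannot tell queued from executing work, so the
            # first non-ready poll reports PENDING and later ones RUNNING.
            if not self._seen_not_ready:
                self._seen_not_ready = True
                return Status.PENDING
            return Status.RUNNING
        if self._future.cancelled():
            return Status.CANCELLED
        if self._future.exception() is not None:
            return Status.FAILED
        return Status.COMPLETED


pending = Future()
task = PolledTask(pending)
observed = [task.status, task.status]  # two polls before the task is ready
pending.set_result("done")
observed.append(task.status)
```

Note that RUNNING here means only "polled more than once and still not ready"; the task may in fact still be queued.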

Assumptions

Local job identity and recovery

ray-simple tracks jobs only in the current RaySimpleJobBackend object. If the notebook, driver, or Python process exits, list_jobs(), get_job(job_id), waiting, and cancellation lose access to those jobs.

Every submit_capability(...) call creates a new Ray task. There is no shared registry, idempotency_scope, duplicate suppression, or crash recovery. Use the default ray backend when shared metadata or reattach behavior is required.

Result storage assumptions

job.result() returns a small CapabilityRunRef, not the completed capability run payload. Completed run data is written through the configured analytics store.
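The reference-instead-of-payload pattern can be sketched as follows. This is an illustration only: `RunRef`, `run_and_store`, and `load_run` are hypothetical, JSON files in a temporary directory stand in for the analytics store, and treating `store_uri` as the store location is an assumption rather than documented behavior.

```python
import json
import tempfile
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunRef:
    """Illustrative stand-in for CapabilityRunRef; only these two fields are assumed."""

    run_uid: str
    store_uri: str


def run_and_store(payload: dict, store_dir: str) -> RunRef:
    """Worker side: persist the full payload, return only a small reference."""
    run_uid = uuid.uuid4().hex
    path = Path(store_dir) / f"{run_uid}.json"
    path.write_text(json.dumps(payload))
    return RunRef(run_uid=run_uid, store_uri=store_dir)


def load_run(ref: RunRef) -> dict:
    """Client side: resolve the reference against the store to get the payload."""
    path = Path(ref.store_uri) / f"{ref.run_uid}.json"
    return json.loads(path.read_text())


store_dir = tempfile.mkdtemp()
ref = run_and_store({"accuracy": 0.91, "rows": list(range(1000))}, store_dir)
payload = load_run(ref)
```

Only the two short strings in the reference cross the task boundary; the large payload stays in storage until a client asks for it.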

Runtime and lifecycle assumptions

Status and cancellation are best-effort observations of one Ray task. cancel() issues ray.cancel(...), but by the time the cancellation takes effect, user code may already have run and produced side effects. Timeouts on result() and wait() only stop the caller from waiting; they do not cancel the task.
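The timeout behavior can be demonstrated with the standard library. The sketch below uses a thread pool as a stand-in for Ray, so it shows the general pattern rather than checkmaite's or Ray's implementation; `slow_task` is a hypothetical workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_task() -> str:
    time.sleep(0.3)
    return "finished"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_task)
    try:
        future.result(timeout=0.05)  # give up waiting early
        timed_out = False
    except FutureTimeout:
        timed_out = True
    # The timeout only ended the wait; the task itself keeps running.
    still_running = not future.done()
    final = future.result()  # wait without a timeout for the real result
```

If abandoned work should stop, the caller has to request cancellation explicitly after the timeout.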

RaySimpleJobBackend shares the process-wide Ray runtime. shutdown(wait=True) waits for known jobs and calls ray.shutdown(), which also disconnects any other code in the same process that uses Ray. shutdown(wait=False) returns without shutting Ray down.
