The Analytics Store¶

This notebook introduces the analytics store — a structured storage layer that makes capability run results queryable via SQL.

We cover why the store exists, how to use and extend it, and the on-disk file layout.

In local synchronous execution, compute/write/read all happen in one process. In distributed job submission, those responsibilities split across client and worker processes, so the client must explicitly pass analytics-store configuration (configure_job_backend(..., analytics_store=...)) to tell workers where to write durable data that the client will later read.

See Analytics store in distributed execution and Job backend configuration.

Part 1: Why¶

The problem¶

When capabilities run, they produce rich Python objects: NumPy arrays, nested dicts, per-image statistics, binary tensors. These are useful for programmatic access, but they aren't queryable:

No cross-run queries. To answer "which of my CIFAR-10 runs had the highest accuracy?", you'd need to load every run object, check its type and dataset, and extract the metric manually. There's no index, no schema, and no way to filter without loading everything.
Not accessible outside Python. Run results contain Python-specific structures (class paths, binary blob references) that only checkmaite can resolve. A data analyst with DuckDB or a BI dashboard cannot query them.
Too much data for comparison. A DataevalCleaningRun includes raw per-image dimension arrays, per-image visual statistics, full outlier dictionaries, etc. Most cross-run analysis only needs aggregate summaries: "how many duplicates?", "what was the mean brightness?", "what was the class imbalance ratio?"

What the analytics store provides¶

The analytics store is the public API for querying capability results across runs. It provides:

Flat, typed tables — one per capability — where each row is a scalar-only summary of a run. When a run produces variable-length data (e.g. multiple metrics), multiple rows are created. No nested objects, no binary references, no Python class paths.
SQL as the primary query interface. Filter, join, aggregate across runs and across capabilities using standard SQL. Every StorageBackend implementation exposes a query_sql() method.
Plain Parquet files on disk with the default backend. Readable by DuckDB, Spark, pandas, pyarrow, Snowflake, or any other tool that speaks Parquet — no Python required. Other backends (DuckDB, Postgres, etc.) provide their own native access paths.
A StorageBackend protocol that allows swapping Parquet for DuckDB, Delta Lake, Postgres, etc. when scale demands it.

Beyond local: platform integration¶

Note: This notebook demonstrates local store.write(...) usage. In job submission mode, workers can be remote and must be told where to write analytics data.

In local synchronous execution, the process that computes the run already knows:

where the analytics store lives,
how to write to it,
and how to read it back later.

In distributed job submission, those responsibilities are split:

the client chooses the durable store location,
the worker needs that information so it can persist results,
and the client later needs a stable way to find/read the payload data the worker wrote.

That is why configure_job_backend(...) requires explicit analytics-store configuration:

from checkmaite.jobs import configure_job_backend

configure_job_backend(
    "ray",
    analytics_store={
        "backend": "parquet",
        "uri": "./analytics_store",
    },
)

checkmaite.jobs._store defines the typed AnalyticsStoreConfig; the job backend then forwards that config with each submission so workers build/write to the intended store location. See Analytics store in distributed execution and Job backend configuration.

The store's design — scalar-only Parquet tables behind a swappable backend protocol — also opens a path to platform-level integration.

Consider a Databricks deployment:

Store Parquet files on cloud storage (S3, ADLS, GCS) and they can be queried in place as external tables, read with spark.read.parquet() in Spark, or converted to Delta Lake tables, with no Python glue needed.
Implement a DeltaLakeBackend (or a DatabricksBackend using Unity Catalog) and the store writes directly into managed tables. Capability results become first-class catalog objects: discoverable, governed, and queryable by any team member with SQL access — not just the Python users who ran the capabilities.
Cross-team analytics become possible. A data scientist runs capabilities locally or in a Databricks notebook; an ML engineer queries the results via Databricks SQL; a program manager builds a dashboard on the same tables. Everyone operates on the same structured data without any custom export step.

None of this requires changes to the store API or to capability extract() implementations — only a new StorageBackend.

Key design decisions¶

These decisions were taken deliberately and inform everything that follows.

Decision	Rationale
One table per capability	For example, there are tables named `dataeval_cleaning`, `maite_evaluation`, etc. Each table has a fixed schema defined by a `BaseRecord` subclass for each capability. Capabilities produce structurally different outputs, so separate tables with distinct schemas are the natural representation.
Flat, scalar-only records	Every field must be `str`, `int`, `float`, `bool`, `bytes`, `datetime`, or `Optional` variants. No lists, dicts, or nested models. This guarantees that every record maps to a Parquet/SQL row without transformation. Variable-length data (e.g. per-metric results) uses multiple records instead.
`StorageBackend` protocol	The store doesn't know or care how data is persisted. The default Parquet backend is the simplest thing that works. When you outgrow it, swap in a DuckDB, Delta Lake, or Databricks backend without changing any store or record code.
Append-only, immutable writes	Run results are historical facts. Each `write()` call persists records via the configured backend. No updates, no deletes. Erroneous runs are handled by re-running and filtering in queries.
Idempotent writes (across calls)	Writing the same `run_uid` twice across separate `write()` calls is a no-op (deduplicated by `run_uid`). Safe for notebook re-execution. Note: deduplication is checked against previously written files — duplicate `run_uid` values within a single `write()` call are not deduplicated.
Automatic `runs` table	Maps every `run_uid` to its datasets, models, and metrics. Capability authors don't manage this — the store writes it automatically.
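
The append-only and idempotency rows above can be pictured with a small in-memory model. This is a sketch under stated assumptions, not checkmaite's implementation: `ToyBackend`, its `files` attribute, and `count()` are hypothetical names, and each "file" is just a Python list of row dicts.

```python
class ToyBackend:
    """In-memory stand-in for an append-only backend (hypothetical)."""

    def __init__(self) -> None:
        # Each successful write() call appends one immutable "file".
        self.files: list[list[dict]] = []

    def write(self, rows: list[dict]) -> None:
        # Deduplicate only against run_uids persisted by earlier calls.
        seen = {row["run_uid"] for f in self.files for row in f}
        fresh = [row for row in rows if row["run_uid"] not in seen]
        if fresh:  # a fully duplicate call writes nothing
            self.files.append(fresh)

    def count(self) -> int:
        return sum(len(f) for f in self.files)


toy = ToyBackend()
toy.write([{"run_uid": "run_1", "score": 0.92}])
toy.write([{"run_uid": "run_1", "score": 0.92}])  # no-op across calls
after_rerun = toy.count()

# Duplicates inside a single call are not collapsed.
toy.write([{"run_uid": "run_2", "score": 0.98}, {"run_uid": "run_2", "score": 0.98}])
after_batch = toy.count()
print(after_rerun, after_batch)
```

Checking only previously written data keeps re-execution safe while still letting one run contribute several rows in the same call.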

What the store is NOT¶

Not an experiment tracker. There is no tagging, no run comparison UI, no artifact storage. It is a structured data layer that those tools can be built on top of.

Part 2: How¶

Architecture overview¶

capability.run()
  │
  └── returns CapabilityRunBase
              │
              │  .extract()
              ▼
        [BaseRecord, ...]            ── scalar summaries
              │
              ▼
        AnalyticsStore.write()
          │              │
          ▼              ▼
     capability      runs table
      records       (auto-generated)
          │              │
          ▼              ▼
      StorageBackend.write()
          │
  ┌───────┼────────┐
  ▼       ▼        ▼
Parquet  DuckDB  Postgres
(default) (future) (future)

The store is populated explicitly by calling store.write([run1, run2]), which invokes each run's extract() method to produce flat records.
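
The flow in the diagram can be mimicked with a few toy classes. This is a minimal sketch, assuming nothing about checkmaite internals: `ToyRun`, `ToyRecord`, `ToyStore`, and the `toy_evaluation` table name are hypothetical, and the "backend" is a plain dict of lists.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class ToyRecord:
    run_uid: str
    dataset_id: str
    output_key: str
    output_value: float


@dataclass
class ToyRun:
    run_uid: str
    dataset_id: str
    metrics: dict

    def extract(self) -> list:
        # Project the rich output into flat, scalar-only records.
        return [ToyRecord(self.run_uid, self.dataset_id, k, v) for k, v in self.metrics.items()]


class ToyStore:
    def __init__(self) -> None:
        self.tables = defaultdict(list)  # table name -> list of row dicts

    def write(self, runs: list) -> None:
        for run in runs:
            for record in run.extract():
                self.tables["toy_evaluation"].append(asdict(record))
            # The store, not the capability author, adds the runs row.
            self.tables["runs"].append(
                {"run_uid": run.run_uid, "entity_type": "dataset", "entity_id": run.dataset_id}
            )


toy_store = ToyStore()
toy_store.write([
    ToyRun("r1", "cifar10", {"map50": 0.45, "map75": 0.32}),
    ToyRun("r2", "mnist", {"accuracy": 0.98}),
])
print({name: len(rows) for name, rows in toy_store.tables.items()})
```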

Setup¶

Let's create a store backed by a temporary Parquet directory.

In [1]:

import tempfile
from pathlib import Path

from checkmaite.core.analytics_store import AnalyticsStore, ParquetBackend

# In practice you'd use a persistent path like "./analytics_store"
store_dir = tempfile.mkdtemp(prefix="analytics_store_")
store = AnalyticsStore(ParquetBackend(store_dir))
print(f"Store path: {store_dir}")

Store path: /tmp/analytics_store_20fvb7wc

Writing records directly (understanding the primitives)¶

Before using the full store.write(runs) workflow, let's see what records look like and how the backend works. This makes the abstractions concrete.

A BaseRecord subclass defines a table schema. Every field must be a scalar type — this is enforced at class definition time.

In [2]:

from checkmaite.core.analytics_store import BaseRecord


# A valid record: all fields are scalar
class ExampleRecord(BaseRecord, table_name="example"):

    dataset_id: str
    score: float
    sample_count: int
    notes: str | None = None


record = ExampleRecord(
    run_uid="abc123",
    dataset_id="cifar10",
    score=0.95,
    sample_count=50000,
)
print(record)
print(f"\nTable: {record.table_name}")
print(f"Serialised: {record.model_dump(mode='python')}")

run_uid='abc123' created_at=datetime.datetime(2026, 5, 22, 18, 32, 15, 820380, tzinfo=datetime.timezone.utc) dataset_id='cifar10' score=0.95 sample_count=50000 notes=None

Table: example
Serialised: {'run_uid': 'abc123', 'created_at': datetime.datetime(2026, 5, 22, 18, 32, 15, 820380, tzinfo=datetime.timezone.utc), 'dataset_id': 'cifar10', 'score': 0.95, 'sample_count': 50000, 'notes': None}

In [3]:

# This will fail: list is not a scalar type
try:
    class BadRecord(BaseRecord, table_name="bad"):
        tags: list[str]  # NOT allowed
except TypeError as e:
    print(f"Rejected: {e}")

Rejected: Field 'tags' on BadRecord uses non-scalar type list[str]. BaseRecord subclasses must use only flat types (str, int, float, bool, bytes, datetime, or Optional variants). If you need variable-length data, return multiple records from extract().

The scalar constraint is what makes every record directly queryable via SQL — no binary blobs, no nested structures, no Python-specific references.
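
One way a check like this can run at class definition time is through `__init_subclass__`. The sketch below is an illustration using only the standard library, not the actual `BaseRecord` code: `ToyBaseRecord`, `is_scalar`, and `SCALARS` are hypothetical names, and the real class may validate differently.

```python
import types
from datetime import datetime
from typing import Optional, Union, get_args, get_origin, get_type_hints

SCALARS = (str, int, float, bool, bytes, datetime)
# types.UnionType (the type of `X | None`) exists on Python 3.10+.
UNION_ORIGINS = (Union, getattr(types, "UnionType", Union))


def is_scalar(tp) -> bool:
    if tp in SCALARS:
        return True
    # Accept Optional[X] / X | None where X is itself a scalar.
    if get_origin(tp) in UNION_ORIGINS:
        return all(arg is type(None) or arg in SCALARS for arg in get_args(tp))
    return False


class ToyBaseRecord:
    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Runs once, when the subclass body has been evaluated.
        for name, tp in get_type_hints(cls).items():
            if not is_scalar(tp):
                raise TypeError(f"Field {name!r} on {cls.__name__} uses non-scalar type {tp}")


class GoodRecord(ToyBaseRecord):
    score: float
    notes: Optional[str] = None


try:
    class BadRecord(ToyBaseRecord):
        tags: list[str]
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```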

If a capability produces variable-length data (e.g. one metric value per class), the extract() method returns multiple records — one per logical row. This is the Entity-Attribute-Value (EAV) pattern.
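
As a rough illustration of that flattening, the helper below turns a metrics dict into one row per metric. `to_eav_rows` is a hypothetical helper written for this example; only the `output_key` and `output_value` column names are borrowed from the evaluation records shown later.

```python
def to_eav_rows(run_uid: str, dataset_id: str, metrics: dict) -> list:
    """Flatten a metrics dict into one scalar-only row per metric."""
    return [
        {
            "run_uid": run_uid,
            "dataset_id": dataset_id,
            "output_key": key,             # the "attribute"
            "output_value": float(value),  # the "value"
        }
        for key, value in metrics.items()
    ]


eav_rows = to_eav_rows("c3d4", "cifar10", {"map50": 0.45, "map75": 0.32})
for row in eav_rows:
    print(row)
```

Because every metric becomes its own row, a filter such as `WHERE output_key = 'map50'` works without knowing in advance which metrics a run produced.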

Writing and querying via the backend¶

The ParquetBackend handles file layout and SQL execution. Let's write some records and query them.

In [4]:

backend = ParquetBackend(store_dir)

# Write records from three different "runs"
backend.write([
    ExampleRecord(run_uid="run_1", dataset_id="cifar10", score=0.92, sample_count=50000),
    ExampleRecord(run_uid="run_2", dataset_id="mnist", score=0.98, sample_count=60000),
    ExampleRecord(run_uid="run_3", dataset_id="cifar10", score=0.94, sample_count=50000, notes="augmented"),
])

print("Tables:", backend.list_tables())
print("\nSchema:", backend.describe_table("example"))

Tables: ['example']

Schema: {'run_uid': 'String', 'created_at': "Datetime(time_unit='us', time_zone='UTC')", 'dataset_id': 'String', 'score': 'Float64', 'sample_count': 'Int64', 'notes': 'String'}

In [5]:

# SQL queries work directly
df = backend.query_sql("SELECT dataset_id, score, notes FROM example ORDER BY score DESC")
print(df)

shape: (3, 3)
┌────────────┬───────┬───────────┐
│ dataset_id ┆ score ┆ notes     │
│ ---        ┆ ---   ┆ ---       │
│ str        ┆ f64   ┆ str       │
╞════════════╪═══════╪═══════════╡
│ mnist      ┆ 0.98  ┆ null      │
│ cifar10    ┆ 0.94  ┆ augmented │
│ cifar10    ┆ 0.92  ┆ null      │
└────────────┴───────┴───────────┘

In [6]:

# Aggregation across runs for the same dataset
df = backend.query_sql("""
    SELECT dataset_id, COUNT(*) AS run_count, AVG(score) AS avg_score
    FROM example
    GROUP BY dataset_id
""")
print(df)

shape: (2, 3)
┌────────────┬───────────┬───────────┐
│ dataset_id ┆ run_count ┆ avg_score │
│ ---        ┆ ---       ┆ ---       │
│ str        ┆ u32       ┆ f64       │
╞════════════╪═══════════╪═══════════╡
│ mnist      ┆ 1         ┆ 0.98      │
│ cifar10    ┆ 2         ┆ 0.93      │
└────────────┴───────────┴───────────┘

In [7]:

# Idempotent writes — writing the same run_uid again is a no-op
backend.write([
    ExampleRecord(run_uid="run_1", dataset_id="cifar10", score=0.92, sample_count=50000),
])

count = backend.query_sql("SELECT COUNT(*) AS n FROM example")
print(f"Still 3 rows (not 4): {count}")

Still 3 rows (not 4): shape: (1, 1)
┌─────┐
│ n   │
│ --- │
│ u32 │
╞═════╡
│ 3   │
└─────┘

Without the store, getting the same result would require keeping every capability run result in memory, filtering by capability type and dataset, extracting the metric value, and aggregating manually. The store makes this a one-line SQL query. Additionally, the Parquet backend leverages columnar storage — queries that touch only a subset of columns read only those columns from disk, avoiding full-file scans.

Schema evolution¶

The Parquet backend supports adding and removing fields over time. This is a property of how the backend reads Parquet files: it uses Polars' diagonal_relaxed concatenation, which aligns columns by name and fills missing columns with null. Future backend implementations should provide equivalent behaviour.

Adding a field: Old data gets None for the new column.
Removing a field: Old data retains the column; new records simply don't populate it.
Renaming or changing types: Not supported — requires manual migration.
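
The column alignment can be pictured without Polars. The function below is a plain-Python approximation of a diagonal concatenation (in Polars itself this would be `pl.concat(frames, how="diagonal_relaxed")`); `diagonal_concat` is a hypothetical helper, and it skips the type-widening that the "relaxed" variant also performs.

```python
def diagonal_concat(*batches: list) -> list:
    """Align rows by column name; columns missing from a row become None."""
    columns: list = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    return [{name: row.get(name) for name in columns} for batch in batches for row in batch]


old_file = [{"run_uid": "run_1", "score": 0.92}]
new_file = [{"run_uid": "run_4", "score": 0.89, "augmented": True}]
merged = diagonal_concat(old_file, new_file)
for row in merged:
    print(row)
```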

In [8]:

# Simulate schema evolution: add an "augmented" boolean field
class ExampleRecordV2(BaseRecord, table_name="example"):  # Same table name!

    dataset_id: str
    score: float
    sample_count: int
    notes: str | None = None
    augmented: bool | None = None  # New field


backend.write([
    ExampleRecordV2(run_uid="run_4", dataset_id="svhn", score=0.89, sample_count=73000, augmented=True),
])

# Old rows have None for 'augmented'; new row has the value
df = backend.query_sql("SELECT dataset_id, score, augmented FROM example ORDER BY dataset_id")
print(df)

shape: (4, 3)
┌────────────┬───────┬───────────┐
│ dataset_id ┆ score ┆ augmented │
│ ---        ┆ ---   ┆ ---       │
│ str        ┆ f64   ┆ bool      │
╞════════════╪═══════╪═══════════╡
│ cifar10    ┆ 0.92  ┆ null      │
│ cifar10    ┆ 0.94  ┆ null      │
│ mnist      ┆ 0.98  ┆ null      │
│ svhn       ┆ 0.89  ┆ true      │
└────────────┴───────┴───────────┘

The full workflow: `store.write(runs)`¶

In normal usage, you don't create records manually. You run the capability, then pass the resulting run objects to store.write(). The store calls each run's extract() method and auto-populates the runs metadata table.

Let's trace what happens step by step.

Step 1: `extract()` projects capability outputs into flat records¶

Each CapabilityRunBase subclass implements extract() to select and aggregate the fields from its outputs that are useful for cross-run comparison. extract() produces a curated summary, not a full serialisation.

DataevalCleaningRun.extract() returns a single record per run. The full output contains raw per-image arrays (widths, heights, brightness values, outlier dictionaries, etc.), but the record distils these into ~20 aggregate scalars:

DataevalCleaningRecord(
    run_uid=self.run_uid,
    dataset_id="cifar10",
    exact_duplicate_count=12,
    exact_duplicate_ratio=0.00024,
    image_outlier_count=47,
    image_outlier_ratio=0.00094,
    mean_width=480.0,
    mean_brightness=0.48,
    class_imbalance_ratio=1.0,
    ...  # ~10 more scalar fields
)

MaiteEvaluationRun.extract() returns multiple records — one per output of Metric.compute(). The full output stores all metrics in a single dict; the store unpacks them into separate records for SQL filtering:

[
    MaiteEvaluationRecord(run_uid=..., dataset_id="cifar10", model_id="resnet50",
                          metric_id="coco_metrics", output_key="map50",
                          output_value=0.45, scope="overall"),
    MaiteEvaluationRecord(run_uid=..., dataset_id="cifar10", model_id="resnet50",
                          metric_id="coco_metrics", output_key="map75",
                          output_value=0.32, scope="overall"),
    # Plus one record per class if class_metrics were computed:
    MaiteEvaluationRecord(run_uid=..., dataset_id="cifar10", model_id="resnet50",
                          metric_id="coco_metrics", output_key="map50",
                          output_value=0.52, scope="class", class_name="person"),
    MaiteEvaluationRecord(run_uid=..., dataset_id="cifar10", model_id="resnet50",
                          metric_id="coco_metrics", output_key="map50",
                          output_value=0.38, scope="class", class_name="car"),
]

DataevalFeasibilityRun.extract() returns a single record per run. The IC variant populates BER bounds only; the OD variant also includes instance counts and dataset health statistics:

# IC run (simple — two scalar BER bounds)
DataevalFeasibilityRecord(
    run_uid=self.run_uid,
    dataset_id="cifar10",
    ber_upper=0.15,
    ber_lower=0.08,
)

# OD run (includes health stats)
DataevalFeasibilityRecord(
    run_uid=self.run_uid,
    dataset_id="coco-val",
    ber_upper=0.25,
    ber_lower=0.12,
    num_instances=500,
    num_classes=10,
    small_object_ratio=0.05,
    truncated_bbox_ratio=0.03,
    overlap_image_ratio=0.02,
    health_warning_count=0,
)

DataevalShiftRun.extract() returns a single record per run. Shift is a two-dataset capability, so it uses reference_dataset_id and evaluation_dataset_id instead of the single dataset_id convention. Drift test results are stored as direct scalars; OOD per-sample arrays are aggregated into summary statistics:

DataevalShiftRecord(
    run_uid=self.run_uid,
    reference_dataset_id="coco-train",
    evaluation_dataset_id="coco-val",
    # Drift: 3 tests x (drifted, distance, p_val, threshold) + per-feature counts for CVM/KS
    mmd_drifted=True,
    mmd_distance=0.45,
    mmd_p_val=0.01,
    mmd_threshold=0.05,
    cvm_drifted=True,
    cvm_distance=0.38,
    cvm_p_val=0.02,
    cvm_threshold=0.005,
    cvm_feature_drift_count=12,
    ks_drifted=False,
    ks_distance=0.12,
    ks_p_val=0.15,
    ks_threshold=0.005,
    ks_feature_drift_count=3,
    # OOD: aggregated from per-sample arrays
    ood_count=3,
    ood_total=50,
    ood_ratio=0.06,
    ood_mean_instance_score=0.72,
    ood_std_instance_score=0.15,
    ood_max_instance_score=1.05,
)

NrtkRobustnessRun.extract() returns multiple records — one per (theta_value, metric_key) pair. This is the same Entity-Attribute-Value pattern used by MaiteEvaluationRun. The is_primary flag marks rows for the capability's return_key metric:

[
    NrtkRobustnessRecord(
        run_uid=self.run_uid,
        dataset_id="cifar10",
        model_id="resnet50",
        metric_id="coco_metrics",
        perturber_class="BrightnessPerturber",
        perturber_type="Brightness Perturber",
        theta_key="factor",
        theta_index=0,
        theta_value=1.0,
        metric_key="accuracy",
        metric_value=0.95,
        is_primary=True,
    ),
    NrtkRobustnessRecord(
        run_uid=self.run_uid,
        dataset_id="cifar10",
        model_id="resnet50",
        metric_id="coco_metrics",
        perturber_class="BrightnessPerturber",
        perturber_type="Brightness Perturber",
        theta_key="factor",
        theta_index=0,
        theta_value=1.0,
        metric_key="f1_score",
        metric_value=0.90,
        is_primary=False,
    ),
    # ... one record per (theta, metric_key) pair
]

Step 2: The `runs` table is auto-populated¶

When you call store.write([run1, run2]), the store also writes rows to the runs table — one row per (run_uid, entity_type, entity_id) combination:

run_uid	capability_id	capability_table	entity_type	entity_id
a1b2...	checkmaite.core.DataevalCleaning	dataeval_cleaning	dataset	cifar10
c3d4...	checkmaite.core.MaiteEvaluation	maite_evaluation	dataset	cifar10
c3d4...	checkmaite.core.MaiteEvaluation	maite_evaluation	model	resnet50
c3d4...	checkmaite.core.MaiteEvaluation	maite_evaluation	metric	map50

This table enables cross-capability queries filtered by any entity:

-- Find all capability tables that have results for a specific dataset
SELECT DISTINCT capability_table
FROM runs
WHERE entity_type = 'dataset' AND entity_id = 'cifar10'
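
To try the query without a populated store, the sample rows from the table above can be loaded into an in-memory SQLite database. This is only a stand-in for illustration: the real store is queried through `query_sql()`, and the shortened `run_uid` values below are placeholders for the truncated ones shown in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (run_uid TEXT, capability_id TEXT, "
    "capability_table TEXT, entity_type TEXT, entity_id TEXT)"
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
    [
        ("a1b2", "checkmaite.core.DataevalCleaning", "dataeval_cleaning", "dataset", "cifar10"),
        ("c3d4", "checkmaite.core.MaiteEvaluation", "maite_evaluation", "dataset", "cifar10"),
        ("c3d4", "checkmaite.core.MaiteEvaluation", "maite_evaluation", "model", "resnet50"),
        ("c3d4", "checkmaite.core.MaiteEvaluation", "maite_evaluation", "metric", "map50"),
    ],
)

# Same query as above, with ORDER BY added for a stable result.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT capability_table FROM runs "
        "WHERE entity_type = 'dataset' AND entity_id = 'cifar10' "
        "ORDER BY capability_table"
    )
]
print(tables)
```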

Querying across capability runs¶

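The two primary query patterns are the first two below; the remaining examples apply the direct-JOIN pattern to other capability tables.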

1. Direct JOIN via dataset_id (single-dataset capabilities)

Both DataevalCleaningRecord and MaiteEvaluationRecord include a dataset_id field. This enables direct joins:

-- Correlate dataset quality with model accuracy
SELECT
    d.dataset_id,
    d.exact_duplicate_ratio,
    d.image_outlier_ratio,
    m.output_value AS accuracy
FROM dataeval_cleaning d
JOIN maite_evaluation m ON d.dataset_id = m.dataset_id
WHERE m.output_key = 'accuracy' AND m.scope = 'overall'

2. JOIN via the runs table (general case)

When you need to filter by model, metric, or any other entity:

-- Get all evaluation results for a specific model
SELECT e.*
FROM maite_evaluation e
JOIN runs r ON e.run_uid = r.run_uid
WHERE r.entity_type = 'model' AND r.entity_id = 'resnet50'

3. Correlate feasibility with dataset quality

-- Compare BER with cleaning metrics for each dataset
SELECT
    f.dataset_id,
    f.ber_upper,
    f.ber_lower,
    c.exact_duplicate_ratio,
    c.image_outlier_ratio
FROM dataeval_feasibility f
JOIN dataeval_cleaning c ON f.dataset_id = c.dataset_id

4. Correlate drift with dataset feasibility

-- Compare drift detection with BER for the reference dataset
SELECT
    s.reference_dataset_id,
    s.mmd_drifted,
    s.mmd_p_val,
    f.ber_upper,
    f.ber_lower
FROM dataeval_shift s
JOIN dataeval_feasibility f ON s.reference_dataset_id = f.dataset_id

5. Query robustness curves alongside dataset quality

-- Correlate model robustness with dataset cleaning metrics
SELECT
    n.model_id,
    n.perturber_type,
    MIN(n.metric_value) AS worst_score,
    c.image_outlier_ratio
FROM nrtk_robustness n
JOIN dataeval_cleaning c ON n.dataset_id = c.dataset_id
WHERE n.is_primary = true
GROUP BY n.model_id, n.perturber_type, c.image_outlier_ratio
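
One property of the direct-JOIN pattern is worth seeing in isolation: because the store is append-only, a dataset may have several runs in each table, and joining on dataset_id pairs every run on one side with every run on the other. The sketch below shows this with in-memory SQLite as a stand-in for the store; the table contents and `run_uid` values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataeval_cleaning (run_uid TEXT, dataset_id TEXT, exact_duplicate_ratio REAL);
    CREATE TABLE maite_evaluation (
        run_uid TEXT, dataset_id TEXT, output_key TEXT, output_value REAL, scope TEXT
    );
    INSERT INTO dataeval_cleaning VALUES
        ('clean_1', 'cifar10', 0.00024),
        ('clean_2', 'cifar10', 0.00020);
    INSERT INTO maite_evaluation VALUES
        ('eval_1', 'cifar10', 'accuracy', 0.92, 'overall'),
        ('eval_2', 'cifar10', 'accuracy', 0.94, 'overall');
    """
)

join_sql = """
    FROM dataeval_cleaning d
    JOIN maite_evaluation m ON d.dataset_id = m.dataset_id
    WHERE m.output_key = 'accuracy' AND m.scope = 'overall'
"""

# Every cleaning run is paired with every evaluation run of the same dataset.
joined = conn.execute("SELECT d.run_uid, m.run_uid " + join_sql).fetchall()

# Aggregating per dataset collapses the pairs back to one row per dataset.
per_dataset = conn.execute(
    "SELECT d.dataset_id, AVG(d.exact_duplicate_ratio), AVG(m.output_value) "
    + join_sql
    + " GROUP BY d.dataset_id"
).fetchall()
print(len(joined), len(per_dataset))
```

If one row per dataset is wanted, aggregate as shown or filter each side to a single run before joining.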

Using the store from Python (Polars DataFrames)¶

query_sql() returns a Polars DataFrame, so you can chain SQL with Polars operations:

In [9]:

# SQL for filtering, Polars for transformation
df = backend.query_sql("SELECT * FROM example WHERE dataset_id = 'cifar10'")

# Continue with Polars API
print(df.select("score").describe())

shape: (9, 2)
┌────────────┬──────────┐
│ statistic  ┆ score    │
│ ---        ┆ ---      │
│ str        ┆ f64      │
╞════════════╪══════════╡
│ count      ┆ 2.0      │
│ null_count ┆ 0.0      │
│ mean       ┆ 0.93     │
│ std        ┆ 0.014142 │
│ min        ┆ 0.92     │
│ 25%        ┆ 0.92     │
│ 50%        ┆ 0.94     │
│ 75%        ┆ 0.94     │
│ max        ┆ 0.94     │
└────────────┴──────────┘

Using the store from external SQL tools¶

Note: This section is specific to the ParquetBackend. Other backends (e.g. a future DuckDB or SQL database backend) would provide their own native access paths.

Because the Parquet backend writes plain Parquet files with only scalar columns, any tool that reads Parquet can query the store directly — no Python required.

DuckDB (CLI or any SQL client):

-- Point DuckDB at the store directory
SELECT * FROM read_parquet('./analytics_store/dataeval_cleaning/*.parquet');

-- Cross-capability join
SELECT d.dataset_id, d.exact_duplicate_ratio, m.output_value
FROM read_parquet('./analytics_store/dataeval_cleaning/*.parquet') d
JOIN read_parquet('./analytics_store/maite_evaluation/*.parquet') m
  ON d.dataset_id = m.dataset_id
WHERE m.output_key = 'accuracy';

The store files are self-contained Parquet with native types — any Parquet reader in any language can consume them.

Part 3: Extending the Store¶

Adding storage support to a new capability¶

To make a capability's results queryable via the store, you need two things:

A BaseRecord subclass defining the table schema.
An extract() method on the capability's run class.

Here's the template:

In [10]:

from checkmaite.core.analytics_store import BaseRecord


# Step 1: Define the record class
class MyCapabilityRecord(BaseRecord, table_name="my_capability"):

    # Convention: include dataset_id for cross-capability JOINs
    dataset_id: str

    # Capability-specific fields (all must be scalar)
    primary_metric: float
    sample_count: int
    status: str  # e.g. "pass" or "fail"


# Step 2: Override extract() on the run class
# (shown as pseudocode — in practice this goes on your CapabilityRunBase subclass)
#
# def extract(self) -> list[MyCapabilityRecord]:
#     return [
#         MyCapabilityRecord(
#             run_uid=self.run_uid,
#             dataset_id=self.dataset_metadata[0]["id"],
#             primary_metric=self.outputs.some_value,
#             sample_count=len(self.outputs.results),
#             status="pass" if self.outputs.some_value > 0.9 else "fail",
#         )
#     ]

print("Record class is valid:", MyCapabilityRecord.table_name)

Record class is valid: my_capability

Design guidance for extract():

Summarise, don't serialise. The record should contain what an analyst needs to filter, group, and compare — not a dump of the full output.
One record per logical entity for fixed-schema outputs (e.g. DataevalCleaning returns one record per dataset).
One record per variable-length item for EAV-style outputs (e.g. MaiteEvaluation returns one record per metric output).
Include dataset_id if the capability operates on a single dataset. This is the primary JOIN key across capabilities.
Use Optional for fields that may not always be present (e.g. target outliers only exist for object detection).
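
The "summarise, don't serialise" guidance can be sketched as a small reduction from per-image data to scalars. Everything here is an assumption made for illustration: `summarise_cleaning` is a hypothetical helper, the input shapes are invented, and counting all but the first image of each duplicate group is one possible convention rather than the one the cleaning capability necessarily uses. Only the output field names follow the cleaning record shown earlier.

```python
from statistics import fmean


def summarise_cleaning(widths: list, brightness: list, duplicate_groups: list) -> dict:
    """Reduce per-image arrays to a handful of scalars suitable for one record."""
    image_count = len(widths)
    # Assumed convention: each group lists indices of identical images,
    # and every image after the first in a group counts as a duplicate.
    duplicate_count = sum(len(group) - 1 for group in duplicate_groups)
    return {
        "image_count": image_count,
        "mean_width": fmean(widths),
        "mean_brightness": fmean(brightness),
        "exact_duplicate_count": duplicate_count,
        "exact_duplicate_ratio": duplicate_count / image_count,
    }


summary = summarise_cleaning(
    widths=[480, 480, 640, 320],
    brightness=[0.4, 0.5, 0.6, 0.5],
    duplicate_groups=[[0, 1]],
)
print(summary)
```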

Implementing a custom `StorageBackend`¶

The StorageBackend protocol has four methods. Any class implementing them can replace ParquetBackend:

from collections.abc import Sequence
import polars as pl
from checkmaite.core.analytics_store import BaseRecord, StorageBackend


class DuckDBBackend:
    """Example: a DuckDB-backed storage backend."""

    def __init__(self, db_path: str) -> None:
        import duckdb
        self.conn = duckdb.connect(db_path)

    def write(self, records: Sequence[BaseRecord]) -> None:
        # Group by table, create table if needed, INSERT
        ...

    def list_tables(self) -> list[str]:
        # SELECT table_name FROM information_schema.tables
        ...

    def describe_table(self, table_name: str) -> dict[str, str]:
        # DESCRIBE {table_name}
        ...

    def query_sql(self, sql: str) -> pl.DataFrame:
        # Execute SQL, return as Polars DataFrame
        return self.conn.execute(sql).pl()


# Usage is identical:
# store = AnalyticsStore(DuckDBBackend("./analytics.duckdb"))
# store.write([run1, run2])
# store.query_sql("SELECT ...")

This is the intended scale pathway. The Parquet backend is the starting point; DuckDB, Delta Lake, or a SQL database is the destination when you need transactions, concurrency, or better query performance.
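
For a version that runs end to end with only the standard library, here is a sketch of the same four-method shape on top of SQLite. It is not a drop-in backend: `ToyStorageBackend`, `SQLiteBackend`, and `ExampleRow` are hypothetical, records are plain dataclasses rather than `BaseRecord` subclasses, and `query_sql()` returns a list of tuples instead of a Polars DataFrame.

```python
import sqlite3
from dataclasses import asdict, dataclass
from typing import ClassVar, Protocol, Sequence


class ToyStorageBackend(Protocol):
    def write(self, records: Sequence) -> None: ...
    def list_tables(self) -> list: ...
    def describe_table(self, table_name: str) -> dict: ...
    def query_sql(self, sql: str) -> list: ...


@dataclass
class ExampleRow:
    table_name: ClassVar[str] = "example"  # not a column
    run_uid: str
    dataset_id: str
    score: float


SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL", bool: "INTEGER"}


class SQLiteBackend:
    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)

    def write(self, records: Sequence) -> None:
        seen_before: dict = {}  # table -> run_uids persisted by earlier calls
        for rec in records:
            table, row = rec.table_name, asdict(rec)
            if table not in seen_before:
                # Table names come from record classes, not from user input.
                ddl = ", ".join(f"{k} {SQL_TYPES[type(v)]}" for k, v in row.items())
                self.conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({ddl})")
                cur = self.conn.execute(f"SELECT run_uid FROM {table}")
                seen_before[table] = {r[0] for r in cur}
            if row["run_uid"] in seen_before[table]:
                continue  # idempotent across calls
            marks = ", ".join("?" for _ in row)
            self.conn.execute(f"INSERT INTO {table} VALUES ({marks})", tuple(row.values()))
        self.conn.commit()

    def list_tables(self) -> list:
        cur = self.conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        return [r[0] for r in cur]

    def describe_table(self, table_name: str) -> dict:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        return {r[1]: r[2] for r in self.conn.execute(f"PRAGMA table_info({table_name})")}

    def query_sql(self, sql: str) -> list:
        return self.conn.execute(sql).fetchall()


sqlite_backend: ToyStorageBackend = SQLiteBackend()
sqlite_backend.write([ExampleRow("run_1", "cifar10", 0.92), ExampleRow("run_2", "mnist", 0.98)])
sqlite_backend.write([ExampleRow("run_1", "cifar10", 0.92)])  # skipped: already written
result = sqlite_backend.query_sql("SELECT dataset_id, score FROM example ORDER BY score DESC")
print(result)
```

The protocol is structural, so `SQLiteBackend` never has to inherit from it; matching the method signatures is enough for a type checker to accept the assignment.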

Part 4: File layout and external access¶

The Parquet backend produces this directory structure:

analytics_store/
  dataeval_cleaning/
    1706000000000_a1b2c3d4.parquet
    1706000060000_e5f6a7b8.parquet
  maite_evaluation/
    1706000000000_c9d0e1f2.parquet
  runs/
    1706000000000_g3h4i5j6.parquet

Each write() call creates one file per table. File names are {timestamp_ms}_{uuid_8char}.parquet for uniqueness and chronological sorting.
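
A name in that shape could be produced as follows. This is a guess at the construction based only on the pattern described above; `parquet_file_name` is a hypothetical helper, and the backend may derive the timestamp or suffix differently.

```python
import time
import uuid


def parquet_file_name() -> str:
    timestamp_ms = int(time.time() * 1000)  # milliseconds since the Unix epoch
    suffix = uuid.uuid4().hex[:8]           # 8 random hex characters
    return f"{timestamp_ms}_{suffix}.parquet"


file_name = parquet_file_name()
print(file_name)
```

With a fixed-width millisecond prefix, sorting names lexically also sorts them chronologically, and the random suffix keeps two writes in the same millisecond from colliding.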

The files are plain Parquet with scalar columns only — no custom metadata, no manifest files, no lock files, no binary references. Any tool that reads Parquet (DuckDB, Spark, pandas, pyarrow, Snowflake external tables) can read them directly.

In [11]:

# Inspect the files on disk
for p in sorted(Path(store_dir).rglob("*.parquet")):
    print(f"  {p.relative_to(store_dir)}  ({p.stat().st_size:,} bytes)")

  example/1779474735838_0ec8e733.parquet  (2,337 bytes)
  example/1779474735871_078c806a.parquet  (2,365 bytes)

Part 5: Record Schema Reference¶

Each capability that supports the analytics store defines a record class with scalar-only fields. All records share two common fields from BaseRecord:

Field	Type	Description
`run_uid`	str	SHA-256 hash linking to the capability run
`created_at`	datetime	Timestamp when the record was created (auto-generated)

Use store.describe_table("table_name") at runtime to inspect the schema of any table, or store.list_tables() to see which tables have data.
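A small helper can combine the two calls to print every schema at once. This is a sketch that assumes list_tables() returns table names and describe_table() returns a column-to-type mapping, as the backend signatures suggest. dump_schemas and _FakeStore are hypothetical names, and the type strings in the fake store are placeholders rather than real output.

```python
from __future__ import annotations


def dump_schemas(store) -> dict[str, dict[str, str]]:
    """Map every table name to its {column: type} schema."""
    return {name: store.describe_table(name) for name in store.list_tables()}


class _FakeStore:
    """Hypothetical stand-in exposing only the two methods used above."""

    _schemas = {"example": {"run_uid": "String", "created_at": "Datetime"}}

    def list_tables(self) -> list[str]:
        return sorted(self._schemas)

    def describe_table(self, table_name: str) -> dict[str, str]:
        return dict(self._schemas[table_name])


# With a real store, pass the AnalyticsStore instance instead of the fake.
schemas = dump_schemas(_FakeStore())
for table, columns in schemas.items():
    print(table)
    for column, dtype in columns.items():
        print(f"  {column}: {dtype}")
```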

`dataeval_cleaning`¶

One record per dataset. Summarises dataset quality: duplicates, outliers, visual properties, and class balance.

Field	Type	Description
`dataset_id`	str	Dataset identifier (cross-capability JOIN key)
`exact_duplicate_count`	int	Number of exact duplicate images
`exact_duplicate_ratio`	float	Fraction of exact duplicates
`near_duplicate_count`	int	Number of near-duplicate images
`near_duplicate_ratio`	float	Fraction of near duplicates
`image_outlier_count`	int	Number of image-level outliers
`image_outlier_ratio`	float	Fraction of image outliers
`class_count`	int	Number of unique classes
`label_count`	int	Total label count across all images
`image_count`	int	Total number of images
`target_outlier_count`	int | None	Target-level outlier count (OD only)
`target_outlier_ratio`	float | None	Fraction of target outliers (OD only)
`mean_width`	float	Mean image width
`mean_height`	float	Mean image height
`std_aspect_ratio`	float	Standard deviation of aspect ratios
`mean_brightness`	float	Mean image brightness
`mean_contrast`	float	Mean image contrast
`mean_sharpness`	float	Mean image sharpness
`class_imbalance_ratio`	float	Ratio of largest to smallest class
`min_class_image_count`	int	Smallest class size
`max_class_image_count`	int	Largest class size
`mean_labels_per_image`	float	Average labels per image

`dataeval_bias`¶

One record per dataset. Summarises coverage, balance, and diversity metrics. Balance and diversity fields are None when the dataset has no usable metadata factors.

Field	Type	Description
`dataset_id`	str	Dataset identifier (cross-capability JOIN key)
`coverage_total`	int	Total number of images
`coverage_uncovered_count`	int	Number of under-represented images
`coverage_uncovered_ratio`	float	Fraction of uncovered images
`coverage_radius`	float	Coverage radius used for detection
`balance_num_factors`	int | None	Number of metadata factors analysed
`balance_mean`	float | None	Mean balance score across factors
`balance_max`	float | None	Maximum balance score
`balance_factors_above_05`	int | None	Factors with balance >= 0.5
`diversity_num_factors`	int | None	Number of diversity factors
`diversity_mean`	float | None	Mean diversity index
`diversity_min`	float | None	Minimum diversity index
`diversity_factors_below_04`	int | None	Factors with diversity < 0.4

`maite_evaluation`¶

One record per (output_key, scope) pair. Stores metric results in Entity-Attribute-Value format, with optional per-class breakdown.

Field	Type	Description
`dataset_id`	str	Dataset identifier (cross-capability JOIN key)
`model_id`	str	Model identifier
`metric_id`	str	Metric identifier
`output_key`	str	Metric output key (e.g., "accuracy", "map50")
`output_value`	float	Metric value
`scope`	str	"overall" or "class"
`class_name`	str | None	Class name (when scope is "class")
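Because this table is stored as Entity-Attribute-Value rows, comparing metrics side by side usually means pivoting them into columns. The sketch below shows one way to do that with conditional aggregation. It uses the standard-library sqlite3 module purely as a stand-in so the query can run without a store, the rows are made-up toy values, and it is an assumption that the same SQL text is accepted by store.query_sql.

```python
import sqlite3

# Stand-in database holding a subset of the documented columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE maite_evaluation (
        run_uid TEXT, dataset_id TEXT, model_id TEXT, metric_id TEXT,
        output_key TEXT, output_value REAL, scope TEXT, class_name TEXT
    )
    """
)
toy_rows = [  # illustrative values only, not real results
    ("run-a", "cifar10", "resnet18", "m1", "accuracy", 0.93, "overall", None),
    ("run-a", "cifar10", "resnet18", "m1", "f1_score", 0.92, "overall", None),
    ("run-a", "cifar10", "resnet18", "m1", "accuracy", 0.88, "class", "cat"),
    ("run-b", "cifar10", "vgg11", "m1", "accuracy", 0.90, "overall", None),
    ("run-b", "cifar10", "vgg11", "m1", "f1_score", 0.89, "overall", None),
]
conn.executemany("INSERT INTO maite_evaluation VALUES (?, ?, ?, ?, ?, ?, ?, ?)", toy_rows)

# One output row per run: each CASE picks out a single output_key,
# and MAX collapses the group to that one non-NULL value.
PIVOT_SQL = """
SELECT run_uid, model_id, dataset_id,
       MAX(CASE WHEN output_key = 'accuracy' THEN output_value END) AS accuracy,
       MAX(CASE WHEN output_key = 'f1_score' THEN output_value END) AS f1_score
FROM maite_evaluation
WHERE scope = 'overall'
GROUP BY run_uid, model_id, dataset_id
ORDER BY accuracy DESC
"""
wide_rows = conn.execute(PIVOT_SQL).fetchall()
for row in wide_rows:
    print(row)
```

Filtering on scope = 'overall' keeps per-class rows out of the aggregate; swapping it for scope = 'class' and adding class_name to the GROUP BY gives the per-class breakdown instead.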

`dataeval_feasibility`¶

One record per dataset. Stores Bayes Error Rate bounds. OD-specific health stats are None for IC runs.

Field	Type	Description
`dataset_id`	str	Dataset identifier (cross-capability JOIN key)
`ber_upper`	float	Upper bound on Bayes Error Rate
`ber_lower`	float	Lower bound on Bayes Error Rate
`num_instances`	int | None	Total valid instance crops (OD only)
`num_classes`	int | None	Number of unique classes (OD only)
`small_object_ratio`	float | None	Fraction of small objects (OD only)
`truncated_bbox_ratio`	float | None	Fraction of boundary-touching boxes (OD only)
`overlap_image_ratio`	float | None	Fraction of images with high-IoU box pairs (OD only)
`health_warning_count`	int | None	Number of health warnings (OD only)
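The shared dataset_id column is what lets tables from different capabilities be joined. As a hedged sketch, the query below lists each model's overall accuracy next to the Bayes Error Rate bounds of the dataset it was evaluated on. sqlite3 is a stand-in for the store, the tables hold only the columns the query needs, and the values are toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataeval_feasibility (dataset_id TEXT, ber_upper REAL, ber_lower REAL);
    CREATE TABLE maite_evaluation (
        dataset_id TEXT, model_id TEXT, output_key TEXT, output_value REAL, scope TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO dataeval_feasibility VALUES (?, ?, ?)",
    [("cifar10", 0.08, 0.03), ("svhn", 0.05, 0.02)],  # illustrative values
)
conn.executemany(
    "INSERT INTO maite_evaluation VALUES (?, ?, ?, ?, ?)",
    [
        ("cifar10", "resnet18", "accuracy", 0.93, "overall"),
        ("cifar10", "resnet18", "accuracy", 0.88, "class"),
        ("svhn", "resnet18", "accuracy", 0.96, "overall"),
    ],
)

# dataset_id is the cross-capability JOIN key named in the field tables.
JOIN_SQL = """
SELECT e.dataset_id, e.model_id, e.output_value AS accuracy, f.ber_lower, f.ber_upper
FROM maite_evaluation AS e
JOIN dataeval_feasibility AS f ON f.dataset_id = e.dataset_id
WHERE e.scope = 'overall' AND e.output_key = 'accuracy'
ORDER BY e.dataset_id, e.model_id
"""
joined = conn.execute(JOIN_SQL).fetchall()
for row in joined:
    print(row)
```

If a dataset has several feasibility runs, this join returns one row per combination, so a real query may also need to pick a single run, for example the latest created_at.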

`dataeval_shift`¶

One record per run. Stores drift detection and OOD summary metrics. Uses two dataset IDs (reference and evaluation) instead of the single dataset_id convention.

Field	Type	Description
`reference_dataset_id`	str	Reference (baseline) dataset identifier
`evaluation_dataset_id`	str	Evaluation (test) dataset identifier
`mmd_drifted`	bool	Whether MMD detected drift
`mmd_distance`	float	MMD test statistic
`mmd_p_val`	float	MMD p-value
`mmd_threshold`	float	MMD significance threshold
`cvm_drifted`	bool	Whether CVM detected drift
`cvm_distance`	float	CVM mean test statistic
`cvm_p_val`	float	CVM combined p-value
`cvm_threshold`	float	CVM significance threshold
`cvm_feature_drift_count`	int	Number of individually drifted features (CVM)
`ks_drifted`	bool	Whether KS detected drift
`ks_distance`	float	KS mean test statistic
`ks_p_val`	float	KS combined p-value
`ks_threshold`	float	KS significance threshold
`ks_feature_drift_count`	int	Number of individually drifted features (KS)
`ood_count`	int	Number of OOD samples in evaluation set
`ood_total`	int	Total samples in evaluation set
`ood_ratio`	float	Fraction of OOD samples
`ood_mean_instance_score`	float	Mean OOD instance score
`ood_std_instance_score`	float	Std dev of OOD instance scores
`ood_max_instance_score`	float	Maximum OOD instance score
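One use of the three boolean drift flags is a simple vote across detectors. The sketch below keeps runs where at least two of MMD, CVM, and KS flagged drift. The two-of-three rule is an illustrative choice, not something the capability prescribes. sqlite3 stands in for the store and the rows are toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataeval_shift (
        reference_dataset_id TEXT, evaluation_dataset_id TEXT,
        mmd_drifted BOOLEAN, cvm_drifted BOOLEAN, ks_drifted BOOLEAN,
        ood_ratio REAL
    )
    """
)
conn.executemany(
    "INSERT INTO dataeval_shift VALUES (?, ?, ?, ?, ?, ?)",
    [  # illustrative values only
        ("train", "field_a", True, True, False, 0.04),
        ("train", "field_b", False, False, False, 0.005),
        ("train", "field_c", True, True, True, 0.21),
    ],
)

# Casting each flag to an integer lets the flags be summed into a vote count.
# The subquery exists so the outer WHERE can filter on the computed column.
VOTE_SQL = """
SELECT reference_dataset_id, evaluation_dataset_id, detectors_flagging, ood_ratio
FROM (
    SELECT reference_dataset_id, evaluation_dataset_id, ood_ratio,
           CAST(mmd_drifted AS INTEGER)
         + CAST(cvm_drifted AS INTEGER)
         + CAST(ks_drifted AS INTEGER) AS detectors_flagging
    FROM dataeval_shift
) AS votes
WHERE detectors_flagging >= 2
ORDER BY detectors_flagging DESC, ood_ratio DESC
"""
flagged = conn.execute(VOTE_SQL).fetchall()
for row in flagged:
    print(row)
```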

`nrtk_robustness`¶

One record per (theta_value, metric_key) pair. Stores per-perturbation-point metric values in Entity-Attribute-Value format, enabling full robustness curve reconstruction via SQL.

Field	Type	Description
`dataset_id`	str	Dataset identifier (cross-capability JOIN key)
`model_id`	str	Model identifier
`metric_id`	str	Metric identifier
`perturber_class`	str	Perturber class name (e.g., "BrightnessPerturber")
`perturber_type`	str	Human-readable perturber label (e.g., "Brightness Perturber")
`theta_key`	str	Perturbation parameter name (e.g., "factor", "ksize")
`theta_index`	int	Ordinal position in the sweep (0-based)
`theta_value`	float	Parameter value at this perturbation level
`metric_key`	str	Metric output key (e.g., "accuracy", "f1_score")
`metric_value`	float	Score at this perturbation level
`is_primary`	bool	True when metric_key matches the capability's return_key
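Reconstructing a robustness curve amounts to selecting the primary metric and ordering by theta_index. The sketch below does that and groups the rows into one list of (theta_value, metric_value) points per perturber. sqlite3 is a stand-in for the store, only the columns the query needs are created, and the rows are toy values inserted out of order on purpose.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE nrtk_robustness (
        dataset_id TEXT, model_id TEXT, perturber_class TEXT, theta_key TEXT,
        theta_index INTEGER, theta_value REAL,
        metric_key TEXT, metric_value REAL, is_primary BOOLEAN
    )
    """
)
conn.executemany(
    "INSERT INTO nrtk_robustness VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [  # illustrative values only
        ("cifar10", "resnet18", "BrightnessPerturber", "factor", 2, 1.5, "accuracy", 0.81, True),
        ("cifar10", "resnet18", "BrightnessPerturber", "factor", 0, 0.5, "accuracy", 0.78, True),
        ("cifar10", "resnet18", "BrightnessPerturber", "factor", 1, 1.0, "accuracy", 0.93, True),
        ("cifar10", "resnet18", "BrightnessPerturber", "factor", 1, 1.0, "f1_score", 0.92, False),
    ],
)

# theta_index restores the sweep order regardless of insertion order.
CURVE_SQL = """
SELECT perturber_class, theta_key, theta_value, metric_value
FROM nrtk_robustness
WHERE is_primary AND dataset_id = 'cifar10' AND model_id = 'resnet18'
ORDER BY perturber_class, theta_key, theta_index
"""
curves = defaultdict(list)
for perturber, theta_key, theta_value, metric_value in conn.execute(CURVE_SQL):
    curves[(perturber, theta_key)].append((theta_value, metric_value))

for key, points in curves.items():
    print(key, points)
```

Each entry in curves can be passed directly to a plotting library as x/y pairs; dropping the is_primary filter and adding metric_key to the grouping key yields one curve per metric.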

Summary¶

Aspect	Detail
Purpose	Query and compare capability results across runs
Content	Curated scalar summaries from `extract()`
Format	Scalar columns only (Parquet by default; pluggable via `StorageBackend`)
Query interface	SQL, Polars, any Parquet reader
Cross-run queries	Native (`GROUP BY`, `JOIN`, `WHERE`)
Non-Python access	Any Parquet-capable tool (DuckDB, Spark, etc.)
Populated by	`store.write([run1, ...])` (explicit)
Deduplicated by	`run_uid` (on write)
Extensibility	`StorageBackend` protocol — swap Parquet for DuckDB, Delta Lake, Postgres, etc.