Analytics Store Tutorial

A guide to saving and querying capability results with the checkmaite analytics store.

NOTE: This tutorial demonstrates the analytics store using object detection (OD) datasets and the DataevalCleaning capability. The analytics store works identically for image classification (IC) capabilities.

This tutorial uses direct local writes (store.write([run])). For distributed job submission, the responsibilities split:

- the client chooses the durable store location,
- the worker needs that information to persist results,
- and the client later reads from that same location.

So distributed runs require explicit configuration:

from checkmaite.jobs import configure_job_backend

configure_job_backend(
    "ray",
    analytics_store={"backend": "parquet", "uri": "./analytics_store"},
)

checkmaite.jobs._store defines AnalyticsStoreConfig; the job backend forwards this config on submissions so workers write to the intended store. See Analytics store in distributed execution and Job backend configuration.
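
The handoff described above can be pictured with a small standalone sketch. It does not use checkmaite: `SketchStoreConfig`, `client_submit`, and `worker_receive` are hypothetical names invented for illustration, and the real `AnalyticsStoreConfig` may carry different fields. The only idea shown is that the store location travels with the job so the worker writes where the client will later read.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SketchStoreConfig:
    """Hypothetical stand-in for a store config; not checkmaite's AnalyticsStoreConfig."""

    backend: str
    uri: str


def client_submit(config: SketchStoreConfig) -> str:
    """Client side: serialize the chosen store location alongside the job payload."""
    return json.dumps({"job": "dataeval_cleaning", "analytics_store": asdict(config)})


def worker_receive(payload: str) -> SketchStoreConfig:
    """Worker side: rebuild the config so results are persisted to the client's store."""
    message = json.loads(payload)
    return SketchStoreConfig(**message["analytics_store"])


client_config = SketchStoreConfig(backend="parquet", uri="./analytics_store")
worker_config = worker_receive(client_submit(client_config))

# Both sides now agree on where results are written and later read.
print(worker_config == client_config)
```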

What is the Analytics Store?

When you run a capability (such as DataevalCleaning or DataevalBias), the results are Python objects in memory. They disappear when your notebook restarts. The analytics store solves this by:

- Persisting results via a pluggable storage backend (Parquet by default) so they survive notebook restarts
- Enabling SQL queries across runs, datasets, and capabilities
- Supporting external tools: with the default Parquet backend, DuckDB, Spark, pandas, or any Parquet reader can query the same files

Each capability extracts scalar summary metrics from its rich outputs (e.g., duplicate count, outlier ratio) into flat records that map directly to SQL rows.
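
As a rough illustration of that flattening step, the sketch below turns a rich result object into one flat record. `SketchCleaningOutput` and `extract_record` are hypothetical and are not checkmaite's classes or its `extract()` implementation. The column names are borrowed from the `dataeval_cleaning` schema shown later in this tutorial, and the way counts and ratios are derived here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SketchCleaningOutput:
    """Hypothetical rich output: per-image findings rather than summary numbers."""

    image_count: int
    duplicate_indices: list[int]
    outlier_indices: list[int]


def extract_record(run_uid: str, dataset_id: str, output: SketchCleaningOutput) -> dict:
    """Reduce the rich output to scalar columns that fit a single SQL row."""
    total = output.image_count
    duplicate_count = len(output.duplicate_indices)
    outlier_count = len(output.outlier_indices)
    return {
        "run_uid": run_uid,
        "dataset_id": dataset_id,
        "image_count": total,
        "exact_duplicate_count": duplicate_count,
        # Assumed definition: ratio = count / number of images (0.0 for an empty dataset).
        "exact_duplicate_ratio": duplicate_count / total if total else 0.0,
        "image_outlier_count": outlier_count,
        "image_outlier_ratio": outlier_count / total if total else 0.0,
    }


output = SketchCleaningOutput(image_count=4, duplicate_indices=[], outlier_indices=[2])
record = extract_record("run-a", "coco-example", output)
print(record)
```

Every value in the record is a string or a number, which is what lets a row be written to Parquet and queried with plain SQL.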

The store manages two kinds of tables:

- Capability tables (e.g., dataeval_cleaning): one per capability, containing the extracted metrics
- runs table: auto-populated metadata mapping each run to its datasets, models, and metrics

Setup

import tempfile
from pathlib import Path

from checkmaite.core.analytics_store import AnalyticsStore, ParquetBackend
from checkmaite.core.object_detection import DataevalCleaning
from checkmaite.core.object_detection.dataset_loaders import CocoDetectionDataset

# Create a store backed by a temporary directory
# In practice, use a persistent path like "./analytics_store"
store_dir = tempfile.mkdtemp(prefix="analytics_store_guide_")
store = AnalyticsStore(ParquetBackend(store_dir))
print(f"Store created at: {store_dir}")

Store created at: /tmp/analytics_store_guide_sv6g64lo


Running a Capability and Saving Results

The typical workflow is: load a dataset, run a capability, then write the results to the store.

We use the CocoDetectionDataset wrapper to load a small subset of the COCO 2017 dataset included in the repository's test data.

# Load the example COCO dataset.
# BASE_DIR assumes the notebook runs from a directory two levels below the repository root.
BASE_DIR = Path.cwd().parents[1]
dataset_root = BASE_DIR / "tests/data_for_tests/coco_dataset"
dataset_ann = dataset_root / "ann_file.json"

dataset = CocoDetectionDataset(
    root=str(dataset_root),
    ann_file=str(dataset_ann),
    dataset_id="coco-example",
)
print(f"Loaded dataset: {dataset.metadata['id']} ({len(dataset)} images)")

# Run DataevalCleaning
capability = DataevalCleaning()
run = capability.run(datasets=[dataset], use_cache=False)

# Write to the store
store.write([run])

# Inspect what was written
# list_tables() shows which capabilities have written results
print(f"\nTables: {store.list_tables()}")

# describe_table() shows the available columns and types — useful for writing SQL queries
print("\nSchema for 'dataeval_cleaning':")
for col, dtype in store.describe_table("dataeval_cleaning").items():
    print(f"  {col}: {dtype}")

Loaded dataset: coco-example (4 images)

Tables: ['dataeval_cleaning', 'runs']

Schema for 'dataeval_cleaning':
  run_uid: String
  created_at: Datetime(time_unit='us', time_zone='UTC')
  dataset_id: String
  exact_duplicate_count: Int64
  exact_duplicate_ratio: Float64
  near_duplicate_count: Int64
  near_duplicate_ratio: Float64
  image_outlier_count: Int64
  image_outlier_ratio: Float64
  class_count: Int64
  label_count: Int64
  image_count: Int64
  target_outlier_count: Int64
  target_outlier_ratio: Float64
  mean_width: Float64
  mean_height: Float64
  std_aspect_ratio: Float64
  mean_brightness: Float64
  mean_contrast: Float64
  mean_sharpness: Float64
  class_imbalance_ratio: Float64
  min_class_image_count: Int64
  max_class_image_count: Int64
  mean_labels_per_image: Float64

Querying Results via SQL

The store exposes a query_sql() method that accepts standard SQL and returns a Polars DataFrame.

# View all cleaning records
df = store.query_sql("SELECT * FROM dataeval_cleaning")
print(df)

shape: (1, 24)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ run_uid   ┆ created_a ┆ dataset_i ┆ exact_dup ┆ … ┆ class_imb ┆ min_class ┆ max_class ┆ mean_lab │
│ ---       ┆ t         ┆ d         ┆ licate_co ┆   ┆ alance_ra ┆ _image_co ┆ _image_co ┆ els_per_ │
│ str       ┆ ---       ┆ ---       ┆ unt       ┆   ┆ tio       ┆ unt       ┆ unt       ┆ image    │
│           ┆ datetime[ ┆ str       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ μs, UTC]  ┆           ┆ i64       ┆   ┆ f64       ┆ i64       ┆ i64       ┆ f64      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 1f20f61b0 ┆ 2026-05-2 ┆ coco-exam ┆ 0         ┆ … ┆ 3.0       ┆ 1         ┆ 3         ┆ 14.25    │
│ 72156430e ┆ 2 18:40:2 ┆ ple       ┆           ┆   ┆           ┆           ┆           ┆          │
│ 1bfc465ba ┆ 1.156592  ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ a1c…      ┆ UTC       ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘

# Select specific fields
df = store.query_sql("""
    SELECT
        dataset_id,
        exact_duplicate_count,
        exact_duplicate_ratio,
        image_outlier_count,
        image_outlier_ratio,
        mean_brightness
    FROM dataeval_cleaning
""")
print(df)

shape: (1, 6)
┌──────────────┬────────────────┬────────────────┬────────────────┬────────────────┬───────────────┐
│ dataset_id   ┆ exact_duplicat ┆ exact_duplicat ┆ image_outlier_ ┆ image_outlier_ ┆ mean_brightne │
│ ---          ┆ e_count        ┆ e_ratio        ┆ count          ┆ ratio          ┆ ss            │
│ str          ┆ ---            ┆ ---            ┆ ---            ┆ ---            ┆ ---           │
│              ┆ i64            ┆ f64            ┆ i64            ┆ f64            ┆ f64           │
╞══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═══════════════╡
│ coco-example ┆ 0              ┆ 0.0            ┆ 0              ┆ 0.0            ┆ 0.305882      │
└──────────────┴────────────────┴────────────────┴────────────────┴────────────────┴───────────────┘

Comparing Across Runs

Every run gets a unique run_uid and the store deduplicates by this key. You can run the same capability on different datasets and compare results side by side.

The store also auto-populates a runs table that maps each run_uid to its datasets, models, and metrics.
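
To make "deduplicates by this key" concrete, here is a minimal sketch in plain Python. It is not the store's implementation: `deduplicate_by_run_uid` is a hypothetical helper, the run identifiers are made up, and which copy the real store keeps when a `run_uid` appears twice is not stated in this tutorial. Keeping the first occurrence is an assumption made only for the example.

```python
def deduplicate_by_run_uid(records):
    """Keep one record per run_uid; the first occurrence wins (an assumption)."""
    unique = {}
    for record in records:
        # setdefault only stores the record if this run_uid has not been seen yet.
        unique.setdefault(record["run_uid"], record)
    return list(unique.values())


records = [
    {"run_uid": "run-a", "dataset_id": "coco-example"},
    {"run_uid": "run-a", "dataset_id": "coco-example"},  # same run written twice
    {"run_uid": "run-b", "dataset_id": "coco-resized"},
]

deduplicated = deduplicate_by_run_uid(records)
print([r["run_uid"] for r in deduplicated])
```

The practical consequence is that writing the same run more than once should not produce duplicate rows in query results.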

# Run cleaning on a second dataset
dataset_root_2 = BASE_DIR / "tests/data_for_tests/coco_resized_val2017"
dataset_ann_2 = dataset_root_2 / "instances_val2017_resized_6.json"

dataset_2 = CocoDetectionDataset(
    root=str(dataset_root_2),
    ann_file=str(dataset_ann_2),
    dataset_id="coco-resized",
)

run_2 = capability.run(datasets=[dataset_2], use_cache=False)
store.write([run_2])

# Compare cleaning results across both datasets
df = store.query_sql("""
    SELECT
        dataset_id,
        exact_duplicate_count,
        image_outlier_count,
        image_outlier_ratio,
        mean_brightness
    FROM dataeval_cleaning
    ORDER BY dataset_id
""")
print(df)

shape: (2, 5)
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬─────────────────┐
│ dataset_id   ┆ exact_duplicate_cou ┆ image_outlier_count ┆ image_outlier_ratio ┆ mean_brightness │
│ ---          ┆ nt                  ┆ ---                 ┆ ---                 ┆ ---             │
│ str          ┆ ---                 ┆ i64                 ┆ f64                 ┆ f64             │
│              ┆ i64                 ┆                     ┆                     ┆                 │
╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪═════════════════╡
│ coco-example ┆ 0                   ┆ 0                   ┆ 0.0                 ┆ 0.305882        │
│ coco-resized ┆ 0                   ┆ 0                   ┆ 0.0                 ┆ 0.35098         │
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴─────────────────┘

# Check the auto-populated runs table
df_runs = store.query_sql("""
    SELECT run_uid, capability_table, entity_type, entity_id
    FROM runs
    ORDER BY entity_id
""")
print(df_runs)

shape: (2, 4)
┌─────────────────────────────────┬───────────────────┬─────────────┬──────────────┐
│ run_uid                         ┆ capability_table  ┆ entity_type ┆ entity_id    │
│ ---                             ┆ ---               ┆ ---         ┆ ---          │
│ str                             ┆ str               ┆ str         ┆ str          │
╞═════════════════════════════════╪═══════════════════╪═════════════╪══════════════╡
│ 1f20f61b072156430e1bfc465baa1c… ┆ dataeval_cleaning ┆ dataset     ┆ coco-example │
│ 16aa32205722d84afb23fe421301bf… ┆ dataeval_cleaning ┆ dataset     ┆ coco-resized │
└─────────────────────────────────┴───────────────────┴─────────────┴──────────────┘
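
Because both tables share `run_uid`, they can be joined. The sketch below shows the shape of such a join using Python's built-in `sqlite3` as a stand-in for the store, so it runs without checkmaite. The table and column names follow the outputs above, the `run-a` and `run-b` identifiers are placeholders for the real (truncated) run UIDs, and whether the same query string runs unchanged through `store.query_sql()` has not been verified here.

```python
import sqlite3

# In-memory SQLite database standing in for the analytics store.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataeval_cleaning (run_uid TEXT, dataset_id TEXT, mean_brightness REAL);
    CREATE TABLE runs (run_uid TEXT, capability_table TEXT, entity_type TEXT, entity_id TEXT);
    """
)
conn.executemany(
    "INSERT INTO dataeval_cleaning VALUES (?, ?, ?)",
    [("run-a", "coco-example", 0.305882), ("run-b", "coco-resized", 0.35098)],
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("run-a", "dataeval_cleaning", "dataset", "coco-example"),
        ("run-b", "dataeval_cleaning", "dataset", "coco-resized"),
    ],
)

# Join run metadata to the capability metrics on the shared run_uid key.
JOIN_SQL = """
    SELECT r.entity_id, r.capability_table, c.mean_brightness
    FROM runs AS r
    JOIN dataeval_cleaning AS c ON c.run_uid = r.run_uid
    WHERE r.entity_type = 'dataset'
    ORDER BY r.entity_id
"""
rows = conn.execute(JOIN_SQL).fetchall()
for row in rows:
    print(row)
conn.close()
```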

Using Results Outside Python

With the default ParquetBackend, the store writes plain Parquet files with scalar columns only. Any tool that reads Parquet can query them directly — no Python required.

DuckDB example (run from any SQL client):

SELECT dataset_id, exact_duplicate_ratio, image_outlier_ratio
FROM read_parquet('./analytics_store/dataeval_cleaning/*.parquet')
ORDER BY image_outlier_ratio DESC;

The file layout on disk:

analytics_store/
  dataeval_cleaning/
    1706000000000_a1b2c3d4.parquet
  runs/
    1706000000000_e5f6a7b8.parquet

Each write() call creates one Parquet file per table. File names are {timestamp_ms}_{uuid}.parquet.

# Show actual files on disk
for p in sorted(Path(store_dir).rglob("*.parquet")):
    print(f"  {p.relative_to(store_dir)}  ({p.stat().st_size:,} bytes)")

  dataeval_cleaning/1779475221157_ace31ecf.parquet  (9,607 bytes)
  dataeval_cleaning/1779475221276_85896084.parquet  (9,607 bytes)
  runs/1779475221160_aa465cb2.parquet  (3,631 bytes)
  runs/1779475221279_9a2324f6.parquet  (3,631 bytes)
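
Since file names follow `{timestamp_ms}_{uuid}.parquet`, the write time can be recovered from a name alone. The helper below is a hypothetical sketch, not part of checkmaite. It assumes the prefix is milliseconds since the Unix epoch in UTC and treats the part after the underscore as an opaque identifier.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_store_filename(path):
    """Split '{timestamp_ms}_{uuid}.parquet' into (write time in UTC, identifier)."""
    stem = Path(path).stem  # drops the directory and the '.parquet' extension
    timestamp_ms, identifier = stem.split("_", 1)
    # Integer millisecond arithmetic avoids floating-point rounding.
    written_at = EPOCH + timedelta(milliseconds=int(timestamp_ms))
    return written_at, identifier


written_at, identifier = parse_store_filename(
    "dataeval_cleaning/1779475221157_ace31ecf.parquet"
)
print(written_at.isoformat(), identifier)
```

For the first file listed above, the recovered time falls in the same second as the `created_at` value shown in the earlier query output.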

Next Steps

In this guide you learned how to:

- Create an analytics store with a storage backend (Parquet by default)
- Run a capability and write its results to the store
- Query results via SQL and compare across datasets
- Access the stored data from external tools (e.g., DuckDB, Spark)

To learn more:

- Run other capabilities (DataevalBias, DataevalFeasibility, DataevalShift, MaiteEvaluation, etc.) and write their results to the same store for cross-capability SQL JOINs via dataset_id
- For a complete list of available tables and their fields, see the Record Schema Reference (Part 5)
- For job-submission/distributed behavior (consumer vs producer store config handoff), see Analytics store in distributed execution
- For focused configure_job_backend(...) guidance, see Job backend configuration
- For developer documentation on implementing extract() for new capabilities, see the Key Concepts page