Analytics Store Tutorial¶
A guide to saving and querying capability results with the checkmaite analytics store.
NOTE: This tutorial demonstrates the analytics store using object detection (OD) datasets and the
DataevalCleaning capability. The analytics store works identically for image classification (IC) capabilities.
This tutorial uses direct local writes (store.write([run])). For distributed job submission, the responsibilities split:
- the client chooses the durable store location,
- the worker needs that information to persist results,
- and the client later reads from that same location.
So distributed runs require explicit configuration:
from checkmaite.jobs import configure_job_backend
configure_job_backend(
    "ray",
    analytics_store={"backend": "parquet", "uri": "./analytics_store"},
)
checkmaite.jobs._store defines AnalyticsStoreConfig; the job backend forwards this config on submissions so workers write to the intended store. See Analytics store in distributed execution and Job backend configuration.
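The three-way split described above can be sketched with the standard library alone. Everything in this block is a hypothetical stand-in: `StoreConfig`, `submit`, and `worker` are illustrative names, not checkmaite's API, and JSON files replace Parquet so the sketch has no dependencies. It only shows the handoff: the client picks the location, the worker learns it from the job payload, and the client reads from the same place afterwards.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class StoreConfig:
    """Hypothetical stand-in for the store config forwarded with each job."""

    backend: str
    uri: str


def submit(config: StoreConfig, run_uid: str, dataset_id: str) -> str:
    # Client side: attach the chosen store location to the job payload.
    return json.dumps(
        {"run_uid": run_uid, "dataset_id": dataset_id, "analytics_store": asdict(config)}
    )


def worker(payload: str) -> None:
    # Worker side: the store location comes from the payload, not local state.
    job = json.loads(payload)
    config = StoreConfig(**job["analytics_store"])
    out_dir = Path(config.uri)
    out_dir.mkdir(parents=True, exist_ok=True)
    result = {"dataset_id": job["dataset_id"], "image_count": 4}  # placeholder result
    # JSON keeps the sketch dependency-free; the backend field is carried but unused.
    (out_dir / f"{job['run_uid']}.json").write_text(json.dumps(result))


with tempfile.TemporaryDirectory() as tmp:
    config = StoreConfig(backend="parquet", uri=str(Path(tmp) / "analytics_store"))
    worker(submit(config, "uid-a", "coco-example"))
    # Client side again: read back from the same location it chose earlier.
    stored = [json.loads(p.read_text()) for p in Path(config.uri).glob("*.json")]

print(stored)
```

If the worker guessed its own location instead of reading it from the payload, the final read would find nothing, which is why the configuration has to be explicit.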
What is the Analytics Store?¶
When you run a capability (such as DataevalCleaning or DataevalBias), the results are Python objects in memory. They disappear when your notebook restarts. The analytics store solves this by:
- Persisting results via a pluggable storage backend (Parquet by default) so they survive notebook restarts
- Enabling SQL queries across runs, datasets, and capabilities
- Supporting external tools — with the default Parquet backend, DuckDB, Spark, pandas, or any Parquet reader can query the same files
Each capability extracts scalar summary metrics from its rich outputs (e.g., duplicate count, outlier ratio) into flat records that map directly to SQL rows.
The store manages two kinds of tables:
- Capability tables (e.g., dataeval_cleaning): one per capability, containing the extracted metrics
- runs table: auto-populated metadata mapping each run to its datasets, models, and metrics
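To make the idea of a flat record concrete, here is a minimal sketch of reducing a rich result to scalar fields. `CleaningOutput`, `CleaningRecord`, and `extract_record` are hypothetical names used for illustration, not checkmaite's actual types, and the duplicate-counting rule is an assumption. Only the field names mirror columns of the dataeval_cleaning table.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class CleaningOutput:
    """Hypothetical rich result: nested, per-image detail."""

    image_count: int
    exact_duplicate_groups: list[list[int]] = field(default_factory=list)
    outlier_indices: list[int] = field(default_factory=list)


@dataclass
class CleaningRecord:
    """Hypothetical flat record: scalar fields only, so it maps to one SQL row."""

    dataset_id: str
    image_count: int
    exact_duplicate_count: int
    exact_duplicate_ratio: float
    image_outlier_count: int
    image_outlier_ratio: float


def extract_record(dataset_id: str, output: CleaningOutput) -> CleaningRecord:
    # Assumption: every image after the first in a group counts as a duplicate.
    dup_count = sum(len(group) - 1 for group in output.exact_duplicate_groups)
    outlier_count = len(output.outlier_indices)
    n = output.image_count
    return CleaningRecord(
        dataset_id=dataset_id,
        image_count=n,
        exact_duplicate_count=dup_count,
        exact_duplicate_ratio=dup_count / n if n else 0.0,
        image_outlier_count=outlier_count,
        image_outlier_ratio=outlier_count / n if n else 0.0,
    )


output = CleaningOutput(image_count=4, exact_duplicate_groups=[[0, 2]], outlier_indices=[3])
row = asdict(extract_record("coco-example", output))
print(row)
```

Because every value in `row` is a string or a number, it can be stored as one row of a table without any nested types.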
Setup¶
import tempfile
from pathlib import Path
from checkmaite.core.analytics_store import AnalyticsStore, ParquetBackend
from checkmaite.core.object_detection import DataevalCleaning
from checkmaite.core.object_detection.dataset_loaders import CocoDetectionDataset
# Create a store backed by a temporary directory
# In practice, use a persistent path like "./analytics_store"
store_dir = tempfile.mkdtemp(prefix="analytics_store_guide_")
store = AnalyticsStore(ParquetBackend(store_dir))
print(f"Store created at: {store_dir}")
Store created at: /tmp/analytics_store_guide_sv6g64lo
Running a Capability and Saving Results¶
The typical workflow is: load a dataset, run a capability, then write the results to the store.
We use the CocoDetectionDataset wrapper to load a small subset of the COCO 2017 dataset included in the repository's test data.
# Load the example COCO dataset
BASE_DIR = Path.cwd().parents[1]
dataset_root = BASE_DIR / "tests/data_for_tests/coco_dataset"
dataset_ann = dataset_root / "ann_file.json"
dataset = CocoDetectionDataset(
    root=str(dataset_root),
    ann_file=str(dataset_ann),
    dataset_id="coco-example",
)
print(f"Loaded dataset: {dataset.metadata['id']} ({len(dataset)} images)")
# Run DataevalCleaning
capability = DataevalCleaning()
run = capability.run(datasets=[dataset], use_cache=False)
# Write to the store
store.write([run])
# Inspect what was written
# list_tables() shows which capabilities have written results
print(f"\nTables: {store.list_tables()}")
# describe_table() shows the available columns and types — useful for writing SQL queries
print(f"\nSchema for 'dataeval_cleaning':")
for col, dtype in store.describe_table("dataeval_cleaning").items():
    print(f" {col}: {dtype}")
Loaded dataset: coco-example (4 images)

Tables: ['dataeval_cleaning', 'runs']

Schema for 'dataeval_cleaning':
 run_uid: String
 created_at: Datetime(time_unit='us', time_zone='UTC')
 dataset_id: String
 exact_duplicate_count: Int64
 exact_duplicate_ratio: Float64
 near_duplicate_count: Int64
 near_duplicate_ratio: Float64
 image_outlier_count: Int64
 image_outlier_ratio: Float64
 class_count: Int64
 label_count: Int64
 image_count: Int64
 target_outlier_count: Int64
 target_outlier_ratio: Float64
 mean_width: Float64
 mean_height: Float64
 std_aspect_ratio: Float64
 mean_brightness: Float64
 mean_contrast: Float64
 mean_sharpness: Float64
 class_imbalance_ratio: Float64
 min_class_image_count: Int64
 max_class_image_count: Int64
 mean_labels_per_image: Float64
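One way to use a schema like this is to build a query from it rather than typing column names by hand. The sketch below hard-codes a subset of the schema printed above as a plain dict of column name to dtype name (an assumption about the mapping's shape, based on the `.items()` loop) and selects every column whose name ends in `_ratio`.

```python
# Subset of the schema printed above, as a column-name -> dtype-name mapping.
schema = {
    "run_uid": "String",
    "dataset_id": "String",
    "exact_duplicate_count": "Int64",
    "exact_duplicate_ratio": "Float64",
    "near_duplicate_ratio": "Float64",
    "image_outlier_ratio": "Float64",
    "target_outlier_ratio": "Float64",
    "mean_brightness": "Float64",
}

# Keep only the ratio columns, preserving schema order.
ratio_columns = [name for name in schema if name.endswith("_ratio")]
query = f"SELECT dataset_id, {', '.join(ratio_columns)} FROM dataeval_cleaning"
print(query)
```

With the store from this tutorial, `store.describe_table("dataeval_cleaning")` would take the place of the hard-coded dict, and the resulting string could be passed to `store.query_sql(...)`.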
# View all cleaning records
df = store.query_sql("SELECT * FROM dataeval_cleaning")
print(df)
shape: (1, 24)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ run_uid   ┆ created_a ┆ dataset_i ┆ exact_dup ┆ … ┆ class_imb ┆ min_class ┆ max_class ┆ mean_lab │
│ ---       ┆ t         ┆ d         ┆ licate_co ┆   ┆ alance_ra ┆ _image_co ┆ _image_co ┆ els_per_ │
│ str       ┆ ---       ┆ ---       ┆ unt       ┆   ┆ tio       ┆ unt       ┆ unt       ┆ image    │
│           ┆ datetime[ ┆ str       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ μs, UTC]  ┆           ┆ i64       ┆   ┆ f64       ┆ i64       ┆ i64       ┆ f64      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 1f20f61b0 ┆ 2026-05-2 ┆ coco-exam ┆ 0         ┆ … ┆ 3.0       ┆ 1         ┆ 3         ┆ 14.25    │
│ 72156430e ┆ 2 18:40:2 ┆ ple       ┆           ┆   ┆           ┆           ┆           ┆          │
│ 1bfc465ba ┆ 1.156592  ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ a1c…      ┆ UTC       ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘
# Select specific fields
df = store.query_sql("""
SELECT
dataset_id,
exact_duplicate_count,
exact_duplicate_ratio,
image_outlier_count,
image_outlier_ratio,
mean_brightness
FROM dataeval_cleaning
""")
print(df)
shape: (1, 6)
┌──────────────┬────────────────┬────────────────┬────────────────┬────────────────┬───────────────┐
│ dataset_id   ┆ exact_duplicat ┆ exact_duplicat ┆ image_outlier_ ┆ image_outlier_ ┆ mean_brightne │
│ ---          ┆ e_count        ┆ e_ratio        ┆ count          ┆ ratio          ┆ ss            │
│ str          ┆ ---            ┆ ---            ┆ ---            ┆ ---            ┆ ---           │
│              ┆ i64            ┆ f64            ┆ i64            ┆ f64            ┆ f64           │
╞══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═══════════════╡
│ coco-example ┆ 0              ┆ 0.0            ┆ 0              ┆ 0.0            ┆ 0.305882      │
└──────────────┴────────────────┴────────────────┴────────────────┴────────────────┴───────────────┘
Comparing Across Runs¶
Every run gets a unique run_uid and the store deduplicates by this key. You can run the same capability on different datasets and compare results side by side.
The store also auto-populates a runs table that maps each run_uid to its datasets, models, and metrics.
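Deduplication by run_uid can be pictured with a small sketch. How the store actually resolves duplicates is not described here, so the rule below (keep the most recently written record per run_uid) is an assumption, and the `deduplicate` helper and the record values are hypothetical.

```python
records = [
    {"run_uid": "uid-a", "dataset_id": "coco-example", "written_order": 1},
    {"run_uid": "uid-b", "dataset_id": "coco-resized", "written_order": 2},
    {"run_uid": "uid-a", "dataset_id": "coco-example", "written_order": 3},  # same run written again
]


def deduplicate(rows: list[dict]) -> list[dict]:
    # Assumption: when a run_uid appears more than once, the latest write wins.
    latest: dict[str, dict] = {}
    for row in sorted(rows, key=lambda r: r["written_order"]):
        latest[row["run_uid"]] = row
    return list(latest.values())


unique = deduplicate(records)
print(unique)
```

Under this rule, writing the same run twice does not inflate row counts in later queries, since each run_uid contributes a single row.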
# Run cleaning on a second dataset
dataset_root_2 = BASE_DIR / "tests/data_for_tests/coco_resized_val2017"
dataset_ann_2 = dataset_root_2 / "instances_val2017_resized_6.json"
dataset_2 = CocoDetectionDataset(
    root=str(dataset_root_2),
    ann_file=str(dataset_ann_2),
    dataset_id="coco-resized",
)
run_2 = capability.run(datasets=[dataset_2], use_cache=False)
store.write([run_2])
# Compare cleaning results across both datasets
df = store.query_sql("""
SELECT
dataset_id,
exact_duplicate_count,
image_outlier_count,
image_outlier_ratio,
mean_brightness
FROM dataeval_cleaning
ORDER BY dataset_id
""")
print(df)
shape: (2, 5)
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬─────────────────┐
│ dataset_id   ┆ exact_duplicate_cou ┆ image_outlier_count ┆ image_outlier_ratio ┆ mean_brightness │
│ ---          ┆ nt                  ┆ ---                 ┆ ---                 ┆ ---             │
│ str          ┆ ---                 ┆ i64                 ┆ f64                 ┆ f64             │
│              ┆ i64                 ┆                     ┆                     ┆                 │
╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪═════════════════╡
│ coco-example ┆ 0                   ┆ 0                   ┆ 0.0                 ┆ 0.305882        │
│ coco-resized ┆ 0                   ┆ 0                   ┆ 0.0                 ┆ 0.35098         │
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴─────────────────┘
# Check the auto-populated runs table
df_runs = store.query_sql("""
SELECT run_uid, capability_table, entity_type, entity_id
FROM runs
ORDER BY entity_id
""")
print(df_runs)
shape: (2, 4)
┌─────────────────────────────────┬───────────────────┬─────────────┬──────────────┐
│ run_uid                         ┆ capability_table  ┆ entity_type ┆ entity_id    │
│ ---                             ┆ ---               ┆ ---         ┆ ---          │
│ str                             ┆ str               ┆ str         ┆ str          │
╞═════════════════════════════════╪═══════════════════╪═════════════╪══════════════╡
│ 1f20f61b072156430e1bfc465baa1c… ┆ dataeval_cleaning ┆ dataset     ┆ coco-example │
│ 16aa32205722d84afb23fe421301bf… ┆ dataeval_cleaning ┆ dataset     ┆ coco-resized │
└─────────────────────────────────┴───────────────────┴─────────────┴──────────────┘
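Since both tables share run_uid, they can be joined to go from an entity to its metrics. The sketch below uses Python's built-in sqlite3 as a stand-in SQL engine, not the store itself. The table and column names match the outputs above, while the `uid-a` and `uid-b` values are placeholders for the real, truncated run_uid strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataeval_cleaning (run_uid TEXT, dataset_id TEXT, mean_brightness REAL);
    CREATE TABLE runs (run_uid TEXT, capability_table TEXT, entity_type TEXT, entity_id TEXT);
    """
)
conn.executemany(
    "INSERT INTO dataeval_cleaning VALUES (?, ?, ?)",
    [("uid-a", "coco-example", 0.305882), ("uid-b", "coco-resized", 0.35098)],
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("uid-a", "dataeval_cleaning", "dataset", "coco-example"),
        ("uid-b", "dataeval_cleaning", "dataset", "coco-resized"),
    ],
)

# Look up each dataset entity in runs, then pull its metrics via run_uid.
joined = conn.execute(
    """
    SELECT r.entity_id, r.capability_table, c.mean_brightness
    FROM runs AS r
    JOIN dataeval_cleaning AS c ON c.run_uid = r.run_uid
    WHERE r.entity_type = 'dataset'
    ORDER BY r.entity_id
    """
).fetchall()
conn.close()

for row in joined:
    print(row)
```

The same SELECT statement could presumably be passed to `store.query_sql(...)`, assuming the store's SQL dialect accepts this JOIN syntax.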
Using Results Outside Python¶
With the default ParquetBackend, the store writes plain Parquet files with scalar columns only. Any tool that reads Parquet can query them directly — no Python required.
DuckDB example (run from any SQL client):
SELECT dataset_id, exact_duplicate_ratio, image_outlier_ratio
FROM read_parquet('./analytics_store/dataeval_cleaning/*.parquet')
ORDER BY image_outlier_ratio DESC;
The file layout on disk:
analytics_store/
  dataeval_cleaning/
    1706000000000_a1b2c3d4.parquet
  runs/
    1706000000000_e5f6a7b8.parquet
Each write() call creates one Parquet file per table. File names are {timestamp_ms}_{uuid}.parquet.
# Show actual files on disk
for p in sorted(Path(store_dir).rglob("*.parquet")):
    print(f" {p.relative_to(store_dir)} ({p.stat().st_size:,} bytes)")
 dataeval_cleaning/1779475221157_ace31ecf.parquet (9,607 bytes)
 dataeval_cleaning/1779475221276_85896084.parquet (9,607 bytes)
 runs/1779475221160_aa465cb2.parquet (3,631 bytes)
 runs/1779475221279_9a2324f6.parquet (3,631 bytes)
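The {timestamp_ms}_{uuid}.parquet naming pattern means a file's write time can be recovered from its name alone. The sketch below shows one way such names could be produced and parsed. `make_file_name` and `parse_written_at` are hypothetical helpers, and taking the first 8 hex characters of a UUID4 is an assumption inferred from the names shown above, not a confirmed detail of the store.

```python
import time
import uuid
from datetime import datetime, timezone


def make_file_name() -> str:
    # Assumption: millisecond Unix timestamp plus the first 8 hex chars of a UUID4.
    timestamp_ms = time.time_ns() // 1_000_000
    return f"{timestamp_ms}_{uuid.uuid4().hex[:8]}.parquet"


def parse_written_at(file_name: str) -> datetime:
    # Everything before the first underscore is the timestamp in milliseconds.
    timestamp_ms = int(file_name.split("_", 1)[0])
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


name = make_file_name()
written_at = parse_written_at("1706000000000_a1b2c3d4.parquet")
print(name)
print(written_at.isoformat())
```

Because the timestamp comes first, sorting file names alphabetically also sorts them by write time, as long as the timestamps have the same number of digits.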
Next Steps¶
In this guide you learned how to:
- Create an analytics store with a storage backend (Parquet by default)
- Run a capability and write its results to the store
- Query results via SQL and compare across datasets
- Access the stored data from external tools (e.g., DuckDB, Spark)
To learn more:
- Run other capabilities (DataevalBias, DataevalFeasibility, DataevalShift, MaiteEvaluation, etc.) and write their results to the same store for cross-capability SQL JOINs via dataset_id
- For a complete list of available tables and their fields, see the Record Schema Reference (Part 5)
- For job-submission/distributed behavior (consumer vs producer store config handoff), see Analytics store in distributed execution
- For focused configure_job_backend(...) guidance, see Job backend configuration
- For developer documentation on implementing extract() for new capabilities, see the Key Concepts page