perf: related_by_codelist N+1 queries take 40s+ (#71)

## Summary

`CatalogStore.related_by_codelist()` takes 40+ seconds for a typical dataset due to an N+1 query pattern: the SQL join finds related identifiers, then `store.get()` is called individually for each result.

## Root cause

In `store.py:2649-2672`:

```python
def related_by_codelist(self, dataset_id: str) -> list[BaseDataset]:
    ...
    rows = self._conn.execute("""
        SELECT DISTINCT d2.identifier
        FROM _pinax.dataset d1
        JOIN sdmx.dataflow df1 ON df1.urn = d1.sdmx_dataflow_urn
        JOIN sdmx.dsd_component dc1 ON dc1.dsd_urn = df1.dsd_urn
        JOIN sdmx.dsd_component dc2 ON dc2.codelist_urn = dc1.codelist_urn
        JOIN sdmx.dataflow df2 ON df2.dsd_urn = dc2.dsd_urn
        JOIN _pinax.dataset d2 ON d2.sdmx_dataflow_urn = df2.urn
        WHERE d1.identifier = ?
          AND d2.identifier != ?
          AND dc1.codelist_urn IS NOT NULL
    """, [dataset_id, dataset_id]).fetchall()
    return [ds for r in rows if (ds := self.get(BaseDataset, r[0])) is not None]
```

The final line calls `self.get(BaseDataset, r[0])` per row, and each `get()` runs its own SQL query to hydrate the full dataset object. With dozens of related datasets this adds up to 40s+.

## Measured impact

```
Related endpoint: 40.017s (status=200)
```

metadatahub works around this by lazy-loading via HTMX, but the underlying method is still very slow.

## Suggested fix

Batch-hydrate related datasets in a single query (similar to how `_execute_query` fetches multiple datasets). Something like:

```python
identifiers = [r[0] for r in rows]
return self.get_many(BaseDataset, identifiers)
```

Or extend the SQL join to include the dataset columns directly, avoiding the second round-trip entirely.
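To make the batch-hydration idea concrete, here is a minimal, self-contained sketch of what a `get_many` helper could look like. Everything in it is an assumption for illustration: `get_many` does not necessarily exist in the store today, the `Dataset` dataclass and the two-column `dataset` table are simplified stand-ins for `BaseDataset` and `_pinax.dataset`, and the standard-library `sqlite3` module stands in for whatever connection `self._conn` actually wraps. The point is only the shape of the fix: one `IN (...)` query instead of one query per identifier.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical stand-in for BaseDataset (illustration only)."""
    identifier: str
    title: str


def get_many(conn: sqlite3.Connection, identifiers: list[str]) -> list[Dataset]:
    """Hydrate all requested datasets with a single query.

    Hypothetical helper: the table and column names are simplified
    assumptions, not the real _pinax.dataset schema.
    """
    if not identifiers:
        # Avoid emitting an invalid empty "IN ()" clause.
        return []
    # One placeholder per identifier keeps the query parameterised.
    placeholders = ", ".join("?" for _ in identifiers)
    rows = conn.execute(
        f"SELECT identifier, title FROM dataset WHERE identifier IN ({placeholders})",
        identifiers,
    ).fetchall()
    by_id = {ident: Dataset(ident, title) for ident, title in rows}
    # Preserve the caller's order and silently skip unknown identifiers,
    # mirroring the `is not None` filter in the current implementation.
    return [by_id[i] for i in identifiers if i in by_id]


# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (identifier TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [("a", "Dataset A"), ("b", "Dataset B"), ("c", "Dataset C")],
)

related = get_many(conn, ["c", "missing", "a"])
print([d.identifier for d in related])  # prints ['c', 'a']
```

With this shape, the number of round-trips stays at two (the join plus one hydration query) regardless of how many related datasets the join returns. If the identifier list could grow very large, the `IN` list may need to be chunked, since databases typically cap the number of bound parameters per statement.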
