TreeCat
A dedicated, standalone catalog engine for large data systems that replaces general-purpose stores and table-format manifest trees with a hierarchical, path-queryable, versioned metadata engine. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://arxiv.org/abs/2503.02956).
Summary
TreeCat sits at the catalog layer of a lakehouse, the same slot occupied by Hive Metastore, AWS Glue, or an Iceberg REST catalog. Its thesis is that the catalog is a distinct workload deserving its own engine rather than a side table in Postgres or a tree of JSON manifests on S3. It is a research-stage system (published in PVLDB Vol. 18, 2025), not a drop-in production deployment.
- It is a catalog engine, not a query or table engine: it does not replace Iceberg or Delta as a table format; it replaces the component that tracks them.
- It is an academic prototype from the University of Maryland; treat it as a design reference, not a shipping product to deploy against production S3 today.
- The Hive Metastore / Delta / Iceberg comparison is a metadata-serving benchmark, not an end-to-end query benchmark.
- alternative_to: Hive Metastore. Both serve catalog metadata; TreeCat argues the Metastore's general-purpose-RDBMS backing is a fundamental limitation.
- solves: Metadata Overhead at Scale. Its storage format and correlated scan target the range-query and versioning costs that dominate large catalogs.
- competes_with: Iceberg REST Catalog Spec. Both define how clients talk to a catalog at scale.
Definition
TreeCat is a standalone catalog engine purpose-built to serve as the metadata catalog for large data systems, rather than bolting catalog duties onto a general-purpose RDBMS or a table format. It introduces a hierarchical data model with a path-based query language, a storage format tuned for range queries and versioning, and a correlated scan operator for fast catalog lookups. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://arxiv.org/abs/2503.02956).
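To make the idea of a hierarchical, path-addressed, versioned catalog concrete, here is a minimal in-memory sketch. It is not TreeCat's data model, storage format, or query language: the class `VersionedPathStore`, its methods, and the example paths are all hypothetical, chosen only to show how sorted path keys turn a subtree lookup into a range scan and how per-key version lists allow reads as of an earlier version.

```python
import bisect
from typing import Any, Optional


class VersionedPathStore:
    """Toy catalog (hypothetical, not TreeCat's API): sorted path keys,
    each mapped to an append-only list of (version, value) pairs."""

    def __init__(self) -> None:
        self._paths = []      # sorted list of paths, enables prefix range scans
        self._versions = {}   # path -> [(version, value), ...] in ascending order
        self._clock = 0       # monotonically increasing version counter

    def put(self, path: str, value: Any) -> int:
        """Append a new version for `path` and return its version number."""
        self._clock += 1
        if path not in self._versions:
            bisect.insort(self._paths, path)
            self._versions[path] = []
        self._versions[path].append((self._clock, value))
        return self._clock

    def get(self, path: str, as_of: Optional[int] = None) -> Any:
        """Newest value with version <= as_of (latest if as_of is None)."""
        for version, value in reversed(self._versions.get(path, [])):
            if as_of is None or version <= as_of:
                return value
        return None

    def scan(self, prefix: str, as_of: Optional[int] = None) -> list:
        """Range scan: every path under `prefix`, read at a single version."""
        start = bisect.bisect_left(self._paths, prefix)
        out = []
        for path in self._paths[start:]:
            if not path.startswith(prefix):
                break  # sorted order means the matching range has ended
            value = self.get(path, as_of)
            if value is not None:
                out.append((path, value))
        return out


store = VersionedPathStore()
v1 = store.put("/sales/orders/part=2024-01", {"files": 3})
v2 = store.put("/sales/orders/part=2024-02", {"files": 5})
v3 = store.put("/sales/orders/part=2024-01", {"files": 4})
store.put("/sales/returns/part=2024-01", {"files": 1})

# Read the orders subtree as it looked at version v2.
snapshot = store.scan("/sales/orders/", as_of=v2)
```

Because all keys sharing a prefix are contiguous in sorted order, the scan touches only the requested subtree. A real engine would persist this ordering on disk and garbage-collect old versions, neither of which the sketch attempts.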
Self-hosted lakehouses on S3 push enormous metadata volume (partitions, snapshots, file manifests, statistics) through a catalog that is usually a Hive Metastore or a table-format manifest tree. TreeCat argues those approaches have fundamental limitations at scale and that the catalog deserves a dedicated engine — directly relevant to anyone running Iceberg/Delta on object storage and hitting catalog throughput or consistency walls. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://www.arxiv.org/pdf/2503.02956).
Relevant workloads include lakehouse catalog serving, partition and snapshot metadata management, high-concurrency multi-client catalog reads and writes, versioned schema and table state, and range-scan-heavy metadata queries.
Recent developments
- Published in PVLDB Vol. 18 and evaluated against the incumbents. TreeCat (Keonwoo Oh, Pooja Nilangekar, Amol Deshpande, University of Maryland) appeared in Proc. VLDB Endow. 18(11): 4323-4336 (2025), benchmarked against Hive Metastore, Delta Lake, and Iceberg. Per TreeCat (VLDB Endowment).
- Novel MVOCC concurrency control for serializable catalog isolation. The paper presents a multi-versioned optimistic concurrency-control protocol that guarantees serializable isolation under many concurrent clients. Per TreeCat: Standalone Catalog Engine for Large Data Systems.
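The general idea behind multi-versioned optimistic concurrency control can be sketched as follows. This is a generic textbook-style illustration, not the protocol from the paper: the names `MVStore`, `Txn`, and `Conflict` are hypothetical, commits are assumed to run one at a time, and only point reads are validated (range reads and phantoms are ignored). Each transaction reads from the snapshot that existed when it began, buffers its writes, and at commit aborts if any key it read has since been overwritten.

```python
class Conflict(Exception):
    """Raised when commit-time validation fails (hypothetical name)."""


class MVStore:
    def __init__(self):
        self.versions = {}    # key -> [(commit_ts, value), ...] ascending
        self.last_commit = 0  # timestamp of the most recent commit

    def begin(self):
        return Txn(self, self.last_commit)

    def read_at(self, key, ts):
        """Newest committed value for `key` with commit_ts <= ts."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def latest_ts(self, key):
        history = self.versions.get(key)
        return history[-1][0] if history else 0


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts  # snapshot this transaction reads from
        self.read_set = set()
        self.writes = {}          # buffered until commit

    def read(self, key):
        if key in self.writes:    # read-your-own-writes
            return self.writes[key]
        self.read_set.add(key)
        return self.store.read_at(key, self.start_ts)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validation: abort if any key we read was committed after our snapshot.
        for key in self.read_set:
            if self.store.latest_ts(key) > self.start_ts:
                raise Conflict(key)
        self.store.last_commit += 1
        ts = self.store.last_commit
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((ts, value))
        return ts


store = MVStore()
setup = store.begin()
setup.write("/db/t/schema", "v1")
setup.commit()

a, b = store.begin(), store.begin()  # both start from the same snapshot
a.read("/db/t/schema"); a.write("/db/t/schema", "v2")
b.read("/db/t/schema"); b.write("/db/t/schema", "v3")
a.commit()
try:
    b.commit()
    b_aborted = False
except Conflict:
    b_aborted = True  # b read a value that a has since replaced
```

Readers never block writers here because old versions stay readable; the cost is that a conflicting transaction discovers the conflict only at commit and must retry.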
Resources
- Primary source: the full TreeCat paper with the hierarchical model, MVOCC protocol, and evaluation.
- Peer-reviewed PVLDB version of record, confirming venue (Vol. 18, pp. 4323-4336) and authorship.
- Full PDF for the storage-format and concurrency-control details engineers need to evaluate the design.
- Authoritative bibliographic record (authors, venue, pagination) for citation integrity.