Technology

TreeCat

A dedicated, standalone catalog engine for large data systems that replaces general-purpose stores and table-format manifest trees with a hierarchical, path-queryable, versioned metadata engine. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://arxiv.org/abs/2503.02956).

Summary

Where it fits

It sits at the catalog layer of a lakehouse, the same slot occupied by Hive Metastore, AWS Glue, or an Iceberg REST catalog. Its thesis is that the catalog is a distinct workload deserving its own engine rather than a side table in Postgres or a tree of JSON manifests on S3. It is research-stage (VLDB 2025), not a drop-in production deployment.

Misconceptions / Traps
  • It is a catalog engine, not a query/table engine — it does not replace Iceberg/Delta as a table format, it replaces the thing that tracks them.
  • It is an academic prototype from the University of Maryland; treat it as a design reference, not a product ready to deploy against production S3 today.
  • The Hive Metastore / Delta / Iceberg comparison is a metadata-serving benchmark, not an end-to-end query benchmark.

Key Connections
  • alternative_to Hive Metastore — both serve catalog metadata; TreeCat argues the Metastore's general-purpose-RDBMS backing is a fundamental limitation.
  • solves Metadata Overhead at Scale — its storage format and correlated scan target range-query and versioning costs that dominate large catalogs.
  • competes_with Iceberg REST Catalog Spec — both define how clients talk to a catalog at scale.

Definition

What it is

TreeCat is a standalone catalog engine purpose-built to serve as the metadata catalog for large data systems, rather than bolting catalog duties onto a general-purpose RDBMS or a table format. It introduces a hierarchical data model with a path-based query language, a storage format tuned for range queries and versioning, and a correlated scan operator for fast catalog lookups. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://arxiv.org/abs/2503.02956).
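
To make the idea of a hierarchical, path-addressed, versioned catalog concrete, here is a minimal Python sketch. It is not TreeCat's data model, query language, or storage format; the `PathCatalog` class, its method names, and the example paths are all hypothetical. It only illustrates why keeping path keys in sorted order makes "everything under this subtree" a cheap range scan, and how per-key version lists allow reads as of an earlier version.

```python
import bisect


class PathCatalog:
    """Toy catalog (hypothetical): sorted path keys, each with a version history."""

    def __init__(self):
        self._paths = []     # all known paths, kept in sorted order
        self._history = {}   # path -> list of (version, value), ascending by version
        self._version = 0    # monotonically increasing write counter

    def put(self, path, value):
        """Record a new version of `path` and return its version number."""
        self._version += 1
        if path not in self._history:
            bisect.insort(self._paths, path)
            self._history[path] = []
        self._history[path].append((self._version, value))
        return self._version

    def get(self, path, as_of=None):
        """Return the newest value of `path` visible at `as_of` (default: latest)."""
        as_of = self._version if as_of is None else as_of
        visible = None
        for version, value in self._history.get(path, []):
            if version > as_of:
                break
            visible = value
        return visible

    def scan(self, prefix, as_of=None):
        """Range scan: every (path, value) under `prefix`, in path order."""
        prefix = prefix.rstrip("/") + "/"
        start = bisect.bisect_left(self._paths, prefix)
        results = []
        for path in self._paths[start:]:
            if not path.startswith(prefix):
                break  # sorted order means the subtree is one contiguous run
            value = self.get(path, as_of)
            if value is not None:
                results.append((path, value))
        return results


catalog = PathCatalog()
v1 = catalog.put("/db/sales/orders/part=2024-01", {"files": 3})
v2 = catalog.put("/db/sales/orders/part=2024-02", {"files": 5})
v3 = catalog.put("/db/sales/orders/part=2024-01", {"files": 4})
latest_partitions = catalog.scan("/db/sales/orders")
partitions_at_v1 = catalog.scan("/db/sales/orders", as_of=v1)
```

In this sketch, `latest_partitions` holds both partitions at their newest versions, while `partitions_at_v1` holds only the first partition as it was originally written. A real engine would persist the sorted keys and version chains on disk rather than in Python lists.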

Why it exists

Self-hosted lakehouses on S3 push enormous metadata volume (partitions, snapshots, file manifests, statistics) through a catalog that is usually a Hive Metastore or a table-format manifest tree. TreeCat argues those approaches have fundamental limitations at scale and that the catalog deserves a dedicated engine — directly relevant to anyone running Iceberg/Delta on object storage and hitting catalog throughput or consistency walls. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://www.arxiv.org/pdf/2503.02956).

Primary use cases

Lakehouse catalog serving, partition and snapshot metadata management, high-concurrency multi-client catalog reads/writes, versioned schema and table state, range-scan-heavy metadata queries.

Recent developments

Latest signals
  • Published in PVLDB Vol. 18 and evaluated against the incumbents. TreeCat (Keonwoo Oh, Pooja Nilangekar, Amol Deshpande, University of Maryland) appeared in Proc. VLDB Endow. 18(11): 4323-4336 (2025), benchmarked against Hive Metastore, Delta Lake, and Iceberg. Per TreeCat (VLDB Endowment).
  • Novel MVOCC concurrency control for serializable catalog isolation. The paper presents a multi-versioned optimistic concurrency-control protocol that guarantees serializable isolation under many concurrent clients. Per TreeCat: Standalone Catalog Engine for Large Data Systems.
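
The general shape of multi-versioned optimistic concurrency control can be sketched in a few lines of Python. This is a generic textbook-style illustration, not the protocol from the paper: the `VersionedStore`, `Transaction`, and `ConflictError` names and the example key are hypothetical. Each transaction reads from a snapshot taken at its start, buffers its writes, and at commit time is validated by checking that no key it read has since received a newer committed version. The sketch tracks only point reads, so it does not address range (phantom) conflicts.

```python
import threading


class ConflictError(Exception):
    """Raised when commit-time validation fails (hypothetical name)."""


class VersionedStore:
    def __init__(self):
        self._versions = {}            # key -> list of (commit_ts, value), ascending
        self._last_commit = 0          # timestamp of the newest committed transaction
        self._lock = threading.Lock()  # guards reads, validation, and install

    def begin(self):
        """Start a transaction whose snapshot is the current committed state."""
        with self._lock:
            return Transaction(self, self._last_commit)

    def _read(self, key, ts):
        """Return the newest value of `key` committed at or before `ts`."""
        with self._lock:
            visible = None
            for commit_ts, value in self._versions.get(key, []):
                if commit_ts > ts:
                    break
                visible = value
            return visible

    def _commit(self, txn):
        with self._lock:
            # Validation: abort if any key we read has a version newer than our snapshot.
            for key in txn.read_set:
                versions = self._versions.get(key)
                if versions and versions[-1][0] > txn.start_ts:
                    raise ConflictError(key)
            # Install buffered writes as new versions under one commit timestamp.
            self._last_commit += 1
            for key, value in txn.writes.items():
                self._versions.setdefault(key, []).append((self._last_commit, value))
            return self._last_commit


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.read_set = set()  # keys read from the snapshot
        self.writes = {}       # buffered writes, invisible to others until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]  # read-your-own-writes
        self.read_set.add(key)
        return self.store._read(key, self.start_ts)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store._commit(self)
```

If two transactions both read the same schema entry and then both try to update it, the first commit succeeds and the second raises `ConflictError`, so the client must retry against the new state. Readers never block writers here because old versions stay available to transactions with older snapshots.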
