Technology

TreeCat

A dedicated, standalone catalog engine for large data systems that replaces general-purpose stores and table-format manifest trees with a hierarchical, path-queryable, versioned metadata engine. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://arxiv.org/abs/2503.02956).


Summary

Where it fits

It sits at the catalog layer of a lakehouse, the same slot occupied by Hive Metastore, AWS Glue, or an Iceberg REST catalog. Its thesis is that the catalog is a distinct workload deserving its own engine rather than a side table in Postgres or a tree of JSON manifests on S3. It is research-stage (VLDB 2025), not a drop-in production deployment.

Misconceptions / Traps

It is a catalog engine, not a query or table engine: it does not replace Iceberg or Delta Lake as a table format; it replaces the component that tracks them.
It is an academic prototype from the University of Maryland (UMD); treat it as a design reference, not a shipping product to deploy against production S3 today.
The paper's comparison against Hive Metastore, Delta Lake, and Iceberg is a metadata-serving benchmark, not an end-to-end query benchmark.

Key Connections

alternative_to Hive Metastore: both serve catalog metadata; TreeCat argues that the Metastore's reliance on a general-purpose RDBMS is a fundamental limitation.
solves Metadata Overhead at Scale: its storage format and correlated scan operator target the range-query and versioning costs that dominate large catalogs.
competes_with Iceberg REST Catalog Spec: both define how clients talk to a catalog at scale.

Definition

What it is

TreeCat is a standalone catalog engine purpose-built to serve as the metadata catalog for large data systems, rather than bolting catalog duties onto a general-purpose RDBMS or a table format. It introduces a hierarchical data model with a path-based query language, a storage format tuned for range queries and versioning, and a correlated scan operator for fast catalog lookups. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://arxiv.org/abs/2503.02956).
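To make the idea of a hierarchical, path-addressed, versioned catalog concrete, here is a minimal Python sketch. It is not TreeCat's data model, query language, or storage format; the class `ToyPathCatalog`, its methods, and the example paths are all hypothetical names chosen for illustration. It only shows why keeping paths in sorted order turns "everything under this subtree" into one contiguous range scan, and how per-path version lists allow reads as of an earlier version.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ToyPathCatalog:
    """Hypothetical toy: path-keyed, versioned metadata with prefix range scans."""
    _paths: list = field(default_factory=list)     # all paths, kept sorted
    _versions: dict = field(default_factory=dict)  # path -> [(version, value), ...]
    _clock: int = 0                                # monotonically increasing version

    def put(self, path, value):
        """Append a new version of `path` and return the version number assigned."""
        self._clock += 1
        if path not in self._versions:
            bisect.insort(self._paths, path)
            self._versions[path] = []
        self._versions[path].append((self._clock, value))
        return self._clock

    def get(self, path, as_of=None):
        """Return the newest value with version <= as_of (latest if as_of is None)."""
        as_of = self._clock if as_of is None else as_of
        for version, value in reversed(self._versions.get(path, [])):
            if version <= as_of:
                return value
        return None

    def scan(self, prefix, as_of=None):
        """Range scan: every path under `prefix` visible at `as_of`, in path order."""
        start = bisect.bisect_left(self._paths, prefix)
        out = []
        for path in self._paths[start:]:
            if not path.startswith(prefix):
                break  # sorted order: matching paths are contiguous
            value = self.get(path, as_of)
            if value is not None:
                out.append((path, value))
        return out


# Usage with made-up partition metadata.
cat = ToyPathCatalog()
v1 = cat.put("/db/sales/part=2024-01", {"files": 3})
v2 = cat.put("/db/sales/part=2024-02", {"files": 5})
v3 = cat.put("/db/sales/part=2024-01", {"files": 4})
cat.put("/db/users/part=all", {"files": 1})

latest = cat.scan("/db/sales/")
as_of_v2 = cat.scan("/db/sales/", as_of=v2)
```

Here `latest` reflects the rewritten January partition, while `as_of_v2` still shows the state before that rewrite. A real engine would persist this layout and prune old versions; the sketch keeps everything in memory.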

Why it exists

Self-hosted lakehouses on S3 push enormous metadata volume (partitions, snapshots, file manifests, statistics) through a catalog that is usually a Hive Metastore or a table-format manifest tree. TreeCat argues those approaches have fundamental limitations at scale and that the catalog deserves a dedicated engine — directly relevant to anyone running Iceberg/Delta on object storage and hitting catalog throughput or consistency walls. Per [TreeCat: Standalone Catalog Engine for Large Data Systems](https://www.arxiv.org/pdf/2503.02956).

Primary use cases

Lakehouse catalog serving, partition and snapshot metadata management, high-concurrency multi-client catalog reads/writes, versioned schema and table state, range-scan-heavy metadata queries.

Recent developments

Latest signals

Published in PVLDB Vol. 18 and evaluated against the incumbents. TreeCat (Keonwoo Oh, Pooja Nilangekar, Amol Deshpande, University of Maryland) appeared in Proc. VLDB Endow. 18(11): 4323-4336 (2025), benchmarked against Hive Metastore, Delta Lake, and Iceberg. Per TreeCat (VLDB Endowment).
Novel MVOCC concurrency control for serializable catalog isolation. The paper presents a multi-versioned optimistic concurrency-control protocol that guarantees serializable isolation under many concurrent clients. Per TreeCat: Standalone Catalog Engine for Large Data Systems.
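The general shape of multi-versioned optimistic concurrency control can be sketched in a few lines of Python. This is a generic textbook-style illustration, not the protocol from the TreeCat paper: `ToyMVOCCStore`, `ToyTxn`, and `ConflictError` are hypothetical names, and the sketch assumes point reads only (it does not validate range reads, so it would not catch phantoms). Transactions read from a snapshot, buffer their writes, and at commit time check that nothing they read has since been overwritten.

```python
import threading


class ConflictError(Exception):
    """Raised when commit-time validation finds a stale read."""


class ToyMVOCCStore:
    """Hypothetical toy store: versioned keys plus read-set validation at commit."""

    def __init__(self):
        self._versions = {}            # key -> [(commit_ts, value), ...] ascending
        self._last_commit_ts = 0
        self._lock = threading.Lock()  # serializes begin and validate-and-install

    def begin(self):
        with self._lock:
            return ToyTxn(self, self._last_commit_ts)

    def _read(self, key, ts):
        # Newest version committed at or before the snapshot timestamp.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def _commit(self, txn):
        with self._lock:
            # Validation: abort if any key we read has a newer committed version.
            for key in txn.read_set:
                versions = self._versions.get(key)
                if versions and versions[-1][0] > txn.start_ts:
                    raise ConflictError(key)
            # Install buffered writes as new versions under one commit timestamp.
            self._last_commit_ts += 1
            for key, value in txn.writes.items():
                self._versions.setdefault(key, []).append((self._last_commit_ts, value))
            return self._last_commit_ts


class ToyTxn:
    def __init__(self, store, start_ts):
        self.store, self.start_ts = store, start_ts
        self.read_set, self.writes = set(), {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]  # read your own buffered write
        self.read_set.add(key)
        return self.store._read(key, self.start_ts)

    def write(self, key, value):
        self.writes[key] = value     # buffered until commit

    def commit(self):
        return self.store._commit(self)


# Usage: two clients race to update the same (made-up) schema entry.
store = ToyMVOCCStore()
setup = store.begin()
setup.write("/db/sales/schema", "v1")
setup.commit()

t1, t2 = store.begin(), store.begin()
seen_by_t1 = t1.read("/db/sales/schema")
seen_by_t2 = t2.read("/db/sales/schema")
t1.write("/db/sales/schema", "v2")
t1_commit_ts = t1.commit()

t2.write("/db/sales/schema", "v3")
try:
    t2.commit()
    t2_conflicted = False
except ConflictError:
    t2_conflicted = True  # t2 read a version that t1 has since replaced
```

The second client's commit is rejected because its read is no longer current; it would begin a new transaction and retry against the newer snapshot. Readers are never blocked, which is the usual motivation for pairing multi-versioning with optimistic validation.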

Connections (6)

Outbound (6)

scoped_to (1)

Metadata Management

alternative_to (2)

Hive Metastore, AWS Glue Catalog

competes_with (2)

Apache Polaris, Iceberg REST Catalog Spec

solves (1)

Metadata Overhead at Scale

Resources (4)

Paper (High)

arxiv.org/abs/2503.02956

Primary source: the full TreeCat paper with the hierarchical model, MVOCC protocol, and evaluation.

Paper (High)

dl.acm.org/doi/10.14778/3749646.3749696

Peer-reviewed PVLDB version of record, confirming venue (Vol. 18, pp. 4323-4336) and authorship.

Paper (High)

www.arxiv.org/pdf/2503.02956

Full PDF for the storage-format and concurrency-control details engineers need to evaluate the design.

Spec (High)

dblp.org/rec/journals/pvldb/OhND25.html

Authoritative bibliographic record (authors, venue, pagination) for citation integrity.