Architecture

Direct Corpus Interaction (DCI)

An agentic-search retrieval method where the LLM uses terminal primitives (grep, file reads, scripts) to interrogate the raw corpus directly, with no embeddings, vector index, or retrieval API. Per [Beyond Semantic Similarity (arXiv)](https://arxiv.org/abs/2605.05242).

Summary

Where it fits

It is an alternative to vector RAG in the retrieval layer of agentic systems. Instead of a pre-built index returning fixed chunks, the agent reasons about how to query files and controls its own retrieval resolution. It targets corpora that evolve faster than an index can be rebuilt — the common case for local-first, S3-backed data.
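The idea of agent-controlled retrieval resolution can be illustrated with a minimal sketch. It assumes a corpus of UTF-8 text files on local disk; `grep_corpus` and `read_window` are hypothetical helper names chosen for illustration and are not taken from the paper or its reference implementation. The first locates matching lines, the second lets the caller decide how much surrounding context to pull in.

```python
import re
from pathlib import Path


def grep_corpus(root, pattern, max_hits=20):
    """Scan every file under root and return (path, line_number, line) for regex matches."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), lineno, line))
                if len(hits) >= max_hits:
                    return hits
    return hits


def read_window(path, lineno, radius=2):
    """Return the lines within `radius` of `lineno`; the caller picks the resolution."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    start = max(0, lineno - 1 - radius)
    end = min(len(lines), lineno + radius)
    return "\n".join(lines[start:end])
```

An agent might call `grep_corpus` with a narrow pattern first, then widen `radius` only for the hits that look relevant, rather than receiving chunks whose boundaries were fixed at index time.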

Misconceptions / Traps
  • DCI is not "RAG with better embeddings" — it removes the embedding model and vector index entirely.
  • It shifts cost from offline indexing to online agent tokens/tool calls; cheaper inference (see DeepSeek V4) is what makes it economically viable at scale.
  • Reported wins are on specific benchmarks (BRIGHT, BEIR, BrowseComp-Plus, multi-hop QA); generalization to arbitrary production corpora is not guaranteed.

Key Connections
  • alternative_to RAG over Structured Data — DCI is positioned explicitly as the post-vector-RAG retrieval paradigm.
  • bypasses Semantic Search — it deliberately drops embedding similarity in favor of direct lexical/tool-driven navigation.
  • solves Context Bottleneck — agent-controlled resolution lets it pull exactly what it needs instead of fixed chunks.
  • augments Model Context Protocol (MCP) — terminal/file tooling DCI relies on is the kind of capability MCP servers expose.

Definition

What it is

Direct Corpus Interaction is an agentic retrieval paradigm in which an LLM agent searches the raw, unprocessed corpus directly using general-purpose terminal tools (grep, file reads, shell commands, lightweight scripts) instead of an embedding model, vector index, or retrieval API. It was introduced in the 2026 paper "Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction." Its core insight is that retrieval quality depends not only on the model's reasoning but also on the resolution of the interface through which it touches the corpus. Per [Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction](https://arxiv.org/abs/2605.05242).
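As a rough sketch of what "terminal tools as the retrieval interface" could look like in practice, the snippet below wraps a shell command as a single tool call. It assumes a POSIX environment with `grep` and `head` available; `run_tool`, `run_trajectory`, and the `max_chars` cap are hypothetical names introduced here and do not describe the paper's DCI-Agent-Lite implementation. The fixed command list stands in for the model, which in a real agent would choose each command after reading the previous observation.

```python
import subprocess


def run_tool(command, corpus_root, max_chars=2000, timeout=10):
    """Run one shell command inside the corpus directory and return capped output."""
    result = subprocess.run(
        command,
        shell=True,          # assumption: commands come from a trusted, sandboxed agent
        cwd=corpus_root,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Prefer stdout; fall back to stderr so the agent can see why a command failed.
    output = result.stdout if result.stdout else result.stderr
    if len(output) > max_chars:
        output = output[:max_chars] + "\n[output truncated]"
    return output


def run_trajectory(commands, corpus_root):
    """Replay a fixed list of commands and collect (command, observation) pairs."""
    return [(cmd, run_tool(cmd, corpus_root)) for cmd in commands]
```

For example, `run_trajectory(['grep -rn "needle" .', 'head -n 1 report.txt'], corpus_root)` would first locate a term and then read the top of a file it appeared in. Capping the output is one way to keep each observation from consuming too much of the context window; running model-chosen shell commands should only be done inside a sandbox.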

Why it exists

Conventional RAG over data stored on S3 requires an offline indexing pipeline — chunk, embed, build and host a vector index — that goes stale as the corpus evolves. DCI requires no offline indexing and adapts naturally to evolving local corpora, which fits self-hosted, file-on-object-storage setups where the data changes constantly and maintaining a separate vector store is operational overhead. Per [Beyond Semantic Similarity (arXiv)](https://arxiv.org/abs/2605.05242).
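The staleness problem can be shown with a toy contrast. This sketch uses a simple token-to-file snapshot as a stand-in for any offline index (it is not a vector index, and none of these function names come from the paper); the point is only that whatever is built ahead of time cannot see files written afterwards, while a direct scan reads the corpus as it exists at query time.

```python
import re
from pathlib import Path


def build_snapshot_index(root):
    """Toy offline index: map each lowercase token to the files that held it at build time."""
    index = {}
    for path in Path(root).rglob("*.txt"):
        for token in re.findall(r"\w+", path.read_text(encoding="utf-8").lower()):
            index.setdefault(token, set()).add(path.name)
    return index


def search_snapshot(index, term):
    """Answer from the snapshot only; edits made after the build are invisible."""
    return sorted(index.get(term.lower(), set()))


def search_direct(root, term):
    """Answer by scanning the files as they exist right now."""
    term = term.lower()
    return sorted(
        path.name
        for path in Path(root).rglob("*.txt")
        if term in re.findall(r"\w+", path.read_text(encoding="utf-8").lower())
    )
```

If a new file is added after `build_snapshot_index` runs, `search_snapshot` keeps returning the old answer until the index is rebuilt, whereas `search_direct` picks up the change on the next call at the cost of rescanning.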

Primary use cases

Agentic search over local or evolving corpora; code and document search without a vector DB; multi-hop QA; deep-research browsing agents; and retrieval where re-indexing cost is prohibitive.

Recent developments

Latest signals
  • Submitted to arXiv on May 3, 2026, with a 19-author roster and a public code release. The paper (Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, et al.) ships a reference implementation as DCI-Agent-Lite. Per Beyond Semantic Similarity (arXiv).
  • Reported to beat sparse, dense, and reranking baselines on BRIGHT and BEIR with no semantic retriever. The authors report strong accuracy on BrowseComp-Plus and multi-hop QA without any conventional embedding-based retriever. Per Beyond Semantic Similarity (Hugging Face Papers).
  • Part of a broader 2026 agentic-retrieval wave. Related efforts such as GrepSeek (training search agents for direct corpus interaction) and Interact-RAG signal a trend away from black-box vector retrieval. Per GrepSeek: Training Search Agents for Direct Corpus Interaction.
