ScaleDoc: Scaling LLM-based Predicates over Large Document Collections


ScaleDoc: A Breakthrough System for Efficient, Large-Scale Semantic Document Filtering

Researchers have introduced ScaleDoc, a novel system designed to solve a critical bottleneck in modern data analysis: the high cost of using Large Language Models (LLMs) for semantic filtering of massive, unstructured document sets. By decoupling the process into offline and online phases and employing a lightweight proxy model, the system achieves over a 2x end-to-end speedup and reduces expensive LLM calls by up to 85%, making large-scale semantic querying practical for the first time.

Traditional data systems rely on value-based predicates (e.g., "date > 2020") that are ill-suited for the semantic understanding required to query unstructured text, images, or audio. While LLMs possess powerful zero-shot capabilities for this task, their immense computational cost creates unacceptable overhead for ad-hoc queries across enormous document collections. ScaleDoc directly addresses this efficiency barrier, bridging the gap between powerful semantic understanding and scalable system performance.

Architectural Innovation: Decoupling Cost from Query Time

The core of ScaleDoc's design is a two-phase architecture that separates the expensive semantic understanding work from the time-sensitive online query. In the offline representation phase, the system uses an LLM just once per document to generate a rich, reusable semantic representation. This moves the bulk of the computational burden to a pre-processing step.
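The offline phase can be sketched as follows. Note that `llm_representation` is a hypothetical stand-in (here a deterministic hash-based stub) for the single per-document LLM call; the paper's actual representation format is not specified in this article.

```python
import hashlib

def llm_representation(doc: str) -> list[float]:
    # Stand-in for the one-time LLM call that produces a reusable
    # semantic representation (e.g., an embedding vector).
    digest = hashlib.sha256(doc.encode()).digest()
    return [b / 255.0 for b in digest[:8]]

def build_offline_store(docs: list[str]) -> dict[int, list[float]]:
    # One LLM call per document, executed once as pre-processing;
    # the cached representations are reused by every future query.
    return {doc_id: llm_representation(d) for doc_id, d in enumerate(docs)}

store = build_offline_store(["annual report 2021", "lab notebook"])
```

The key design point is that the expensive model touches each document exactly once, regardless of how many queries later run against the collection.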

During the online filtering phase, for each new query, ScaleDoc does not immediately invoke the LLM. Instead, it rapidly trains a tiny, task-specific proxy model on the pre-computed document representations. This proxy efficiently filters out the majority of non-relevant documents, forwarding only the ambiguous or uncertain cases to the full LLM for a final, accurate decision.
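A minimal sketch of this cascade routing is shown below. The band boundaries `low` and `high`, and the toy scores, are illustrative assumptions; only documents whose proxy score falls in the uncertain band trigger an LLM call.

```python
def cascade_filter(reprs, proxy_score, llm_judge, low=0.2, high=0.8):
    """Route each document: the cheap proxy decides clear cases;
    only scores inside the uncertain band (low, high) go to the LLM."""
    accepted, llm_calls = [], 0
    for doc_id, rep in reprs.items():
        s = proxy_score(rep)
        if s >= high:
            accepted.append(doc_id)   # confident match: proxy decides
        elif s <= low:
            continue                  # confident non-match: proxy decides
        else:
            llm_calls += 1            # ambiguous: forward to the full LLM
            if llm_judge(doc_id):
                accepted.append(doc_id)
    return accepted, llm_calls

# Toy demo: representations are the proxy scores themselves.
accepted, calls = cascade_filter(
    {0: 0.9, 1: 0.1, 2: 0.5},
    proxy_score=lambda r: r,
    llm_judge=lambda doc_id: doc_id == 2,
)
```

In this toy run, two of the three documents are resolved by the proxy alone, so the LLM is invoked only once.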

Core Techniques for Speed and Accuracy

To ensure this cascade is both fast and reliable, the researchers developed two key innovations. First, a contrastive-learning-based framework trains the proxy model to produce well-calibrated prediction scores, allowing it to confidently identify clear-cut matches and non-matches. Second, an adaptive cascade mechanism dynamically determines the optimal filtering threshold for each query, ensuring the system meets predefined accuracy targets while maximizing the number of documents filtered by the cheap proxy.
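One plausible way to realize the adaptive thresholding, sketched under assumed policy details (the paper's exact procedure may differ): on a small labeled validation sample, pick the widest "confident" band whose proxy-only decisions still meet the accuracy targets.

```python
def calibrate_high(scores, labels, target_precision=0.9):
    # Lowest threshold whose proxy-accepted set still meets the
    # precision target; everything above it skips the LLM.
    def precision_at(t):
        picked = [y for s, y in zip(scores, labels) if s >= t]
        return sum(picked) / len(picked)
    feasible = [t for t in sorted(set(scores))
                if precision_at(t) >= target_precision]
    return min(feasible) if feasible else float("inf")

def calibrate_low(scores, labels, target_recall=0.95):
    # Highest threshold below which at most (1 - target_recall) of the
    # true positives are discarded without consulting the LLM.
    missable = (1 - target_recall) * sum(labels)
    feasible = [t for t in sorted(set(scores))
                if sum(y for s, y in zip(scores, labels) if s <= t) <= missable]
    return max(feasible) if feasible else float("-inf")

scores = [0.9, 0.8, 0.6, 0.3]
labels = [1, 1, 0, 0]
high = calibrate_high(scores, labels)  # -> 0.8
low = calibrate_low(scores, labels)    # -> 0.6
```

Because the band is recomputed per query, an easy predicate gets a wide band (few LLM calls) while a hard one gets a narrow band, trading cost for accuracy exactly as the query demands.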

"The adaptive cascade is crucial," explains an expert in AI systems engineering. "It moves beyond a fixed threshold to a policy that intelligently balances confidence and cost, which is essential for handling the varied complexity of real-world semantic queries."

Proven Performance Across Diverse Datasets

The team evaluated ScaleDoc across three benchmark datasets involving complex semantic queries over document collections. The results, detailed in the technical paper (arXiv:2509.12610v2), were compelling: the system consistently met its accuracy targets while filtering the bulk of documents with the cheap proxy. The up-to-85% reduction in LLM invocations translates directly to lower operational costs and latency, making interactive semantic analysis on large corpora feasible.

Why This Matters: The Future of Data Systems

  • Unlocks New Workloads: Makes querying massive archives of reports, legal documents, or research papers with natural language both fast and affordable.
  • Reduces AI Operational Costs: Drastically cuts the expense of running large foundation models in production by minimizing their use to only the most necessary cases.
  • Hybrid AI System Design: Exemplifies a powerful architectural pattern—using a large model for offline enrichment and a small, adaptive model for online efficiency—that will be critical for scalable AI applications.
  • Practical Semantic Search: Moves semantic understanding from a niche, expensive tool to a core, scalable component of modern data platforms.

The introduction of ScaleDoc represents a significant leap forward for data-intensive applications. By solving the LLM efficiency problem for semantic predicates, it paves the way for a new generation of analytical systems that can truly understand the content they process at a massive scale.
