ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

ScaleDoc is a novel system for efficient semantic analysis of large document collections using Large Language Models (LLMs). By implementing a two-phase cascade architecture with offline representation generation and online proxy filtering, the system reduces expensive LLM calls by up to 85% while achieving over 2x end-to-end speedup. The approach maintains high accuracy through contrastive learning and adaptive cascade mechanisms, making semantic queries practical for billions of unstructured documents.

ScaleDoc: A Breakthrough System for Efficient, Large-Scale Semantic Document Filtering

In a significant advancement for data processing, researchers have introduced ScaleDoc, a novel system designed to make semantic analysis of massive, unstructured document collections both practical and cost-effective. The system tackles a core bottleneck in modern analytics: the prohibitive expense of using powerful but computationally heavy Large Language Models (LLMs) for ad-hoc, semantic queries across billions of documents. By decoupling the process into smart offline and online phases, ScaleDoc achieves over a 2x end-to-end speedup and slashes expensive LLM calls by up to 85%, according to a new paper (arXiv:2509.12610v2).

The Challenge: Semantic Queries at Scale

Traditional data systems rely on value-based predicates—exact matches on structured fields—which are ill-suited for the nuanced, context-driven queries needed for unstructured text, images, and videos. While LLMs excel at this semantic understanding with zero-shot capability, their high inference cost creates unacceptable latency and financial overhead for large-scale, interactive workloads. This has created a pressing need for a hybrid architecture that preserves accuracy while dramatically improving efficiency.
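The gap between the two predicate types can be illustrated with a small sketch. Here `llm_judge` is a hypothetical stand-in for an expensive LLM call (stubbed with a keyword heuristic purely for illustration); the names and data are invented, not from the paper.

```python
# A value-based predicate is an exact match on a structured field; a
# semantic predicate requires model-based judgment over free text.

def value_predicate(row: dict) -> bool:
    # Exact match: cheap, deterministic, index-friendly.
    return row["department"] == "legal"

def llm_judge(text: str) -> bool:
    # Hypothetical placeholder for an LLM answering a natural-language
    # question such as "Does this document discuss a contract dispute?"
    return "contract" in text.lower() and "dispute" in text.lower()

docs = [
    {"department": "legal", "text": "Routine NDA renewal, no issues."},
    {"department": "sales", "text": "Escalating contract dispute with a vendor."},
]

value_hits = [d for d in docs if value_predicate(d)]
semantic_hits = [d for d in docs if llm_judge(d["text"])]
```

The value predicate finds the `legal` row, while the semantic predicate surfaces the `sales` row that a structured filter would miss; at billions of documents, running the real LLM judge on every row is what becomes prohibitive.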

How ScaleDoc Works: A Two-Phase, Cascade Architecture

ScaleDoc's innovation lies in its elegant decoupling of the predicate execution pipeline. In an offline representation phase, the system uses an LLM just once per document to generate a rich, semantic vector representation, which is then stored. The online filtering phase handles user queries by training an ultra-lightweight proxy model (like a small neural network) on these pre-computed representations. This proxy filters out the majority of clearly irrelevant documents, forwarding only the ambiguous or borderline cases to the full LLM for a final, high-confidence decision.
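The control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` stands in for the one-time offline LLM representation, `proxy_score` for the lightweight per-query proxy, and `llm_decide` for the full-LLM fallback; all three are invented stubs.

```python
import numpy as np

def embed(doc: str) -> np.ndarray:
    # Offline phase: one LLM pass per document, computed once and stored.
    # Stubbed here as a deterministic hash-seeded vector.
    return np.random.default_rng(abs(hash(doc)) % 2**32).normal(size=8)

def proxy_score(vec: np.ndarray, w: np.ndarray) -> float:
    # Online phase: a tiny linear proxy trained per query on the
    # pre-computed vectors; returns a confidence in [0, 1].
    return 1.0 / (1.0 + np.exp(-vec @ w))

def llm_decide(doc: str) -> bool:
    # Expensive fallback, invoked only for borderline documents.
    return "relevant" in doc

def cascade(docs, w, lo=0.2, hi=0.8):
    results, llm_calls = {}, 0
    for doc in docs:
        s = proxy_score(embed(doc), w)
        if s >= hi:
            results[doc] = True             # proxy accepts confidently
        elif s <= lo:
            results[doc] = False            # proxy rejects confidently
        else:
            results[doc] = llm_decide(doc)  # borderline: escalate to the LLM
            llm_calls += 1
    return results, llm_calls
```

The savings come from the two confident branches: the more documents the proxy resolves outside the `(lo, hi)` band, the fewer LLM invocations are needed.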

Core Innovations Driving Efficiency

The system's performance gains are powered by two key technical contributions. First, a contrastive-learning framework trains the proxy model not just to classify, but to generate reliable prediction scores that accurately reflect its confidence, enabling precise filtering. Second, an adaptive cascade mechanism dynamically determines the optimal filtering threshold for each query, ensuring the system meets predefined accuracy targets (e.g., 99% recall) while maximizing the number of documents resolved by the cheap proxy model.
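The second contribution, adaptive thresholding, can be sketched as a simple search over a held-out validation set. This is an assumed formulation, not the paper's algorithm: given `(proxy_score, llm_label)` pairs, it picks the highest rejection cutoff whose recall on positives still meets the target, so the cheap proxy discards as much as possible.

```python
def pick_reject_threshold(scores, labels, target_recall=0.99):
    # scores: proxy confidence per validation document.
    # labels: 1 if the LLM judged the document relevant, else 0.
    pairs = sorted(zip(scores, labels))
    total_pos = sum(labels)
    lost_pos, best = 0, 0.0
    for s, y in pairs:
        lost_pos += y  # positives lost by rejecting every score <= s
        if total_pos and (total_pos - lost_pos) / total_pos >= target_recall:
            best = s   # still meets the recall target; raise the cutoff
        else:
            break      # raising further would violate the target
    return best
```

A stricter target (e.g. 99% recall) forces a lower cutoff and more LLM calls, while a looser one lets the proxy reject more aggressively; recomputing the cutoff per query is what makes the cascade adaptive.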

Proven Performance Across Diverse Datasets

Evaluations across three benchmark datasets demonstrate ScaleDoc's robust effectiveness. The system consistently maintains high accuracy while delivering dramatic efficiency improvements. Reducing LLM invocations by up to 85% translates directly into lower operational costs and faster query response times, making interactive semantic search over colossal document corpora feasible for enterprises.

Why This Matters: The Future of Enterprise AI Analytics

The introduction of ScaleDoc represents a pivotal step toward sustainable and scalable AI-powered data analysis. As noted in the research, the approach makes large-scale semantic filtering practical. This has profound implications for enterprise search, legal discovery, content moderation, and research, where the ability to quickly ask complex questions of vast, unstructured data repositories is paramount.

Key Takeaways

  • Solves a Critical Bottleneck: ScaleDoc directly addresses the high cost and latency of using LLMs for semantic queries on massive document sets.
  • Hybrid, Efficient Architecture: Its two-phase design decouples expensive LLM processing from real-time querying, using a lightweight proxy model for initial filtering.
  • Significant Performance Gains: The system achieves over 2x end-to-end speedup and reduces costly LLM calls by up to 85% while maintaining high accuracy.
  • Enables New Applications: This efficiency breakthrough makes interactive, semantic analysis of billion-document databases economically and technically viable.
