ScaleDoc: An Efficient System for Large-Scale Semantic Document Filtering
In the era of unstructured data, a new system called ScaleDoc aims to make querying vast document collections far cheaper by dramatically cutting the cost of using Large Language Models (LLMs). Described in a recent arXiv paper (2509.12610v2), the system tackles a critical bottleneck: while LLMs excel at understanding semantic queries, their high computational expense makes them impractical for filtering millions of documents in real time. ScaleDoc addresses this by decoupling the process into offline preparation and an intelligent online cascade, achieving over a 2x end-to-end speedup and reducing costly LLM calls by up to 85%.
The Core Challenge: Semantic Queries at Scale
Traditional data systems rely on value-based predicates (e.g., "date = 2024"), but modern analysis of unstructured documents such as reports, emails, or articles requires semantic understanding (e.g., "find documents discussing ethical AI governance"). LLMs have powerful zero-shot capabilities for this task, but their immense inference cost creates unacceptable latency and expense for large-scale, ad-hoc queries. The research team observed that for a typical query most documents are non-matches; the key was to filter those out cheaply before invoking the heavy LLM.
How ScaleDoc Works: A Two-Phase, Adaptive Architecture
ScaleDoc's innovation lies in its split architecture. In an offline representation phase, the system uses an LLM once per document to generate a dense, semantic vector representation, effectively creating a searchable index of meaning. The savings come in the online phase: for each user query, ScaleDoc does not immediately invoke the LLM. Instead, it employs two core techniques.
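The offline phase amounts to a one-time embedding pass that stacks per-document vectors into an index. The sketch below is purely illustrative and not the paper's implementation: the `embed` function is a hashing stand-in for the LLM-derived representation described above, and the sample documents are made up.

```python
import numpy as np

# Stand-in for the LLM representation step: a real system would call an
# embedding model once per document. Here we hash tokens into a
# fixed-size bag-of-words vector purely for illustration.
def embed(text: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

documents = [
    "quarterly report on AI governance policy",
    "recipe for sourdough bread",
    "memo discussing ethical AI oversight",
]

# Offline: compute one representation per document, stack into an index
# that every subsequent query can reuse without touching the LLM again.
index = np.stack([embed(d) for d in documents])
print(index.shape)  # (3, 64)
```

Because this pass runs once per document rather than once per (document, query) pair, its cost is amortized across all future queries.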
First, it uses a contrastive-learning-based framework to rapidly train a tiny, lightweight proxy model (such as a small neural network) on the pre-computed document representations. This proxy learns to score how well each document matches the query's intent. Second, an adaptive cascade mechanism turns these scores into a filtering policy: documents with clearly high or low scores are accepted or rejected immediately, and only the ambiguous cases, a small fraction, are forwarded to the full LLM for a final, accurate decision, ensuring the system meets predefined accuracy targets.
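The cascade step can be illustrated with a short sketch. The scores and thresholds below are hypothetical values chosen for illustration; the paper tunes its thresholds adaptively to meet a target accuracy, which this toy version does not do.

```python
import numpy as np

def cascade_filter(scores: np.ndarray, low: float, high: float):
    """Route each document by its proxy score: accept outright,
    reject outright, or defer the ambiguous middle band to the LLM."""
    accept = scores >= high
    reject = scores <= low
    defer = ~(accept | reject)
    return accept, reject, defer

# Hypothetical proxy scores for ten documents (not from the paper).
scores = np.array([0.97, 0.03, 0.55, 0.91, 0.12, 0.48, 0.88, 0.07, 0.61, 0.95])

accept, reject, defer = cascade_filter(scores, low=0.2, high=0.8)
print(int(defer.sum()))  # 3 -- only the ambiguous docs reach the LLM
```

With these illustrative thresholds, seven of ten documents are decided by the cheap proxy alone, which is the mechanism behind the reported reduction in LLM invocations.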
Proven Performance and Practical Impact
Evaluations across three diverse datasets confirm the system's efficacy. By slashing LLM invocations by up to 85%, ScaleDoc makes large-scale semantic search both practical and cost-effective. The reported 2x end-to-end speedup translates to faster insights for analysts and researchers and enables new applications previously hindered by cost and latency. This approach provides a scalable blueprint for integrating powerful but expensive foundation models into production data systems.
Why This Matters: Key Takeaways
- Bridges a Critical Gap: ScaleDoc directly addresses the trade-off between the semantic power of LLMs and their prohibitive operational costs for large-scale filtering.
- Hybrid Intelligence Model: The system exemplifies a powerful trend of using a small, adaptive model to handle routine decisions, reserving the large, expensive model only for complex edge cases.
- Enables New Applications: By making semantic analysis of massive document corpora efficient, it opens doors for real-time compliance monitoring, intelligent enterprise search, and large-scale qualitative research.
- Architectural Blueprint: The decoupled, two-phase design with an adaptive cascade is a significant contribution likely to influence the design of future data analysis systems.