FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing

FinTexTS is a financial dataset that pairs news articles with stock time-series data using embedding-based semantic matching and multi-level classification. The framework uses large language models to classify news into four contextual levels (macroeconomic, sector, related company, and target company), addressing the limitations of traditional keyword matching. According to the reported experiments, FinTexTS enables more accurate stock price forecasting than datasets built with conventional keyword matching.

Semantic AI Framework Transforms Financial Forecasting with Multi-Level News Analysis

Researchers have introduced a framework designed to overcome a critical limitation in financial market prediction: existing models struggle to capture the multi-layered interdependencies that drive stock prices. Rather than relying on keyword matching, the method uses semantic similarity and large language models (LLMs) to pair financial news articles with time-series stock data across four distinct contextual levels. The approach was used to construct FinTexTS, a new large-scale, semantically paired dataset that the authors report improves performance on stock price forecasting tasks.

The core innovation addresses a well-known challenge in quantitative finance. A company's stock performance is influenced by many interacting factors, including its own corporate events, actions by competitors or partners, sector-wide trends, and broader macroeconomic shifts. Traditional datasets that link text to numerical data often rely on basic keyword overlaps, such as a company's ticker symbol appearing in a news headline. Keyword overlap misses semantic relationships that matter for forecasting: a report on a supplier's bankruptcy, for example, may never mention the manufacturer it affects.

A Two-Stage Framework for Intelligent Data Pairing

The proposed framework uses a two-stage methodology for creating high-quality, text-paired financial datasets. First, it establishes a semantic-based pairing mechanism. Instead of scanning news for simple keywords, the system extracts company-specific context from authoritative sources such as SEC filings. It then uses embedding-based similarity matching to retrieve news articles that are semantically relevant to this context, so that an article can be paired with a company even when the company is never named in it.
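The retrieval step described above can be illustrated with a minimal sketch. It assumes that the company context and each news article have already been converted to embedding vectors by some model; the toy three-dimensional vectors, the function names, and the `top_k` and `min_score` parameters are hypothetical and are not taken from the paper. Cosine similarity is used here as one common choice of similarity measure, which the source does not specify.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_relevant_news(
    context_vec: List[float],
    news_vecs: Dict[str, List[float]],
    top_k: int = 2,          # hypothetical cap on articles kept per company
    min_score: float = 0.5,  # hypothetical similarity threshold
) -> List[Tuple[str, float]]:
    """Rank articles by similarity to the company context and keep the best."""
    scored = [
        (news_id, cosine_similarity(context_vec, vec))
        for news_id, vec in news_vecs.items()
    ]
    kept = [(news_id, score) for news_id, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy embeddings (assumed): in practice these would come from an embedding
# model applied to filing text and news text.
company_context = [0.9, 0.1, 0.0]
news = {
    "supplier_bankruptcy": [0.8, 0.2, 0.1],
    "sports_result": [0.0, 0.1, 0.9],
    "chip_demand": [0.7, 0.0, 0.3],
}

matches = retrieve_relevant_news(company_context, news)
for news_id, score in matches:
    print(f"{news_id}: {score:.3f}")
```

In this toy example the unrelated article falls below the threshold and is dropped, while the two thematically close articles are retained without any shared keyword being required.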

Second, the framework implements a multi-level classification system. A large language model analyzes each retrieved news article and assigns its primary impact to one of four hierarchical levels: macro level (economy-wide), sector level (industry-specific), related-company level (involving other firms), and target-company level (directly about the firm in question). This structured classification allows forecasting models to weight and interpret information according to its scope of impact.
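One way to represent the four levels in a data pipeline is sketched below. The enum, the label normalization, and the `classify` callable are all assumptions made for illustration; the paper's actual prompts, labels, and model are not described in this article. The `toy_classifier` is only a stand-in so the example runs without an LLM and should not be read as the classification method itself.

```python
from collections import defaultdict
from enum import Enum
from typing import Callable, Dict, List


class NewsLevel(Enum):
    """The four contextual levels named in the article."""
    MACRO = "macro"
    SECTOR = "sector"
    RELATED_COMPANY = "related_company"
    TARGET_COMPANY = "target_company"


def parse_level(raw_label: str) -> NewsLevel:
    """Normalize a free-text label (e.g. from an LLM reply) to a NewsLevel.

    Raises ValueError if the label is not one of the four levels.
    """
    normalized = raw_label.strip().lower().replace("-", "_").replace(" ", "_")
    return NewsLevel(normalized)


def group_by_level(
    articles: List[str],
    classify: Callable[[str], str],  # hypothetical hook; would wrap an LLM call
) -> Dict[NewsLevel, List[str]]:
    """Classify each article and bucket the articles by level."""
    grouped: Dict[NewsLevel, List[str]] = defaultdict(list)
    for article in articles:
        grouped[parse_level(classify(article))].append(article)
    return dict(grouped)


def toy_classifier(article: str) -> str:
    """Stand-in for an LLM call, used only to make the sketch runnable."""
    text = article.lower()
    if "central bank" in text:
        return "macro"
    if "industry" in text:
        return "sector"
    if "supplier" in text:
        return "related company"
    return "Target-Company"


articles = [
    "Central bank raises interest rates",
    "Chip industry faces export limits",
    "Key supplier files for bankruptcy",
    "Firm announces record quarterly earnings",
]

grouped = group_by_level(articles, toy_classifier)
for level, items in grouped.items():
    print(level.value, "->", items)
```

Keeping the level as a typed field lets a downstream forecasting model treat each bucket as a separate input channel, and rejecting unknown labels guards against malformed model output.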

Building FinTexTS and Validating Performance

Applying this framework to publicly available news corpora resulted in the creation of FinTexTS. Experiments conducted on this new dataset confirmed that the semantic, multi-level pairing strategy leads to more accurate stock price forecasts compared to methods using datasets built with conventional keyword matching. The research, documented in the preprint arXiv:2603.02702v1, provides empirical evidence for the value of context-aware data construction.

The study also found that the framework's benefits grow with higher-quality input data. When applied to proprietary, carefully curated news sources (which presumably offer greater depth, accuracy, and timeliness), the pairing process yielded an even higher-quality dataset. This dataset in turn drove further improvements in forecasting performance, suggesting that the pairing methodology and the quality of the news source contribute complementary gains.

Why This Matters for AI and Finance

  • Closes a Critical Data Gap: The finance industry has abundant numerical and textual data, but effectively fusing them has been a bottleneck. This framework provides a scalable, intelligent method to build the high-fidelity datasets needed for next-generation AI models.
  • Enhances Model Interpretability: By classifying news into specific impact levels (macro, sector, etc.), the framework makes AI-driven forecasts more interpretable. Analysts can understand not just the prediction, but the *type* of news influencing it.
  • Unlocks Alpha in Alternative Data: The ability to semantically link diverse news to specific companies allows quantitative funds to more systematically mine alternative data sources, potentially uncovering non-obvious signals for trading strategies.
  • Demonstrates LLMs' Practical Utility: This work is a concrete example of using LLMs not as an end-user chatbot, but as a powerful tool within a data-engineering pipeline to add structure and semantic understanding to unstructured text at scale.

The introduction of this semantic pairing framework and the FinTexTS dataset marks a significant step toward AI systems that can reason about financial markets with a sophistication closer to that of a human analyst, considering the web of relationships that truly move prices.
