Decoding the Biological "Black Box": Sparse Autoencoders Reveal What Single-Cell AI Models Really Learn
A groundbreaking study has applied a powerful interpretability technique to two leading single-cell biology AI models, revealing that they have internalized vast amounts of organized biological knowledge, from pathways to protein interactions, but have learned almost no causal regulatory logic. By training Sparse Autoencoders (SAEs) on the Geneformer and scGPT foundation models, researchers created detailed atlases of over 107,000 interpretable features, uncovering extensive "superposition," in which networks represent more concepts than they have neurons. The findings, published on arXiv, establish that while these models are rich repositories of statistical biological relationships, their representations remain a bottleneck for predicting causal, perturbation-like effects.
Mapping the Mind of a Biological AI
To peer inside the "black box" of the models, the research team trained TopK SAEs on the residual stream activations across every layer. For the 18-layer, 316-million-parameter Geneformer V2, this yielded an atlas of 82,525 features. For the 12-layer, whole-human scGPT model, it produced 24,527 features. The analysis immediately confirmed a central tenet of mechanistic interpretability: extreme superposition. A staggering 99.8% of these learned features were invisible to traditional analysis methods like Singular Value Decomposition (SVD), meaning the models' dense activations are highly compressed mixtures of concepts.
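To make the mechanics concrete, here is a minimal sketch of a TopK SAE forward pass of the kind described above, written in NumPy. The dimensions, weight initialization, and function name are illustrative assumptions, not the study's actual configuration: a dense residual-stream vector is expanded into a much larger feature space, and only the k largest activations survive, which is what lets individual features stay interpretable.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder (illustrative sketch).

    x: (d_model,) residual-stream activation vector
    W_enc: (n_features, d_model), W_dec: (d_model, n_features)
    Only the k largest pre-activations stay nonzero, enforcing sparsity.
    """
    pre = W_enc @ x + b_enc                   # dense pre-activations
    acts = np.maximum(pre, 0.0)               # ReLU
    if k < acts.size:
        drop = np.argpartition(acts, -k)[:-k]
        acts[drop] = 0.0                      # zero out all but the top-k
    x_hat = W_dec @ acts + b_dec              # reconstruct the residual stream
    return acts, x_hat

# Toy dimensions (hypothetical; real atlases use far wider feature spaces).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = 0.1 * rng.normal(size=(n_features, d_model))
W_dec = 0.1 * rng.normal(size=(d_model, n_features))
acts, x_hat = topk_sae_forward(rng.normal(size=d_model),
                               W_enc, np.zeros(n_features),
                               W_dec, np.zeros(d_model), k)
print(int((acts > 0).sum()))  # never more than k features active
```

Because the feature space (here 64) is wider than the model dimension (here 16), a trained SAE of this shape can pull many superposed concepts apart into separate, sparsely firing features.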
The organized nature of this knowledge became clear upon systematic characterization. Between 29% and 59% of the SAE features could be annotated against major biological databases, including Gene Ontology (GO), KEGG, Reactome, STRING (protein-protein interactions), and TRRUST (transcriptional regulation). Biological specificity followed an inverted-U pattern across layers, peaking in the middle layers, consistent with hierarchical processing of information.
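Annotating a feature against a database typically comes down to an overlap-enrichment test: do the feature's top-activating genes overlap a pathway's gene set more than chance would allow? A standard choice is the hypergeometric test, sketched below with Python's standard library. The numbers and the function name are hypothetical, not taken from the study.

```python
from math import comb

def hypergeom_sf(overlap, n_universe, n_set, n_feature):
    """P(X >= overlap): chance that a random draw of n_feature genes from an
    n_universe-gene background hits at least `overlap` members of an
    n_set-gene pathway (sampling without replacement)."""
    upper = min(n_set, n_feature)
    total = comb(n_universe, n_feature)
    return sum(comb(n_set, i) * comb(n_universe - n_set, n_feature - i)
               for i in range(overlap, upper + 1)) / total

# Toy numbers: a 20,000-gene universe, a 100-gene GO term, and a feature
# whose 50 top-activating genes include 10 members of that term.
p = hypergeom_sf(10, 20000, 100, 50)
print(p < 1e-6)  # expected overlap is only ~0.25 genes, so 10 is extreme
```

With multiple-testing correction across thousands of features and gene sets, this kind of test is what turns raw SAE features into the pathway- and interaction-labeled atlas entries described above.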
From Modules to Missing Causality
Further analysis showed these features are not isolated. They organized into co-activation modules, 141 in Geneformer and 76 in scGPT, suggesting the AI has learned coherent functional units. The features also showed causal specificity (median 2.36x) and formed "information highways," with 63% to 99.8% of features propagating information across multiple model layers.
The critical test, however, was evaluating whether this knowledge included causal regulatory logic. The team rigorously tested the feature atlases against genome-scale CRISPRi perturbation data. The results were stark: only 3 of 48 transcription factors (6.2%) elicited feature responses specific to their known regulatory targets. Even a multi-tissue control strategy provided only a marginal improvement, raising the success rate to just 10.4% (5 of 48 TFs). Because the perturbation data itself is genome-scale and causal, this points to the models' internal representations, rather than the training data or task, as the primary bottleneck for encoding causal mechanisms.
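A target-specificity evaluation of this general kind can be sketched as a permutation test: a feature's response to knocking down a TF counts as "specific" only if it is concentrated on the TF's known targets rather than on random genes of the same count. The function name, gene counts, and effect size below are all hypothetical; this is the statistical shape of the test, not the study's exact protocol.

```python
import numpy as np

def target_specificity_p(responses, target_mask, n_perm=2000, seed=0):
    """One-sided permutation p-value: is the mean feature response on a TF's
    known targets higher than on equally sized random gene sets?

    responses: (n_genes,) per-gene feature-response magnitudes after a TF KD
    target_mask: boolean (n_genes,), True for the TF's known targets
    """
    rng = np.random.default_rng(seed)
    observed = responses[target_mask].mean() - responses[~target_mask].mean()
    n_targets = int(target_mask.sum())
    count = 0
    for _ in range(n_perm):
        mask = np.zeros(responses.size, dtype=bool)
        mask[rng.permutation(responses.size)[:n_targets]] = True
        diff = responses[mask].mean() - responses[~mask].mean()
        if diff >= observed:            # permuted target set does as well
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one correction avoids p = 0

# Toy example: 200 genes, 20 known targets with genuinely elevated responses.
rng = np.random.default_rng(2)
resp = rng.normal(size=200)
targets = np.zeros(200, dtype=bool)
targets[:20] = True
resp[targets] += 2.0                    # a truly target-specific response
p = target_specificity_p(resp, targets)
print(p < 0.01)
```

The study's finding is that for most TFs the real data looks like the null here: feature responses are real but spread indiscriminately across genes, so the p-value never clears the bar.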
Key Takeaways and Research Implications
- Knowledge vs. Causality: Single-cell foundation models like Geneformer and scGPT are powerful compendiums of biological relationships (pathways, interactions, modules) but lack the causal logic needed for accurate perturbation prediction.
- Superposition is Ubiquitous: The study provides concrete evidence that superposition is pervasive in biological AI models, with SAEs a necessary tool for decomposing dense, polysemantic neuron activations.
- A New Resource for Discovery: The release of both interactive feature atlases opens a new paradigm for exploration, allowing biologists to query over 107,000 interpretable features across 30 layers of two state-of-the-art models.
- Defining the Next Frontier: The work clearly outlines the challenge for the next generation of biological AI: moving beyond correlation to build models that inherently capture regulatory causality, potentially through novel architectures or training objectives.
The study concludes that while current single-cell foundation models have achieved a remarkable synthesis of biological knowledge, their utility in forward-looking, causal inference tasks is severely limited. The publicly released atlases transform these models from inscrutable functions into explorable resources, setting a new standard for transparency and paving the way for more causally aware architectures in computational biology.