Faithful evaluation of large language models (LMs) that reflects their true capabilities is becoming increasingly important for guiding model design. A canonical approach from the machine learning community is benchmarking with diverse evaluation datasets. With progress in standardizing evaluation, such as LM-Evaluation-Harness, the Open LLM Leaderboard, and HELM, users can learn about models by observing and comparing numerical task performance. These efforts seek to tackle the challenge of evaluating language models, focusing on reliability issues such as sensitivity to prompt formats and in-context examples.
A fundamental problem still remains: transforming leaderboard performance into rigorous scientific understanding and actionable insights. Without proper confounder control, naive comparisons can even lead to misleading conclusions [BSB+23, HBU+25]. For example, Qwen-2.5-3B demonstrates substantially different behaviors on math reasoning problems compared to Llama-3.2-3B [GCS+25]. However, conducting controlled studies by pre-training from scratch to thoroughly explore the design space while accounting for confounding factors can be costly.
In light of these challenges, we leverage observational data to gain insights into LM capabilities. Observational studies have long served as a crucial tool in causal inference when experiments are difficult to conduct. Historically, scientists have relied on observational data to generate hypotheses [SPP+05], inform experimental design [TBP+09], and uncover actionable insights [DPB+04].
In this work, to enable the use of causal inference, we leverage the Open LLM Leaderboard, which scores thousands of models on six benchmarks: MMLU-Pro, BBH, IFEval, GPQA, MUSR, and MATH-hard. We propose a principled approach to model and identify a hierarchy of LM capabilities using observational data alone. In Pearl's causal-graph framework, a causal edge \(A \rightarrow B\) implies that the conditional distribution \(P(B|A)\) remains unchanged if we intervene on \(A\). Intuitively, improving one skill may lift another "downstream" skill, whereas the reverse may not hold. Knowing this hierarchical structure and modularity is important for both model developers (e.g., training data selection [CRK+23]) and model evaluators (e.g., understanding skill compositions [YKG+24]).
A special case of interest is the linear causal model, where given a directed acyclic graph \(\mathcal{G}\), \(z_i = \sum_{j\in\mathrm{pa}_{\mathcal{G}}(i)} w_{ji}z_j + \epsilon_i\), where \(\mathrm{pa}_{\mathcal{G}}(i)\) is the parent node set of \(i\) in \(\mathcal{G}\), and \(\epsilon_i\)’s are independent source variables. As we have seen, causal graphs are well-suited for modeling hierarchical relationships among LM capabilities. However, directly constructing a causal graph over capabilities is problematic for two reasons.
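To make the data-generating process concrete, here is a minimal numpy sketch that samples from a linear causal model on a hypothetical three-node chain \(z_1 \rightarrow z_2 \rightarrow z_3\); the edge weights and Laplace noise are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-node chain z1 -> z2 -> z3; W[j, i] is the weight of edge j -> i
# (zero when the edge is absent). Nodes are listed in topological order.
W = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
d = W.shape[0]

def sample_linear_scm(n):
    """Draw n samples of z_i = sum_{j in pa(i)} w_{ji} z_j + eps_i."""
    eps = rng.laplace(size=(n, d))      # independent, non-Gaussian source variables
    z = np.zeros((n, d))
    for i in range(d):                  # parents always precede children
        z[:, i] = z @ W[:, i] + eps[:, i]
    return z, eps

z, eps = sample_linear_scm(1000)
```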
Under our hypotheses, our goal is to identify the latent capability factors, or, equivalently, the mapping between benchmark performance and underlying capabilities. Mathematically, let \( \theta_1, \theta_2, \cdots, \theta_K \) be the base models and \( \{x_{i}^{(k)}\}_{i=1}^{N_k} \subset \mathbb{R}^m \) be the performance of the \( N_k \) models fine-tuned from base model \( \theta_k \) across \( m \) benchmarks. Our goal is to find a matrix \( H \) such that \( z_{i}^{(k)} = Hx_{i}^{(k)}, \, i \in [N_k] \) are generated from a linear causal model. Given a linear causal model \( z_i^{(k,m)} = \sum_{j\in\mathrm{pa}_{\mathcal{G}}(i)} w_{ji}^{(k)}z_j^{(k,m)}+\epsilon_i^{(k,m)}, \, m\in[N_k] \) (the superscript \( m \) indexes fine-tuned models, while the subscripts index latent nodes), collecting the equations across nodes lets us rewrite it as \( \epsilon^{(k,m)} = B_k z^{(k,m)} \), where \( B_k = I - W_k^\top \) and \( W_k \) stacks the edge weights \( w_{ji}^{(k)} \); under a topological ordering of the nodes, \( B_k \) is lower triangular.
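A quick self-contained numerical check of this rewriting, using the toy chain from the sketch above (weights are illustrative, not estimated from data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-node chain again; W[j, i] is the weight of edge j -> i.
W = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
d = W.shape[0]

eps = rng.laplace(size=(500, d))
z = eps @ np.linalg.inv(np.eye(d) - W.T).T   # solve z = W^T z + eps for every sample

B = np.eye(d) - W.T                          # lower triangular with unit diagonal
assert np.allclose(eps, z @ B.T)             # eps^{(k,m)} = B_k z^{(k,m)} for each model m
```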
Identifying latent causal structure behind unstructured observations is a central theme in causal representation learning (CRL). We introduce Hierarchical Component Analysis (HCA), a hierarchical version of component analysis that provably recovers the latent factors when our hypotheses hold exactly.
ICA tells us how to recover a matrix \( M_k \) that maps \( x \) to the source variables \( \epsilon \), up to row permutations (corresponding to different orderings of the same source variables). To recover the causal model shown in Figure 3(c), which involves the unknown latent causal factors \( z \), we need to find matrices \( B_k \) and \( H \) such that \( M_k \) is equal to \( B_kH \) up to row permutations, where the \( B_k \)'s are lower triangular by definition.
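As a sketch of this first stage, assuming the benchmark scores of all models fine-tuned from one base are stacked in an \((N_k, m)\) array, one could use scikit-learn's FastICA; note that ICA only pins down the unmixing matrix up to row permutation and scaling, which is exactly the ambiguity the remaining steps must resolve.

```python
import numpy as np
from sklearn.decomposition import FastICA

def estimate_unmixing(X, d, seed=0):
    """Estimate an unmixing matrix M_k of shape (d, m).

    X : (N_k, m) benchmark scores of models fine-tuned from base k.
    d : assumed number of latent capability factors.
    Applying M_k to each (centered) score vector yields approximately independent
    sources; the rows are known only up to permutation and scale.
    """
    # The string whiten option requires a recent scikit-learn release.
    ica = FastICA(n_components=d, whiten="unit-variance", random_state=seed)
    ica.fit(X)
    return ica.components_   # the estimated unmixing matrix
```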
There exist multiple equivalent causal models that can fit the data in our setting. Specifically, given any invertible lower triangular matrix \( L \), the pairs \( (B_k, H) \) and \( (B_kL, L^{-1}H) \) correspond to two different causal models that induce the same \( M_k \) for all \( k \). We prove that HCA identifies the correct equivalence class [JS24].
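A three-line numerical illustration of this equivalence class, with random matrices standing in for \(B_k\) and \(H\):

```python
import numpy as np

rng = np.random.default_rng(2)
B = np.tril(rng.normal(size=(3, 3)))                     # a lower-triangular B_k
H = rng.normal(size=(3, 6))                              # a candidate H
L = np.tril(rng.normal(size=(3, 3))) + 3.0 * np.eye(3)   # any invertible lower-triangular L

# (B L, L^{-1} H) is another valid factorization inducing exactly the same M_k = B_k H.
assert np.allclose(B @ H, (B @ L) @ (np.linalg.inv(L) @ H))
```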
The key step of our algorithm, HCA, is illustrated in Figure 3(d). Suppose the row permutation is correctly specified, i.e., \( M_k=B_kH \) for all \( k\in[K] \), and let \( M_k' \) be the matrix obtained by applying Gram-Schmidt orthogonalization to the rows of \( M_k \). Since the \( B_k \)'s are lower triangular, it is not hard to see that, for every \( i \), the \( i \)-th rows of \( M_k', k\in[K] \) must all be collinear with the \( i \)-th row of \( H \), allowing us to recover \( H \). To find the correct row permutation, we iterate over all possibilities and pick the one for which the corresponding rows of \( M_k', k\in[K] \) are closest to being collinear.
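Below is a heavily simplified Python sketch of this step, assuming the per-base unmixing matrices `Ms = [M_1, ..., M_K]` have already been estimated (e.g., via ICA as above). The collinearity score, the sign alignment, and the use of a single shared permutation across bases are my own illustrative choices; the full HCA algorithm in the paper may differ in details.

```python
import numpy as np
from itertools import permutations

def gram_schmidt_rows(M):
    """Orthonormalize the rows of M in order (classical Gram-Schmidt)."""
    Q = []
    for row in M:
        v = row.astype(float).copy()
        for q in Q:
            v -= (v @ q) * q
        Q.append(v / np.linalg.norm(v))
    return np.array(Q)

def collinearity_score(Ms):
    """Average, over i, of how close the i-th orthogonalized rows are to collinear across bases."""
    Qs = [gram_schmidt_rows(M) for M in Ms]
    d = Qs[0].shape[0]
    score = 0.0
    for i in range(d):
        rows = np.stack([Q[i] for Q in Qs])          # i-th row from every base
        s = np.linalg.svd(rows, compute_uv=False)    # rank-1 stack <=> rows collinear (up to sign)
        score += s[0] / np.linalg.norm(s)            # equals 1 iff exactly rank one
    return score / d

def hca(Ms):
    """Search over row permutations of the estimated M_k = B_k H and recover H row by row."""
    d = Ms[0].shape[0]
    best_score, best_perm = -np.inf, None
    # For simplicity, this sketch applies one common permutation to every base.
    for perm in permutations(range(d)):
        s = collinearity_score([M[list(perm)] for M in Ms])
        if s > best_score:
            best_score, best_perm = s, perm
    Qs = [gram_schmidt_rows(M[list(best_perm)]) for M in Ms]
    # Align signs to the first base, then average the matched directions to estimate H's rows.
    ref = Qs[0]
    aligned = [Q * np.sign(np.sum(Q * ref, axis=1, keepdims=True)) for Q in Qs]
    return best_perm, np.mean(aligned, axis=0)
```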
Applying HCA to models fine-tuned from four popular bases – Llama-3-8B, Llama-3.1-8B, Qwen-2.5-7B, and Qwen-2.5-14B – we find that their benchmark accuracies are well described by a linear causal model with \( d=3 \) nodes. The causal graphs for each of them are shown in Figure 4.
Now that we have recovered the latent factors, interpreting them remains a challenge. Regressing the latent factors on benchmark performance shows that \(z_1\) loads primarily on general reasoning benchmarks (BBH, MMLU-Pro), \(z_2\) on instruction following (IFEval), and \(z_3\) on math reasoning (MATH-hard).
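A sketch of this interpretation step, assuming the stacked benchmark scores `X` (one row per model, columns ordered as below) and the recovered factors `Z` are available as numpy arrays; the function name and printout format are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

BENCHMARKS = ["MMLU-Pro", "BBH", "IFEval", "GPQA", "MUSR", "MATH-hard"]

def interpret_factors(X, Z):
    """Regress each latent factor on the benchmark scores and print its loadings.

    X : (N, 6) benchmark accuracies;  Z : (N, d) recovered latent factors.
    Large-magnitude coefficients indicate which benchmarks drive a factor.
    """
    for j in range(Z.shape[1]):
        coefs = LinearRegression().fit(X, Z[:, j]).coef_
        loadings = ", ".join(f"{b}: {c:+.2f}" for b, c in zip(BENCHMARKS, coefs))
        print(f"z_{j + 1}  <-  {loadings}")
```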
The celebrated scaling law literature links pre-training compute to loss, and has been extended to downstream benchmark performance. However, post-training can markedly change how these performances scale. Many existing studies run controlled experiments to understand why, but such experiments are expensive, cover only a small subset of models, and thus struggle to establish statistical reliability.
Following a similar logic, intervening on \(z_1\) could affect both \(z_2\) and \(z_3\). However, we find that \(z_1\) is predominantly determined by the pre-training stage. In Figure 6, we demonstrate that performance on BBH is well fitted by a sigmoid scaling law in pre-training compute.
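As an illustration of this kind of fit, the sketch below fits a four-parameter sigmoid in log pre-training compute using scipy; the data points are synthetic placeholders, not the leaderboard numbers behind Figure 6.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_law(log_compute, a, b, lower, upper):
    """Accuracy as a sigmoid in log pre-training compute."""
    return lower + (upper - lower) / (1.0 + np.exp(-a * (log_compute - b)))

# Synthetic (log10 FLOPs, BBH accuracy) points purely for illustration; with real
# leaderboard data these would come from each model's reported compute and score.
log_flops = np.array([21.5, 22.0, 22.5, 23.0, 23.5, 24.0, 24.5])
bbh_acc   = np.array([0.12, 0.15, 0.24, 0.38, 0.52, 0.60, 0.63])

params, _ = curve_fit(sigmoid_law, log_flops, bbh_acc,
                      p0=[2.0, 23.0, 0.1, 0.65], maxfev=10_000)
```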
We also conduct a sensitivity analysis in Appendix H [JSKZ25], where we control for base models in HCA: we choose two additional base models, Qwen-2.5-3B and Llama-3.2-3B. They have PC subspaces similar to those of the four base models used previously, but smaller sample sizes. Repeating our analysis with all six base models, we find that \(z_2\) and \(z_3\) remain consistent, while \(z_1\) now represents a tradeoff between general reasoning (BBH, MMLU-Pro) and instruction following (IFEval). This suggests that the causal link from \(z_1\) to \(z_2\) is more sensitive to the choice of base models, while the link from \(z_2\) to \(z_3\) remains more robust.
This also helps mitigate the concern that our discovered graph may simply reflect selection bias, particularly because Qwen-series models are known to exhibit distinctive behaviors in math reasoning [GCS+25].
After uncovering the hierarchy and interpreting its nodes, we can infer what data is most valuable during post-training. For example, our results suggest that fine-tuning on instruction-following tasks (intervening on \(z_2\)) should also improve math reasoning. To test this, we fine-tuned Qwen-2.5-7B, Qwen-2.5-14B, and Llama-3-8B on IFEval's instruction data (roughly 500 examples) and observed a non-negligible gain on MATH-hard (Lvl-5), an improvement achieved with surprisingly little data.
The experimental results above suggest that some of the observed improvements in math performance after post-training may stem from improved instruction-following rather than genuine gains in mathematical reasoning, the capability that the MATH benchmark aims to assess.
Overall, our results offer some pragmatic takeaways on evaluation for different parties:
Through a brief survey, we found that hierarchical capability structures are also well studied in cognitive science. Working with human subjects, cognitive scientists and psychologists have used methods strikingly similar to ours and reached findings that parallel what we observe in language models [KCB11]: researchers applied factor analysis to subjects across domains of creativity and uncovered an overarching general factor alongside seven domain-specific factors: artistic-verbal, artistic-visual, entrepreneur, interpersonal, math/science, performance, and problem solving. The study showed that domain-general capabilities may be more relevant to some domains, such as performance, than to others, such as math/science.
As the recovered capability hierarchy of language models closely mirrors how we structure human learners' capabilities, this perspective may help researchers explore the parallels between artificial and natural intelligence, facilitating more reliable development and evaluation of increasingly capable AI systems.
@article{jin2025crl,
  title={Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning},
  author={Jikai Jin and Vasilis Syrgkanis and Sham M. Kakade and Hanlin Zhang},
  journal={arXiv preprint arXiv:2506.10378},
  url={https://arxiv.org/abs/2506.10378},
  year={2025},
}