Sanity Checks for Distributed Alignment Search

Interpretability methods like linear probes investigate what is represented in particular parts of a neural network. What we usually care about, though, is whether and how the model uses that representation to make predictions. On a strong reading, this is a question about the causal relationship between representations and outputs. We can call this the causal interpretability question.

Distributed Alignment Search (DAS) is one of the more promising methods in the causal interpretability toolkit. Introduced by Geiger et al. (2024), DAS uses gradient descent to find alignments between interpretable high-level causal variables and distributed neural representations. The core claim is that a BERT model fine-tuned on a natural language inference task implements, at the level of its internal representations, a symbolic algorithm with discrete variables for negation and lexical entailment. It is little exaggeration to say that this is the current flagship result for causal interpretability.

But how robust is this claim? In their paper, Geiger et al. test DAS on a randomly initialized network for their first experiment: a simple feed-forward network on a hierarchical equality task. They find that Interchange Intervention Accuracy (IIA, the metric defined in the background section below) stays near chance for small networks but creeps upward as hidden dimensionality grows. This is already worrying. Moreover, for the flagship experiment involving BERT on Monotonicity NLI (henceforth MoNLI), they don't run this check. BERT has a 768-dimensional hidden representation. With an intervention size of 256, DAS is searching through a third of the entire representation space. Is that enough room for the method to find spurious alignments through geometric coincidence rather than genuine causal structure?

It looks like the answer is no. Following the methodology of Adebayo et al. (2018), who proposed randomization-based sanity checks for saliency maps, I designed three basic robustness tests for DAS on the MoNLI experiment. The good news is that DAS passed all of them.

Test 01: Full Model Randomization. IIA ≈ 0.39 on random network.
Test 02: Cascading Randomization. Graded, layer-localized degradation.
Test 03: Causal Specificity. IIA ≈ 0.50 for wrong causal models.

Background: How DAS Works

The setup is as follows. You have a trained neural network (the "low-level model") and an interpretable causal model (the "high-level model") that you hypothesize explains the network's behavior. For MoNLI, the high-level model says:

Compute whether negation is present in the premise and hypothesis → binary variable V1
Compute the lexical entailment relation between the key words → binary variable V2
If negation is present, reverse the entailment relation; otherwise, output it directly
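
As a minimal sketch, the three steps above can be written as a two-variable program. The 0/1 label encoding and the function names are assumptions made for illustration; they are not taken from the paper's code. Under that encoding, "reverse the entailment relation" becomes an XOR.

```python
ENTAILMENT, NEUTRAL = 1, 0  # hypothetical label encoding, chosen for illustration


def high_level_model(negation: int, lexical_entailment: int) -> int:
    """Return the predicted label from the two binary variables.

    V1 = negation (1 if negation is present), V2 = lexical_entailment.
    Negation reverses the relation; otherwise V2 is passed through.
    """
    return lexical_entailment ^ negation  # XOR flips V2 exactly when V1 == 1


def interchange(base: dict, source: dict, variable: str) -> int:
    """High-level interchange: copy one variable from source into base, then rerun."""
    patched = dict(base)
    patched[variable] = source[variable]
    return high_level_model(patched["V1"], patched["V2"])
```

For example, a base input with no negation and a positive lexical relation (V1=0, V2=1) yields ENTAILMENT; swapping in V1=1 from a source input flips the prediction to NEUTRAL. These counterfactual labels are what the network is expected to reproduce under the corresponding low-level intervention.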

DAS asks: is there a rotation of the [CLS] token representation at a given layer such that, in the rotated basis, two orthogonal subspaces cleanly encode V1 and V2? "Cleanly encode" is operationalized through interchange interventions: take a base input and a source input, swap the subspace corresponding to (say) V1 from the source into the base, unrotate, and feed the result through the remaining layers. If the output matches what the high-level model predicts should happen when you swap V1, that's a hit. The proportion of hits is the Interchange Intervention Accuracy (IIA).
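
The interchange step and the IIA metric can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the paper's implementation: the rotation here is a random orthogonal matrix rather than a learned one, the vectors stand in for [CLS] representations, and all function names are hypothetical.

```python
import numpy as np


def random_rotation(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Sample an orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def interchange_intervention(base, source, rotation, start, stop):
    """Swap coordinates [start, stop) of the rotated basis from source into base."""
    base_rot = rotation.T @ base      # coordinates of base in the rotated basis
    source_rot = rotation.T @ source  # coordinates of source in the rotated basis
    base_rot[start:stop] = source_rot[start:stop]
    return rotation @ base_rot        # unrotate back to the model's own basis


def interchange_intervention_accuracy(predicted, expected) -> float:
    """Fraction of interventions whose output matches the high-level model."""
    predicted, expected = np.asarray(predicted), np.asarray(expected)
    return float((predicted == expected).mean())
```

In the real setup the patched vector would be fed through the remaining layers, and the resulting prediction compared against the label the high-level model assigns to the same swap.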

DAS learns the rotation matrix via gradient descent, minimizing a cross-entropy loss between the network's output under the intervention and the counterfactual label that the high-level model predicts for the same swap. The model weights are frozen; only the rotation matrix is trained.

Replicating the Baseline

I first replicated the MoNLI DAS experiment following the paper's Appendix A.2: fine-tune ishan/bert-base-uncased-mnli on 10K MoNLI examples (5 epochs, lr 2e-5, batch size 32), then train the DAS rotation matrix on layer 9 with intervention size 256 (5 epochs, lr 2e-3, batch size 64, 24K training examples). Three random seeds.

Seed	Factual F1	DAS IIA
42	0.999	0.892
66	0.999	0.872
77	1.000	0.952

The best seed reaches IIA = 0.952 (the three seeds average 0.905), reasonably close to the paper's reported 1.00. The gap likely comes from minor differences in data sampling or tokenization. Good enough to serve as a baseline; the tests below use the 0.952 run.

Test 1: Full Model Randomization — Passed

Question: If we randomize all of BERT's weights, can DAS still find high IIA?

This is the most basic sanity check. A fully random network has task accuracy at chance and no learned representations. If DAS finds high IIA here, it means the method is exploiting the geometry of the 768-dimensional space rather than discovering genuine structure.

I initialized a BERT model with the same architecture but fully random weights (normal distribution, std=0.02, matching BERT's initialization scheme). I then ran DAS with the same settings as the baseline across three intervention sizes.
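
To make the randomization concrete, here is a toy sketch that re-draws every parameter array from a normal distribution with std 0.02. The dictionary of arrays is a stand-in for a model's parameters, and the function name is hypothetical. Whether biases and LayerNorm parameters are re-drawn or reset to constants is an assumption this sketch glosses over.

```python
import numpy as np


def randomize_weights(state, std=0.02, seed=0):
    """Return a copy of `state` with every array re-drawn from N(0, std^2).

    `state` maps parameter names to NumPy arrays; the input is left untouched.
    """
    rng = np.random.default_rng(seed)
    return {name: rng.normal(0.0, std, size=w.shape) for name, w in state.items()}
```

Restricting the dictionary to a subset of parameter names gives partial randomization of the kind used in Test 2.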

Condition	Dim 64	Dim 128	Dim 256
Trained model	—	—	0.952
Random model	0.360	0.385	0.393

IIA on the random network is well below chance (0.50) at every intervention size tested. DAS finds no meaningful alignment in this random network.

Note that the authors' own results on the hierarchical equality task showed IIA climbing to 0.64 when the hidden dimension was 256× the input dimension. In the BERT case, the ratio is more favorable (768-dim hidden, 256-dim intervention), and DAS stays firmly below chance.

Test 2: Cascading Randomization — Passed

Question: How does IIA degrade as we progressively destroy the learned weights?

Following Adebayo et al.'s cascading randomization protocol, I started with the trained model and progressively randomized layers from the top down: first the classifier head, then encoder layer 11, then layers 10–11, and so on until the entire model was random.

Step	Layers Randomized	Task F1	IIA
0	None (baseline)	0.999	0.952
1	Classifier	0.760	0.947
2	+ Layer 11	0.333	0.682
3	+ Layer 10	0.333	0.699
4	+ Layer 9	0.333	0.551
5	+ Layer 8	0.333	0.517
6	+ Layer 7	0.333	0.350
7	+ Layer 6	0.333	0.328
8	+ Layer 5	0.333	0.340
9	+ Layer 4	0.333	0.328
10	+ Layer 3	0.333	0.364
11	+ Layer 2	0.333	0.416
12	+ Layer 1	0.333	0.342
13	+ Layer 0	0.333	0.336
14	+ Embeddings (all)	0.333	0.339
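
The randomization order in the table above can be generated mechanically. The sketch below only builds the schedule of component names; the names themselves are hypothetical labels, not parameter names from any particular BERT implementation.

```python
def cascading_schedule(num_layers: int = 12) -> list[list[str]]:
    """Return the cumulative set of randomized components at each step.

    Step 0 is the untouched baseline; later steps add the classifier head,
    then encoder layers from the top down, then the embeddings.
    """
    order = ["classifier"]
    order += [f"layer_{i}" for i in range(num_layers - 1, -1, -1)]
    order.append("embeddings")
    return [order[:step] for step in range(len(order) + 1)]
```

With 12 encoder layers this yields 15 entries (steps 0 to 14); the entry for step 4, for instance, is the classifier plus layers 11, 10, and 9.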

Three regimes emerge:

Classifier only (step 1): IIA barely moves (0.952 → 0.947), even though task F1 drops to 0.76. The causal structure lives in the representations, not in the readout layer. This is a nice validation of the causal abstraction framework's core premise.

Layers 8–11 (steps 2–5): Sharp degradation from 0.95 to 0.52. Randomizing layer 11 on top of the classifier causes a large drop (to 0.68), and randomizing layer 9, the intervention site itself, drops IIA further to 0.55. DAS depends on both the learned representations at the intervention site and the downstream layers that read them.

Layers 0–7 and embeddings (steps 6–14): IIA plateaus around 0.33–0.36, with one excursion to 0.42 at step 11, converging to the same floor as the fully random model. Once layer 7 and everything above it have been randomized, further randomization makes little additional difference.
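
A quick arithmetic check on the IIA column of the cascading table shows where the drops actually occur. The values are copied from the table; the variable names are just for illustration.

```python
# IIA at steps 0..14, copied from the cascading randomization table.
iia_by_step = [0.952, 0.947, 0.682, 0.699, 0.551, 0.517, 0.350, 0.328,
               0.340, 0.328, 0.364, 0.416, 0.342, 0.336, 0.339]

# Change in IIA caused by each step (deltas[0] is the change at step 1).
deltas = [round(b - a, 3) for a, b in zip(iia_by_step, iia_by_step[1:])]

# Steps at which IIA went up rather than down.
increases = [step for step, d in enumerate(deltas, start=1) if d > 0]

# Step with the single largest drop.
largest_drop_step = min(range(1, len(iia_by_step)), key=lambda s: deltas[s - 1])

# Range of IIA once the plateau is reached (steps 6..14).
floor = iia_by_step[6:]
```

Here `increases` comes out as steps 3, 8, 10, 11, and 14, so the decline is graded but not strictly monotonic. The largest single drop is at step 2 (layer 11), and the plateau values range from 0.328 to 0.416.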

This is the opposite of what Adebayo et al. found for Guided BackProp, which remained invariant to upper-layer randomization. DAS is genuinely sensitive to learned parameters in a graded, layer-localized way.

Test 3: Causal Model Specificity — Passed

Question: Does DAS discriminate between the correct causal model and incorrect ones?

I tested three conditions, all using the same trained BERT model with the same DAS hyperparameters:

Correct model: The standard "Negation + Lexical Entailment" causal model from the paper.
Shuffled counterfactual labels: Same base-source pairs and intervention structure, but the gold counterfactual labels are randomly permuted.
Random binary variables: Two random binary labels per example (independent of input content), with IIT data constructed from these fake variables.
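
The two control conditions could be constructed roughly as follows. This is a sketch of the idea rather than the code used for the experiment; the function names and the seeding are assumptions.

```python
import random


def shuffled_labels(gold_labels, seed=0):
    """Permute gold counterfactual labels across examples, keeping the label counts."""
    rng = random.Random(seed)
    shuffled = list(gold_labels)  # copy so the original labels are not modified
    rng.shuffle(shuffled)
    return shuffled


def random_binary_variables(num_examples, seed=0):
    """Assign two binary variables per example, independent of the input content."""
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(num_examples)]
```

Shuffling preserves the label distribution while breaking the link between each base-source pair and its label. The random variables break the link between the inputs and the high-level variables themselves.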

Condition	IIA	DAS Loss
Correct model	0.952	77.4
Shuffled labels	0.507	679.7
Random binary variables	0.503	2372.5

DAS achieves high IIA only for the correct causal model. Both wrong models land essentially at chance (0.507 and 0.503). A learned rotation does not appear to be expressive enough to fit arbitrary counterfactual mappings.

Discussion

DAS passes all three sanity checks cleanly:

Test 01: Full Model Randomization. IIA ≈ 0.39 on random network (below chance). DAS requires learned weights.
Test 02: Cascading Randomization. IIA degrades to the random-model floor, though not strictly monotonically; the largest drops come from randomizing layer 11, layer 9 (the intervention site), and layer 7.
Test 03: Causal Specificity. IIA ≈ 0.50 for wrong causal models. DAS discriminates correct structure.

These results stand in interesting contrast to recent findings on other interpretability methods. Adebayo et al. (2018) showed that Guided BackProp is invariant to model parameters. More recently, a February 2026 paper applied similar sanity checks to Sparse Autoencoders (SAEs) and found that frozen baselines, where encoder or decoder weights are randomly initialized and never trained, match fully-trained SAEs on standard evaluation metrics including interpretability scores, sparse probing, and causal editing. The authors conclude that current SAE evaluation metrics are too weak to distinguish genuine feature learning from exploitation of random structure.

DAS does not have this problem. Its theoretical grounding in causal abstraction provides a built-in specificity test: interchange interventions create counterfactuals that must match a specific causal model, not just look interpretable. The rotation matrix is constrained to be orthogonal (preserving distances) and is the only learned component; the neural network is frozen. These constraints appear to be sufficient to prevent the kind of spurious alignment that afflicts less structured approaches.

That said, the tests presented here are clearly not conclusive. For instance, I only tested obviously wrong causal models; a stronger test would use more plausible but still incorrect models. I leave this for future work.

Acknowledgments. This experiment was conducted on Google Colab (L4 GPU). The codebase is adapted from Geiger et al.'s original implementation. All code is available at github.com/shengweiming/DASExperiment.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., & Kim, B. (2018). Sanity checks for saliency maps. NeurIPS 2018.

Geiger, A., Wu, Z., Potts, C., Icard, T., & Goodman, N. D. (2024). Finding alignments between interpretable causal variables and distributed neural representations. Proceedings of Machine Learning Research, 236, 160–187.

Geiger, A., Richardson, K., & Potts, C. (2020). Neural natural language inference models partially embed theories of lexical entailment and negation. BlackboxNLP 2020.

Hewitt, J., & Liang, P. (2019). Designing and interpreting probes with control tasks. EMNLP-IJCNLP 2019.