RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
Ashish Kattamuri, Harshwardhan Fartale, Arpita Vats, and 2 more authors
2025
Accepted as a poster at the NeurIPS 2025 LLM Evaluation Workshop
We present RADAR, a framework that uses mechanistic interpretability to identify contaminated evaluation datasets for large language models by distinguishing genuine reasoning from recall of memorized training data. The system extracts 37 features, ranging from surface-level confidence trajectories to deep mechanistic properties such as attention specialization, circuit dynamics, and activation flow patterns. An ensemble classifier achieves 93% overall accuracy, with perfect accuracy on unambiguous cases and 76.7% on challenging borderline examples. Rather than relying on traditional surface-level metrics, RADAR demonstrates how deep mechanistic analysis of model activation patterns and circuit dynamics can reveal whether strong performance stems from authentic reasoning or dataset memorization.
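The pipeline described above (per-example features fed to an ensemble) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three confidence statistics, the entropy-based proxy for attention specialization, and the soft-voting rule are all assumptions chosen for illustration, and they stand in for only a small part of the 37 features mentioned in the abstract.

```python
import math
from statistics import mean


def confidence_trajectory_features(layer_confidences):
    """Summarize how confidence in the answer evolves across layers (hypothetical features)."""
    xs = list(range(len(layer_confidences)))
    x_bar, y_bar = mean(xs), mean(layer_confidences)
    denom = sum((x - x_bar) ** 2 for x in xs)
    # Least-squares slope of confidence against layer index; 0.0 for a single layer.
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, layer_confidences)) / denom
        if denom
        else 0.0
    )
    return {
        "conf_mean": y_bar,
        "conf_final": layer_confidences[-1],
        "conf_slope": slope,
    }


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution; lower means more focused."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def extract_features(layer_confidences, attention_rows):
    """Combine surface-level and attention-based features into one dict (assumed layout)."""
    features = confidence_trajectory_features(layer_confidences)
    features["attn_entropy_mean"] = mean(attention_entropy(r) for r in attention_rows)
    return features


def soft_vote(member_probabilities, threshold=0.5):
    """Average the contamination probabilities of ensemble members (assumed voting rule)."""
    avg = mean(member_probabilities)
    return avg, avg >= threshold


# Toy inputs: confidence rising across four layers, and two attention heads.
example = extract_features(
    layer_confidences=[0.1, 0.2, 0.3, 0.4],
    attention_rows=[[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]],
)
score, flagged = soft_vote([0.9, 0.7, 0.8])
```

In a real setting the per-layer confidences and attention weights would come from the model under study, and each ensemble member would be a classifier fitted on the full feature vector; both are outside the scope of this sketch.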