Alessandro Stolfo

OAT Y 24

Andreasstrasse 5

8050 Zürich, Switzerland

Hi! I am a doctoral student at the Institute for Machine Learning at ETH Zürich, where I am advised by Prof. Mrinmaya Sachan and co-advised by Prof. Yonatan Belinkov (Technion).

My research focuses on the interpretability and reliability of (large) language models. I study how models represent information and how those representations shape behavior and errors. I am excited to leverage these insights to design methods that make language models better and safer.

In summer 2024, I interned with the AI Frontiers group at Microsoft Research in Redmond, WA, where I had the opportunity to collaborate with Besmira Nushi and Eric Horvitz. Previously, in summer 2023, I interned with the Machine Learning Research Group at Oracle Labs in Burlington, MA, working with Ari Kobren.

Before starting my doctoral studies, I obtained a Master’s degree in Data Science from ETH Zürich and worked as a software engineer at Rethink-Resource. I completed my undergraduate studies in Computer Engineering at Politecnico di Milano.

I am grateful to be a recipient of the CYD Doctoral Fellowship.

For ETH students: Feel free to reach out via email if you’re interested in having me supervise your MSc thesis or semester project. I welcome project proposals, but even if you don’t have concrete ideas and are simply passionate about leveraging interpretability to improve models, please don’t hesitate to contact me. I typically allocate my supervision capacity 4-6 weeks before the semester starts, so that’s the best time to reach out.

news

Jul 24, 2025 Gave a talk at NEC Labs EU about our recent work on LLM steering. Check it out on YouTube.
May 28, 2024 Interning in the AI Frontiers group at Microsoft Research in Redmond, WA.
Nov 22, 2023 Attending the ML Alignment & Theory Scholars (MATS) Program, mentored by Neel Nanda.
Jul 17, 2023 Started my internship in the ML Research Group at Oracle Labs in Burlington, MA.
Apr 22, 2022 Answered a couple of questions for this EPFL News article. Check it out!

selected publications

  1. NeurIPS 2025
    Dense SAE Latents Are Features, Not Bugs
    X. Sun*, A. Stolfo*, J. Engels, B. Wu, S. Rajamanoharan, M. Sachan, and M. Tegmark
  2. ICLR 2025
    Improving Instruction-Following in Language Models through Activation Steering
    A. Stolfo, V. Balachandran, S. Yousefi, E. Horvitz, and B. Nushi
  3. NeurIPS 2024
    Confidence Regulation Neurons in Language Models
    A. Stolfo*, B. Wu*, W. Gurnee, Y. Belinkov, X. Song, M. Sachan, and N. Nanda
  4. ICML 2024
    Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?
    A. Opedal*, A. Stolfo*, H. Shirakami, Y. Jiao, R. Cotterell, B. Schölkopf, A. Saparov, and M. Sachan
  5. EMNLP 2023
    A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
    A. Stolfo, Y. Belinkov, and M. Sachan
  6. ACL 2023
    A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models
    A. Stolfo*, Z. Jin*, K. Shridhar, B. Schölkopf, and M. Sachan