Alessandro Stolfo

OAT Y 24

Andreasstrasse 5

8050 Zürich, Switzerland

Hi! I am a doctoral student at the Institute for Machine Learning at ETH Zürich, where I am advised by Prof. Mrinmaya Sachan and co-advised by Prof. Yonatan Belinkov (Technion).

My research focuses on the interpretability and reliability of (large) language models. I study how models represent information and how those representations shape behavior and errors. I am excited to leverage these insights to design methods that make language models better and safer.

In summer 2024, I interned with the AI Frontiers group at Microsoft Research in Redmond, WA, where I had the opportunity to collaborate with Besmira Nushi and Eric Horvitz. Previously, in summer 2023, I interned with the Machine Learning Research Group at Oracle Labs in Burlington, MA, working with Ari Kobren.

Before starting my doctoral studies, I obtained a Master’s degree in Data Science from ETH Zürich and worked as a software engineer at Rethink-Resource. I completed my undergraduate studies in Computer Engineering at Politecnico di Milano.

I am grateful to be a recipient of the CYD Doctoral Fellowship.

For ETH students: Feel free to reach out via email if you’re interested in having me supervise your MSc thesis or semester project. I welcome project proposals, but even if you don’t have concrete ideas and are simply passionate about leveraging interpretability to improve models, please don’t hesitate to contact me. I typically allocate my supervision capacity 4-6 weeks before the semester starts, so that’s the best time to reach out.

news

Jul 24, 2025 Gave a talk at NEC Labs EU about our recent work on LLM steering. Check it out on YouTube.
May 28, 2024 Interning in the AI Frontiers group at Microsoft Research in Redmond, WA.
Nov 22, 2023 Attending the ML Alignment & Theory Scholars (MATS) Program, mentored by Neel Nanda.
Jul 17, 2023 Started my internship in the ML Research Group at Oracle Labs in Burlington, MA.
Apr 22, 2022 Answered a couple of questions for this EPFL News article. Check it out!

selected publications

  1. NeurIPS 2025
    Dense SAE Latents Are Features, Not Bugs
    X. Sun*, A. Stolfo*, J. Engels, B. Wu, S. Rajamanoharan, M. Sachan, and M. Tegmark
  2. ICLR 2025
    Improving Instruction-Following in Language Models through Activation Steering
    A. Stolfo, V. Balachandran, S. Yousefi, E. Horvitz, and B. Nushi
  3. NeurIPS 2024
    Confidence Regulation Neurons in Language Models
    A. Stolfo*, B. Wu*, W. Gurnee, Y. Belinkov, X. Song, M. Sachan, and N. Nanda
  4. ICML 2024
    Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?
    A. Opedal*, A. Stolfo*, H. Shirakami, Y. Jiao, R. Cotterell, B. Schölkopf, A. Saparov, and M. Sachan
  5. EMNLP 2023
    A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
    A. Stolfo, Y. Belinkov, and M. Sachan
  6. ACL 2023
    A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models
    A. Stolfo*, Z. Jin*, K. Shridhar, B. Schölkopf, and M. Sachan