Teun van der Weij

πŸ‘€ About me

I am a Research Scientist at Apollo Research in Zurich. I care about effectively making the world a better place, and therefore I work on making AI systems safer.

I am also a board member at the European Network for AI Safety, which I co-founded. We support AI safety activity throughout Europe.


πŸ’Ό Work experience

Research scientist at Apollo Research

πŸ“… June 2025 – Present | πŸ“ Zurich / London

I predominantly work on evaluating AI systems' capabilities and propensities for scheming, as well as on AI control. I mostly work from Zurich. apolloresearch.ai β†—

AI safety researcher

πŸ“… Jan 2025 – May 2025 | πŸ“ Remote

I worked on research related to AI sandbagging and control. I examined how well monitors can catch both sandbagging and more general sabotage attempts.

Resident at Mantic

πŸ“… Oct 2024 – Present | πŸ“ London, UK & Remote

Mantic's goal is to create an AI superforecaster. I work there as a research scientist / engineer. mntc.ai β†—

Independent research on AI sandbagging

πŸ“… Aug 2024 – Oct 2024 | πŸ“ London, UK & Remote

With a grant from the Frontier Model Forum's AI Safety Fund, I continued the research on strategic underperformance on evaluations (sandbagging) that I started at MATS, together with Francis Rhys Ward and Felix HofstΓ€tter.

Research scholar at MATS

πŸ“… Jan 2024 – Jul 2024 | πŸ“ Berkeley, CA, US; London, UK & Remote

MATS is a program to train AI safety researchers. At MATS, I mostly worked on strategic underperformance on evaluations (sandbagging) of general purpose AI with the mentorship of Francis Rhys Ward. matsprogram.org β†—

Co-founder and co-director at ENAIS

πŸ“… Dec 2022 – Present | πŸ“ Remote

I co-founded the European Network for AI Safety (ENAIS) with the goal of improving coordination and collaboration between AI safety researchers and policymakers in Europe. enais.co β†—

SPAR participant

πŸ“… Feb 2023 – May 2023 | πŸ“ Remote

I participated in the Supervised Program for Alignment Research, organized at UC Berkeley, focusing on evaluating the shutdown problem in language models.

πŸ“„ Research papers

Here is my Google Scholar β†—. Quite a few citations are missing there, so you can also look at Semantic Scholar β†— (which misses a different set of citations).

The Elicitation Game: Evaluating Capability Elicitation Techniques (2024)

πŸ‘₯ Authors: HofstΓ€tter, F., van der Weij, T., Teoh, J., Bartsch, H., & Ward, F. R.
πŸ›οΈ Venue: Workshop on Socially Responsible Language Modeling Research
πŸ”— Link: arXiv:2502.02180
We conducted empirical work to inform evaluators on how best to elicit potentially hidden capabilities from AI systems, and what type of model access they would need.

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models (2024)

πŸ‘₯ Authors: Tice, C., Kreer, P. A., Helm-Burger, N., Shahani, P. S., Ryzhenkov, F., Haimes, J., HofstΓ€tter, F., van der Weij, T.
πŸ›οΈ Venue: Workshop on Socially Responsible Language Modeling Research
πŸ”— Link: arXiv:2412.01784
I supervised this paper. Adding noise is a very interesting idea, and further experiments are being conducted to see whether it can be used to robustly and accurately detect sandbagging.

AI Sandbagging: Language Models can Strategically Underperform on Evaluations (2024) ⭐

πŸ‘₯ Authors: van der Weij, T., HofstΓ€tter, F., Jaffe, O., Brown, S. F., & Ward, F. R.
πŸ›οΈ Venue: ICLR 2025
πŸ”— Link: arXiv:2406.07358
I am most proud of this paper, and I think it's my most impactful work so far. It's great to see our work being used in both technical and governance contexts, and also inspiring the creation of teams at prominent AI safety organizations.

Extending Activation Steering to Broad Skills and Multiple Behaviours (2024)

πŸ‘₯ Authors: van der Weij, T., Poesio, M., & Schoots, N.
πŸ”— Link: arXiv:2403.05767
This paper was very helpful in improving my technical skills, both in conducting experiments and in understanding transformers. The paper contains some interesting ideas, but its impact is limited.

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios (2023)

πŸ‘₯ Authors: van der Weij, T., Lermen, S., & Lang, L.
πŸ›οΈ Venue: Safe & Trusted AI, ICLP 2023
πŸ”— Link: arXiv PDF
My first project in AI safety. In some small experiments, we showed that GPT-4 can reason correctly about avoiding shutdown in certain scenarios, and actually does so in some cases. GPT-3.5 was substantially worse at reasoning about what to do and why.

Runtime Prediction of Filter Unsupervised Feature Selection Methods (2022)

πŸ‘₯ Authors: van der Weij, T., Soancatl Aguilar, V., & Solorio-FernΓ‘ndez, S.
πŸ›οΈ Venue: Research in Computing Science, 150(8), 138-150
πŸ”— Link: University of Groningen

✍️ Essays

I have written some essays; here is a list.

  • How to mitigate sandbagging I outline when sandbagging is especially problematic based differences regarding three factors: fine-tuning access, data quality, and scorability. I also describe various sandbagging mitigations, so it's a good place to get project ideas. Read on the Alignment Forum β†—

  • An introduction to AI sandbagging I describe in more detail what AI sandbagging is. I provide six examples, and I take my time to define terms. This essay is a good place to understand what AI sandbagging is! Read on the Alignment Forum β†—

  • Simple distribution approximation What happens if you independently sample a language model 100 times with the task of 80% of those outputs being A, and the remaining 20% of outputs being B? Can it do this? Read on the Alignment Forum β†—

  • Beyond humans: why all sentient beings matter in existential risk I do not only think about empirical machine learning, I like philosophy too! For this essay, I noticed that definitions about existential risk typically only include humans. I think this should be extended to include all sentient beings (of course humans are very important too). Read on the EA Forum β†—

πŸŽ“ Education

MSc at Utrecht University

πŸ“… Sep 2022 – Nov 2024 | πŸ“ Utrecht, Netherlands

Coursework included Advanced Machine Learning, Natural Language Processing, Human-centered Machine Learning, Pattern Recognition & Deep Learning, and Philosophy of AI.

πŸ“Š Grade: 8.2/10

BSc at University College Groningen

πŸ“… 2018 – 2021 | πŸ“ Groningen, Netherlands

For this programme I had the freedom to choose courses in any discipline, and I used that freedom. Most of my courses, however, were in AI, philosophy, and psychology. I think very highly of this programme and would definitely recommend it to people choosing a bachelor's.

πŸ“Š Grade: summa cum laude (with highest distinction).

High school diploma from Het Nieuwe Eemland

πŸ“… 2012 – 2018 | πŸ“ Amersfoort, Netherlands

πŸ“Š Grade: cum laude (with distinction).

πŸ”Έ Activity highlights

Moderator of a Q&A

I moderated a Q&A event with (ex-)OpenAI and Alignment Research Center researchers (Jeff Wu, Jacob Hilton, and Daniel Kokotajlo). We had over 1,800 people attending the event on existential risks posed by AI.

Presentations

I have presented at various events on AI safety and related topics. Topics include AI sandbagging, how to contribute to AI safety without doing technical research, and more.

If you want me to present at your event, feel free to reach out. Depending on the event, I may charge a fee for the presentation, but I am happy to discuss this.

Field-building

My work at ENAIS is the best example of helping to support the field, but I have also helped organize events like the Dutch AI Safety Retreat.

🌐 Outside of work

I enjoy listening to music, so I go to concerts and festivals quite regularly. I listen to many genres, but my current two favorites are reggae and trance.

I like travelling too, so I try to visit new places when I can. Some favorites are the Nordics, Australia, and Zimbabwe.

Nature is nice too; I mostly enjoy running, hiking, and snowboarding. Most recently I have gotten into splitboarding: you take a snowboard, split it in two to make skis, put skins underneath, and walk up a mountain. Then you can snowboard down again in beautiful places and, hopefully, great snow!

πŸ“§ Contact

πŸ“§ Email: mailvan{first name}@{google's email}