Schedule

April 28th, 2025

Please submit questions for our panelists here.

Time Event
8:30 AM Opening
8:35 AM Invited Speaker (Online): Baharan Mirzasoleiman
9:20 AM Invited Speaker: Pavel Izmailov
10:00 AM Break / Poster Session / Informal Networking
10:30 AM Invited Speaker: Chelsea Finn
11:15 AM Invited Speaker: Stefano Sarao Mannelli
12:00 PM Break / Poster Session / Informal Networking
1:30 PM Invited Speaker: Aditi Raghunathan
2:15 PM Invited Speaker: Andrew Lampinen
3:00 PM Break / Poster Session / Informal Networking
3:30 PM Invited Speaker: David Lopez-Paz
4:15 PM Oral Presentations — Announcing Best Papers
5:10 PM Panel

Talk Details

Pavel Izmailov

Title: Weak-to-strong generalization

Abstract: Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future models will behave in complex ways too difficult for humans to reliably evaluate. Humans will only be able to weakly supervise superhuman models. We would then need the models to correctly generalize from this weak and limited supervision, avoiding shortcut solutions. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
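As a rough illustration of the auxiliary confidence loss mentioned in the abstract, the sketch below mixes a cross-entropy term against the weak supervisor's soft labels with a term that reinforces the strong model's own hardened predictions. This is a minimal sketch assuming a standard classification setup; the function name, the fixed mixing weight alpha, and the PyTorch framing are illustrative assumptions, not the speakers' exact implementation.

    import torch
    import torch.nn.functional as F

    def weak_to_strong_confidence_loss(strong_logits, weak_soft_labels, alpha=0.75):
        """Sketch of an auxiliary confidence loss for weak-to-strong finetuning.

        Mixes imitation of the weak supervisor's soft labels with a term that
        reinforces the strong model's own (hardened) predictions, so the strong
        student is not forced to reproduce the weak supervisor's errors.
        """
        log_probs = F.log_softmax(strong_logits, dim=-1)

        # Term 1: cross-entropy against the weak model's soft labels.
        weak_ce = -(weak_soft_labels * log_probs).sum(dim=-1)

        # Term 2: cross-entropy against the strong model's own hardened predictions.
        hard_self_labels = strong_logits.argmax(dim=-1)
        self_ce = F.cross_entropy(strong_logits, hard_self_labels, reduction="none")

        return ((1 - alpha) * weak_ce + alpha * self_ce).mean()

    # Illustrative usage with random tensors standing in for a real batch.
    logits = torch.randn(8, 2)                       # strong model outputs
    weak = torch.softmax(torch.randn(8, 2), dim=-1)  # weak-model soft labels
    loss = weak_to_strong_confidence_loss(logits, weak)

In practice the self-confidence weight is often ramped up over training so the student first follows the weak labels and only later leans on its own predictions; the fixed alpha here is only for brevity.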

Andrew Lampinen

Title: The real shortcuts were the representations we learned along the way

Abstract: In this talk, I will outline how feature complexity shapes the dynamics of representation learning and generalization in the presence of spurious features. First, I will briefly show a few examples of simplicity biases, and how they relate to model representations. I will then discuss how simplicity biases persist in learned representations even after more complex features have been learned, and how this can affect downstream learning (and interpretability). However, understanding these biases also gives us tools to understand and improve model generalization. I will illustrate this by showing how adding explanatory signals can shape model generalization when features are confounded. Finally, I will close with two examples that take inspiration from cognitive science to illustrate that shortcuts are not strictly harmful. The first is a case study of logic problems where humans and language models both use shortcuts in ways that can be rationalized. The second is an example where shortcut features actually improve generalization, even to distributions where the shortcuts no longer work, by changing the way models represent the input. I hope that this talk will convince you to take a representational perspective on simplicity, shortcuts, and spurious correlations.

Aditi Raghunathan

Title: Catastrophic overtraining of language models

Abstract: Modern language models follow a two-stage paradigm: extensive pre-training on uncurated web text, followed by targeted post-training (e.g., instruction tuning) to specialize their skills. Conventional wisdom holds that scaling up pre-training data enriches the learned representations and improves the robustness of post-training. Our recent findings overturn this assumption: beyond a critical token threshold, additional data drives models into catastrophic overtraining—a regime in which more pre-training can actually degrade downstream performance. This talk will unpack the mechanisms behind this counterintuitive phenomenon and outline mitigation strategies—showing why “just add more pre-training data” is no longer a sure route to more robust models.