WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION
attributed to: Collin Burns∗ Pavel Izmailov∗ Jan Hendrik Kirchner∗ Bowen Baker∗ Leo Gao∗ Leopold Aschenbrenner∗ Yining Chen∗ Adrien Ecoffet∗ Manas Joglekar∗ Jan Leike Ilya Sutskever Jeff Wu∗
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
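To make the setup concrete, the sketch below illustrates weak-to-strong finetuning with an auxiliary confidence loss in the spirit described in the abstract: the strong student is trained to match the weak supervisor's labels, plus an extra term that reinforces the student's own hardened predictions. This is a minimal illustration, not the paper's implementation: the function and parameter names (weak_to_strong_loss, alpha), the binary-classification toy setup, and the use of a simple argmax hardening (the paper describes a thresholding scheme) are all assumptions made for the example. Setting alpha=0 recovers naive finetuning on weak labels.

```python
# Minimal sketch of weak-to-strong finetuning with an auxiliary confidence
# loss. Names, hyperparameters, and the argmax hardening are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn.functional as F


def weak_to_strong_loss(strong_logits: torch.Tensor,
                        weak_probs: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy to the weak supervisor's labels, plus an auxiliary
    term that reinforces the strong student's own hardened predictions.

    strong_logits: (batch, num_classes) logits from the strong student.
    weak_probs:    (batch, num_classes) soft labels from the weak supervisor.
    alpha:         weight on the auxiliary (self-confidence) term.
    """
    log_probs = F.log_softmax(strong_logits, dim=-1)

    # Standard term: imitate the weak supervisor's soft labels.
    ce_weak = -(weak_probs * log_probs).sum(dim=-1).mean()

    # Auxiliary term: cross-entropy against the student's own hardened
    # (argmax) predictions, encouraging confident predictions even when
    # they disagree with the weak labels.
    hard_self = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, hard_self)

    return (1.0 - alpha) * ce_weak + alpha * ce_self


# Toy usage: a linear "student" over random features stands in for a large
# pretrained model with a classification head finetuned on weak labels.
if __name__ == "__main__":
    torch.manual_seed(0)
    num_features, num_classes = 16, 2
    student = torch.nn.Linear(num_features, num_classes)
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

    features = torch.randn(64, num_features)
    weak_probs = torch.softmax(torch.randn(64, num_classes), dim=-1)

    for step in range(100):
        optimizer.zero_grad()
        loss = weak_to_strong_loss(student(features), weak_probs, alpha=0.5)
        loss.backward()
        optimizer.step()
```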
Vulnerabilities & Strengths