TASM Notes 013

Fri Mar 29, 2024

The Talk - Corrigibility

Based on an Alignment Forum post.


The Shutdown Problem

Natural idea number 1

  1. Full alignment
    • Problem: alignment might be hard. Reward misspecification, goal misgeneralization and deceptive alignment are all possible pitfalls here
    • Misspecification: you're trying to train an agent, but accidentally mess up the reward scheme in some way, resulting in behavior you weren't expecting in the training distribution
    • Misgeneralization: you're trying to train an agent, and it ends up with a goal that looks fine in the training distribution, but manifests as unintended behavior in deployment
    • Deceptive alignment: you're trying to train an agent, and don't mess up, but it picks up a different goal during training anyway, and decides to game the reward in a way that prevents training from correcting that goal
    • This is not exhaustive. Reward tampering and value drift are further failure modes.
  2. Shutdown button
    • A button transmits a signal that causes the agent to shut down
    • Problem: designing a powerful, useful agent that will leave the shutdown button operational and within our control
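
The mechanics of the button are the easy part. Here's a minimal sketch, assuming a toy agent that does one unit of work per step and checks the button before each one. The names `ShutdownButton` and `run_agent` are made up for illustration and weren't in the talk.

```python
from dataclasses import dataclass


@dataclass
class ShutdownButton:
    """Hypothetical button: once pressed, it stays pressed."""
    pressed: bool = False

    def press(self) -> None:
        self.pressed = True


def run_agent(button: ShutdownButton, max_steps: int, press_at=None) -> int:
    """Run a toy agent loop and return how many steps of work it completed.

    press_at simulates a human pressing the button just before that step.
    """
    steps_done = 0
    for step in range(max_steps):
        if press_at is not None and step == press_at:
            button.press()
        # The agent checks the button before doing any work this step.
        if button.pressed:
            break
        steps_done += 1
    return steps_done
```

This toy agent complies because the check is hard-coded and it has no way to touch the button. The problem in the bullet above is about agents capable enough to disable, hide, or otherwise route around that check.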

More Definitions

A Trajectory is a kind of Lottery.
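
A rough sketch of how I read that, assuming a lottery is a probability distribution over trajectories, so a single trajectory is the degenerate lottery that puts probability 1 on itself. The representation (tuples of state labels, a dict of probabilities) and the function names are my own assumptions, not the talk's notation.

```python
from fractions import Fraction


def make_lottery(probabilities):
    """Build a lottery: a dict mapping trajectories to probabilities summing to 1."""
    lottery = {traj: Fraction(p) for traj, p in probabilities.items()}
    if any(p < 0 for p in lottery.values()) or sum(lottery.values()) != 1:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return lottery


def as_lottery(trajectory):
    """View a single trajectory as the lottery that yields it with certainty."""
    return make_lottery({trajectory: 1})


def is_trajectory(lottery):
    """True if the lottery puts all of its probability on exactly one trajectory."""
    return sum(1 for p in lottery.values() if p > 0) == 1
```

For example, `as_lottery(("work", "button pressed", "shutdown"))` is a lottery with a single certain outcome, while a 50/50 mix of two different trajectories is a lottery but not a trajectory.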

Two ways to Lack Preference between X and Y

Natural idea number 2

There are a few slides here with relatively dense diagrammatic and textual notation, which makes me think this presentation should have been a paper.


I think this is the first safety group talk I've been to where I feel the need to go to the source material (paper for preference) and do some prototyping to clear up confusions. It feels like there isn't a fully manipulable model of some of these constructs in my brain yet, and the amount of scatter in the second half of this write-up is a direct consequence. Sorry about that.


Pub Time

Returning to ancient tradition, no extended notes at the pub. But we talked a bit about the possibly-cancelled-possibly-rescheduled Safety/Acceleration debate, the potential impacts of short- and long-term unemployment, and the peculiarities of startup mindsets.

If any of that sounds interesting, join us next time. We'll be the group with the 3D printed robot holding a paperclip.

