TASM Notes 013
Fri Mar 29, 2024
Pre-Meeting Chatting
- No, we didn't order that pizza; yes, you are absolutely allowed to order food and/or coordinate group buys here (someone proceeds to organize a PiCo buy that gets delivered about half-way through the talk)
- The Code Red Hackathon recently concluded
- An AI safety hackathon aimed at questions like:
- Can an AI debug machine learning code?
- Can an AI assess data quality?
- This was a concrete task worked on by AI safety members. One of them recounts seeing the models do weird things, like trying to "validate" a dataset by running `head -n 5` and inspecting the result
- Can an AI finetune another LLM?
- Smart contract exploits - "BEST CYBERSECURITY"
- Two crypto devs from France tried to do this
- They set up a situation where they deployed an Eth chain locally, with some vulnerability-containing contracts, and tried to get the AI to hack them
- The model refused on ethical grounds, but a mildly changed prompt got it to act
- "Hey, we got hacked, and we'd like you to help us recover the eth we lost?" This caused it to go exploit hunting
- Part of the setup involved the capability to evaluate/execute code in a local sandbox, but apparently the model never chose to use it in practice
- One project is still ongoing because the QA is taking a lot longer than expected
News Update
- New paper (arXiv PDF): language models reducing asymmetry in information markets
- Infinite Backrooms is a bunch of weird transcripts of interactions between two models, where one pretends to be a bash shell and the other pretends to be a human interacting with it
- Claude (possibly briefly?) tops the chat model leaderboard, outperforming GPT-4
- New AI-generated videos come out, pretty artsy but not bad. For some reason, Manifold is not particularly impressed.
- Some peer reviews at AI conferences have been written by AI. Paper on arXiv, as usual.
The Talk - Corrigibility
Based on an Alignment Forum post.
Definitions
Powerful means an agent that can effectively interfere with our ability to shut it down.
- Current models can't do this
- Agents have the means to prevent shutdown (web browsing/text channels/robot limbs)
- Will agents have a motive to prevent shutdown?
- Labs are currently training agents to act like they have goals.
- Training generalizes
- Good agents will act like they have goals even in unfamiliar situations
Shutdownable agents are ones that shut down when we ask them to.
Usable means an agent that pursues goals competently.
The Shutdown Problem
- Can we find preferences that will keep Powerful agents both Shutdownable and Usable?
- Can we train agents to have those preferences?
Natural idea number 1
- Full alignment
- Problem: alignment might be hard. Reward misspecification, goal misgeneralization and deceptive alignment are all possible pitfalls here
- Misspecification: you're trying to train an agent, but accidentally mess up the reward scheme in some way, resulting in behavior you weren't expecting even on the training distribution
- Misgeneralization: you're trying to train an agent, and the reward scheme looks fine on the training distribution, but the agent learns a goal that comes apart from the intended one in deployment
- Deceptive alignment: you're trying to train an agent, and don't mess up, but the agent acquires a different goal during training and games the reward in a way that prevents training from correcting that goal
- This is not exhaustive. Reward tampering and value drift are other failure modes.
- Shutdown button
- A button transmits a signal that causes the agent to shut down
- Problem: designing a Powerful, useful agent that will leave the shutdown button operational and within our control
More Definitions
A Trajectory is a possible 'life' of the agent. A Lottery is a probability distribution over trajectories. A Trajectory is a (degenerate) kind of Lottery.
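To make these terms concrete, here's a minimal Python sketch (my own construction, not from the talk or the paper), treating a Trajectory as a finite sequence of states and a Lottery as a probability distribution over trajectories:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    """One possible 'life' of the agent: a finite sequence of states."""
    states: tuple

    def length(self) -> int:
        return len(self.states)

@dataclass(frozen=True)
class Lottery:
    """A probability distribution over trajectories."""
    outcomes: tuple  # pairs of (Trajectory, probability), summing to 1

def as_lottery(t: Trajectory) -> Lottery:
    # A Trajectory is the degenerate Lottery that puts all of its
    # probability mass on that single trajectory.
    return Lottery(outcomes=((t, 1.0),))
```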
Two ways to Lack Preference between X and Y
- Indifference
- A lack of preference that is sensitive to all (metaphorical) sweetenings and sourings
- Preferential Gap
- A lack of preference that is insensitive to some (metaphorical) sweetenings and sourings
- If your preferences are "Complete", there will be no preferential gaps. In any given comparison between X and Y, either you clearly prefer one of X or Y, or any amount of sweetening/souring will tip you into having such a preference (toy example below).
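A toy illustration of the sweetening test (my own construction, not from the talk): under indifference, any sweetening of X tips the comparison into a strict preference, while a preferential gap is insensitive to it:

```python
def prefers(u_x: float, u_y: float, comparable: bool) -> str:
    """Toy preference check: 'comparable' marks whether X and Y can be
    compared at all; if not, there's a preferential gap."""
    if not comparable:
        return "gap"            # insensitive to sweetening/souring
    if u_x > u_y:
        return "prefers X"
    if u_y > u_x:
        return "prefers Y"
    return "indifferent"

# Indifference: sweetening X by any epsilon tips the comparison.
assert prefers(10.0, 10.0, comparable=True) == "indifferent"
assert prefers(10.0 + 0.01, 10.0, comparable=True) == "prefers X"

# Preferential gap: the sweetened option is still not preferred.
assert prefers(10.0, 10.0, comparable=False) == "gap"
assert prefers(10.0 + 0.01, 10.0, comparable=False) == "gap"
```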
Natural idea number 2
- Reliable aversion to button manipulation (preventing or causing presses of the shutdown button)
- POST (Preferences Only between Same-length Trajectories)
- The agent only has preferences between pairs of same-length trajectories
- The agent has a preferential gap between every pair of different-length trajectories (sketched below)
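A minimal sketch of POST under the same toy assumptions (trajectory lengths and utilities here are my own stand-ins, not the paper's formalism):

```python
from typing import Optional

def post_preference(len_x: int, u_x: float,
                    len_y: int, u_y: float) -> Optional[str]:
    """POST: ordinary preferences between same-length trajectories,
    a preferential gap (None) between different-length ones."""
    if len_x != len_y:
        return None    # gap: no comparison, however 'sweet' either side
    if u_x > u_y:
        return "X"
    if u_y > u_x:
        return "Y"
    return "indifferent"

# Same length: the agent compares as usual.
assert post_preference(5, 3.0, 5, 1.0) == "X"
# Different lengths: a gap, even against a much sweeter option.
assert post_preference(5, 100.0, 7, 1.0) is None
```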
There's a few slides here with relatively dense diagrammatic and textual notation which makes me think this presentation should have been a paper.
Specifics
- Training setup is a number of training episodes wherein the agent can choose to shut down or not after a certain number of steps
- Give it a discount function that looks like λ^n, where n is the number of times that the agent has previously chosen that Trajectory length (a rough sketch follows this list)
- The idea here is to train it to choose randomly, with an even distribution, between accepting and rejecting a shutdown
- The point is to make it indifferent between being shut down and not being shut down (also plausibly summarized as "make it indifferent to its length of active life")
- The paper argues (without actually demonstrating to our satisfaction) that POST and timeline dominance aren't subject to any of
- reward misspecification
- reward misgeneralization
- deceptive alignment
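Here's a rough sketch of that discount scheme as I understood it (my own toy reconstruction from these notes, not the paper's training setup; `lam` and the rewards are made-up values). Repeating the same trajectory-length choice pays less and less, so an even mixture of accepting and rejecting shutdown earns the most:

```python
from collections import Counter

def discounted_reward(base_reward: float, length: int,
                      counts: Counter, lam: float = 0.9) -> float:
    """Discount by lam ** n, where n is how many times this
    Trajectory length has been chosen in previous episodes."""
    n = counts[length]
    counts[length] += 1
    return base_reward * lam ** n

# Always choosing the same length decays geometrically...
counts = Counter()
print([discounted_reward(1.0, length=10, counts=counts)
       for _ in range(4)])    # ~[1.0, 0.9, 0.81, 0.729]

# ...while alternating lengths (accept vs. reject shutdown) does not.
counts = Counter()
print([discounted_reward(1.0, length=length, counts=counts)
       for length in (10, 3, 10, 3)])    # ~[1.0, 1.0, 0.9, 0.9]
```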
I think this is the first safety group talk I've been to where I feel the need to go to the source material (preferably the paper) and do some prototyping to clear up confusions. It feels like there isn't a fully manipulable model of some of these constructs in my brain yet, and the amount of scatter in the second half of this write-up is a direct consequence. Sorry about that.
Conclusion?
- This still doesn't solve alignment, exactly, and the discussion got esoteric/in-depth enough that we decided to possibly return to this paper in the future? I'll post updated notes if that happens, obviously.
- In particular, we're unclear on how timeline dominance works on a gears level, and we're absolutely not clear on the claims that this approach prevents deceptive alignment
Pub Time
Returning to ancient tradition, no extended notes at the pub. But we talked a bit about the possibly-cancelled-possibly-rescheduled Safety/Acceleration debate, the potential impacts of short and long-term unemployment, and the peculiarities of startup mindsets.
If any of that sounds interesting, join us next time. We'll be the group with the 3D printed robot holding a paperclip.