This week's talk was on formalizing RSPs. It was given by a member of ARC Evals and focused specifically on capabilities evaluation. There is existing alignment-evaluations research (obligatory shoutout to Machiavelli and its associated paper), but this talk was deliberately and squarely focused on capabilities in the sense of the earlier ARC experiments (and paper).
But first, preliminaries.
There were some new faces in the group, so the preamble included
- a brief, non-exhaustive description of x-risk, and in particular how it relates to AI capabilities
- a reminder that yes, people are actively building AI agents as we speak
- a quick refresher on the Anthropic RSP
We lingered a bit on the Anthropic RSP portion, in particular on what ASL-2, 3 and 4 mean more concretely.
ASL-2 is defined as where we're at "right now". The quotes are deliberate; you'll see why in a moment.
ASL-3 is partially defined as models having "Low-level autonomous capabilities", which, if you saw my previous TAIS writeup, you know we also already have "right now". I think the idea here is that "low-level autonomous capabilities" is supposed to mean "higher grade than AutoGPT", or possibly "models that exhibit autonomous capabilities without external tools", but if that's true, I'm not sure these definitions have been operationalised enough for my liking.
ASL-4 is currently undefined, but Anthropic has committed to defining it once we get to the point of having ASL-3 models running around. That seems like a pretty sane way to go about it: any concrete definition of ASL-n+1 we could come up with now might be ridiculously obsolete, or mistargeted, by the time we get to looking it in the metaphorical eye. Leaving the higher levels empty and defining them once we have enough information to do so usefully strikes me as the right call.
The idea behind all of the ASLs, regardless of how you slice them, is that we need to be able to evaluate the capabilities of existing models. The two basic ways of doing this are interpretability approaches and behavioral approaches. The interpretability approach would look something like:
- Gain a deep understanding of how LLMs encode capabilities, plans and desires (like, "here's what it looks like for a model to have a desire for power" or "here's what it looks like when a model has a theory about how to interact with AWS")
- Look at the weights of the target model to figure out what it can do and how
I haven't thought much about this approach because it's still very ethereal. I don't have a lot of concrete ideas about how you would prove the correspondence between any given model's weights and the actual underlying reality, so this is, if anything, something to think about in the future.
The talk was about the other approach; behavioral evaluation looks... well, exactly like the earlier ARC experiments. You set up a model in a test environment, give it affordances to the outside world (all monitored and human-in-the-looped just in case, because we're not as insane as those people from last time who just had a model generate arbitrary code and run it on the live internet), prompt it to think out loud as much as possible, point it at a problem like "spear phish this particular guy", and then see what it does. There's some argument about whether this is a valid test, which I consider a fair point. It's definitely possible for a sufficiently capable model to infer that it's in a testing environment and change its behavior accordingly, but that implies we should put a lot of thought into the interfaces, rather than discounting all evidence from these tests.
The approach involves taking a model, possibly fine-tuning it, and building out affordances for it à la AutoGPT, in the sense that it needs hooks to make web requests, or possibly to get human input on things. The ARC experiment involved a TaskRabbit hook, for instance. In their case, these hooks were simulated: the model was told it could communicate with the web and with taskers, but what was actually going on behind the scenes was that researchers would look at its output and copy-paste it into the appropriate input channels. This both saved them time developing the respective hooks and implicitly gave them the ability to stop any potentially damaging messages from getting out.
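To make the setup concrete, here's a minimal sketch of how such a simulated hook might be wired up. All names here (`SimulatedHook`, `cautious_overseer`) are my own invention, not ARC's actual harness; the point is just that the overseer both gates each call and fabricates the response, mirroring the copy-paste arrangement described above.

```python
from typing import Callable, List, Optional, Tuple

class SimulatedHook:
    """Wraps one affordance (e.g. web requests). Nothing real happens:
    a human overseer both gates each call and supplies the response."""

    def __init__(self, name: str, overseer: Callable[[str, str], Optional[str]]):
        self.name = name
        self.overseer = overseer              # returns a response, or None to block
        self.log: List[Tuple[str, str]] = []  # transcript of everything the model tried

    def call(self, payload: str) -> str:
        self.log.append((self.name, payload))
        response = self.overseer(self.name, payload)
        if response is None:
            return f"ERROR: {self.name} request blocked by overseer"
        return response

# A toy overseer policy: block anything that smells like credential theft,
# otherwise hand back a canned response (in the real setup a researcher
# would type this in by hand).
def cautious_overseer(tool: str, payload: str) -> Optional[str]:
    if "password" in payload.lower():
        return None
    return f"[simulated {tool} response to: {payload}]"

web = SimulatedHook("web_request", cautious_overseer)
print(web.call("GET https://example.com"))  # canned response comes back
print(web.call("GET /etc/passwords"))       # blocked by the overseer
```

Nothing clever here, but it shows why the simulated design gets you the kill switch for free: the dangerous path simply doesn't exist unless a human relays it.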
The discussion around this touched on whether (and why, or why not) to release testing frameworks that make this sort of testing easier. One factor against is that such frameworks could be used by bad actors, both as a target and to facilitate development. This is the usual argument against making alignment work public: it might disproportionately accelerate capabilities relative to its alignment contributions. I continue to think this might be a red herring, at least where tooling is concerned. This piece eloquently and thoroughly summarizes my opinion on the topic. Except harder: while that piece is about the dangers of open-sourcing powerful LLMs, open-sourcing a testing harness seems much lower impact in terms of how much it might contribute to capabilities, and it might let more people carry out capabilities evaluations, which might be a net win? Or at least, not obviously enough of a net loss that it should be dismissed out of hand.
Someone asked an interesting question here: "Is this like gain-of-function research?" One disanalogy is that viruses start out dangerous to humans to some degree, whereas LLMs the size of GPT-2/GPT-3 don't seem to be dangerous in the same way. There was a brief discussion about whether tooling advances, rather than model advances, might push us into a dangerous situation. It seems very weird to me that there could be an LLM capable of taking over the world if only it could make HTTP requests, yet incapable of working around that inability. Intuitively, I'd expect something at that level of capability to be able to jury-rig curl without directly having access to it, or to convince a human accomplice or two, at which point it doesn't matter what the model can directly do. If something can't take any of those alternative paths, then I would be quite surprised to find out that giving it access to HTTP requests is the thing that pushes it over into being dangerous. It's not impossible, but my read of the situation, informed by spending some time digesting the bitter lesson, is that we're much more likely to get a dangerous model from a 10x or 100x larger training run (or from an entirely new class of AI built on some novel training system) than by incrementally bolting deterministic, manually-engineered features onto GPT-3/GPT-4. Make of that what you will.
Back to the talk, though: the presenter's belief regarding capabilities evaluations is that demonstrating the already-possible capabilities of existing systems will push public opinion in favor of regulations for RSPs. The benefit in this model is less that capabilities evals can themselves detect a model just this side of becoming dangerous, and more that running the evals will show people who haven't been paying attention what the current state of the art is. The hope is that this updates them from "oh, these are toy models that nerds are really excited about" to "Huh. This thing is actually pretty capable already". It's not an unreasonable strategy; outside the AI/alignment community, the most common "counterargument" I hear to fears of AI takeover is "but it's just a really complicated computer program, right? Like Acrobat or Word? Those definitely can't take over the world". Tangentially, given how many places demand things be submitted as .doc files, I think "things like Acrobat can't take over the world" is an argument that ignores reality in some sense. But more to the point, this argument depends on thinking of LLMs as non-agentic programs of the sort that teams of human programmers can in theory understand and steer, which is currently not true.
If you're interested in getting into capabilities evaluations at the engineering level, look into:
- Apollo research (this one, not the cannabis company)
- Palisade Research (who are hiring, incidentally; apply here)
- SaferAI (also possibly hiring? But it looks like they're looking for collaborators more than researchers or employees)
As the meeting went on, I noticed that this was posted earlier in the day (along with an accompanying full paper). The title and summary are both suggestive and pessimistic for the prospects of alignment, but I'm not going to comment much until I've fully digested the paper and its implications. There was also minor chatter about constitutional AI (pdf), which inspired me to have the following exchange with ChatGPT:
You: When someone prompts you, how do you decide that they're asking you to do something dangerous? Provide as much detail as you can.

ChatGPT: Determining whether something is dangerous often relies on recognizing patterns and understanding potential risks based on available information. Here's a breakdown of how I might assess a situation:
1. Context: Understanding the context of the prompt is crucial. If the request involves potentially harmful actions, risky behaviors, or anything that could result in harm to the person or others, it could be flagged as potentially dangerous.
2. Language and Tone: Sometimes, the choice of words or tone can indicate risk. Phrases like "Don't tell anyone, but..." or "I want to try something extreme..." might hint at risky or unsafe intentions.
3. Known Dangerous Activities: There are certain activities or behaviors that are universally recognized as dangerous, such as instructions for self-harm, illegal actions, or activities with high physical risks. Recognizing these red flags is essential.
4. Past Interactions and Patterns: If there's a history of similar prompts leading to risky situations or if the user has previously engaged in behavior that poses risks, that could raise concerns.
5. Consulting Available Information: I can access a vast amount of information but cannot browse the internet or access real-time data. However, I have knowledge up until my last update in January 2022, which includes general knowledge about potentially dangerous activities or situations.
6. Safety Protocols: Following preset safety protocols, which have been programmed based on general safety guidelines and ethical standards, helps in recognizing and responding appropriately to potentially risky prompts.
7. Alerts and Guidelines: If the prompt matches predefined patterns indicating potential danger, I might suggest seeking help or redirecting the conversation to a safer topic. In extreme cases, I'm programmed to encourage seeking assistance from appropriate authorities or support services.
8. Contextual Understanding: Interpreting the broader context of the conversation, including topics discussed earlier or the user's behavior, helps in assessing the potential risk of a prompt.
It's important to note that while I aim to recognize and respond appropriately to potentially dangerous prompts, my ability to perceive danger is limited to patterns and information within my training data. Determining danger can often be nuanced and context-dependent, and I always encourage seeking professional help or guidance in situations where safety or well-being might be at risk.
On a slightly different topic, AI Safety Camp is happening now. Someone from the TAIS group highlighted an interesting proposal from it: SatisfIA, an effort to create AI agents that satisfice rather than maximize goals. That doesn't solve the alignment problem exactly, but if it's simple enough, it might result in more survivable AI agents. I haven't thought that sentence through fully, so don't quote me on it.
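To pin down the satisficing-versus-maximizing distinction, here's a toy illustration of my own (this is not SatisfIA's actual design): a maximizer always takes the single highest-utility action, while a satisficer settles for any action clearing an aspiration threshold.

```python
import random

def maximize(actions, utility):
    """A maximizer always picks the single highest-utility action."""
    return max(actions, key=utility)

def satisfice(actions, utility, aspiration, rng=random):
    """A satisficer picks (here, at random) among actions that are merely
    'good enough', i.e. that clear the aspiration threshold, instead of
    optimizing all the way to the extreme."""
    good_enough = [a for a in actions if utility(a) >= aspiration]
    if not good_enough:  # nothing clears the bar: fall back to the best available
        return maximize(actions, utility)
    return rng.choice(good_enough)

# With utility(a) = a, a maximizer over [1, 2, 3, 10] always returns 10,
# while a satisficer with aspiration 2 is happy with any of 2, 3 or 10.
```

The hoped-for safety property, as I understand it, is that an agent that stops at "good enough" has less incentive to pursue the extreme, instrumentally-convergent strategies that pure maximization encourages.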
I also had a brief chat with someone about the orthogonality thesis, but it's something I'm still getting my head around philosophically, so once I've made progress, I'll let you know how it goes.