CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email ([email protected]) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.
Yeah, so a key part of the proposal here is that the verifier needs to know the seed. This seems pretty doable in practice.
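As a minimal sketch of that pattern (hypothetical names and parameters, not the actual proposal): both sides derive the random choices deterministically from the shared seed, so the verifier can recompute them and check that the claimed choices match.

```python
import random

# Minimal sketch (hypothetical names/parameters, not the actual proposal):
# both parties derive the "random" choices deterministically from a shared
# seed, so the verifier can recompute them and check the claimed choices.

def derive_choices(seed: int, n_items: int, n_picks: int) -> list[int]:
    """Deterministically derive which items get picked from the seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), n_picks))

# Generator side: uses the seed to make its picks.
seed = 1234
claimed_picks = derive_choices(seed, n_items=1000, n_picks=10)

# Verifier side: knowing the seed, recompute the picks and check they match.
assert derive_choices(seed, n_items=1000, n_picks=10) == claimed_picks
```

The only operational requirement is that the seed actually gets shared with the verifier, which is the part that seems doable in practice.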
I was trying to note that the answers are bounded above too, and in this particular case we can infer that at least a quarter of Americans have insane takes here. (Though the math I did was totally wrong.)
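To spell out the kind of bound the "bounded above" point gives (just a sketch; the $\mu$ below is an illustrative mean, not the actual survey figure): for answers $X$ taking values in $[0, 100]$ with mean $\mu$, and any threshold $a$,

$$\mu = E[X] \le a \cdot P(X \le a) + 100 \cdot P(X > a) \le a + (100 - a)\,P(X > a),$$

so

$$P(X > a) \ge \frac{\mu - a}{100 - a}.$$

For instance, a reported mean of $\mu = 25$ with threshold $a = 10$ forces $P(X > 10) \ge 15/90 \approx 17\%$, i.e. at least a sixth of respondents answered above 10%.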
Sorry, you're totally right.
I feel like reporting the median is much simpler than these other proposals, and is probably what should be adopted.
I would note that by the Markov inequality, at least 25% of Americans must think that foreign aid is more than 25% of the budget in order to get the average response we see here. So I think it's reasonable to use the reported mean to conclude that at least a sizable minority of Americans are very confused here.
I'm not aware of anything people had written before this post on what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.
I'm a little sad that there hasn't been much research following up on this. I'd like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AIs behaving badly, and research on few-shot catastrophe detection techniques.
The point I made in this post still seems very important to me, and I continue to think it was underrated at the time I wrote it. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're trying to mitigate risk from internal deployment of possibly-misaligned AI agents.
The systems architecture that I described here is still my best guess as to how agents will work at the point where AIs are very powerful.
Since I wrote this post, agent scaffolds have come to be used much more in practice. The infrastructure I described here is a good description of cloud-based agents, but it isn't the design used by agents that you run on your own computer, like Claude Code or Gemini CLI. I think agents will move in the direction I described, especially as people want to be able to work with more of them, want to give them longer tasks, and want them to be able to use their own virtual machines for programming so they don't step on each other's toes all the time.
The terminology I introduced here is widely used by the people I know who think about insider threat from AI agents, but as far as I know it hasn't spread very far outside my cluster.
I think the points made in this post are very important and I reference them constantly. I am proud of it and I think it was good that we wrote it.
I'd tell MATS via this form, then not worry about it further.