CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email ([email protected]) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.
Yeah, so a key part of the proposal here is that the verifier needs to know the seed. This seems pretty doable in practice.
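As a minimal sketch of that pattern (hypothetical names and parameters, not the actual proposal): both sides derive the random choices deterministically from the shared seed, so the verifier can recompute them and check that the claimed choices match.

```python
import random

# Minimal sketch (hypothetical names/parameters, not the actual proposal):
# both parties derive the "random" choices deterministically from a shared
# seed, so the verifier can recompute them and check the claimed choices.

def derive_choices(seed: int, n_items: int, n_picks: int) -> list[int]:
    """Deterministically derive which items get picked from the seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), n_picks))

# Generator side: uses the seed to make its picks.
seed = 1234
claimed_picks = derive_choices(seed, n_items=1000, n_picks=10)

# Verifier side: knowing the seed, recompute the picks and check they match.
assert derive_choices(seed, n_items=1000, n_picks=10) == claimed_picks
```

The only operational requirement is that the seed actually gets shared with the verifier, which is the part that seems doable in practice.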
I was trying to note that the answers are bounded above too, and in this particular case we can infer that at least a quarter of Americans have insane takes here. (Though the math I did was totally wrong.)
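To spell out the kind of bound the "bounded above" point gives (just a sketch; the $\mu$ below is an illustrative mean, not the actual survey figure): for answers $X$ taking values in $[0, 100]$ with mean $\mu$, and any threshold $a$,

$$\mu = E[X] \le a \cdot P(X \le a) + 100 \cdot P(X > a) \le a + (100 - a)\,P(X > a),$$

so

$$P(X > a) \ge \frac{\mu - a}{100 - a}.$$

For instance, a reported mean of $\mu = 25$ with threshold $a = 10$ forces $P(X > 10) \ge 15/90 \approx 17\%$, i.e. at least a sixth of respondents answered above 10%.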
Sorry, you're totally right.
I feel like reporting the median is much simpler than these other proposals, and is probably what should be adopted.
I would note that by the Markov inequality, at least 25% of Americans must think that foreign aid is more than 25% of the budget in order to get the average response we see here. So I think it's reasonable to use the reported mean to conclude that at least a sizable minority of Americans are very confused here.
I'm not aware of anything people had written before this post on what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.
I'm a little sad that there hasn't been much research following up on this. I'd like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AIs behaving badly, and research on few-shot catastrophe detection techniques.
The point I made in this post still seems very important to me, and I continue to think it was underrated at the time I wrote it. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're trying to mitigate risk from internal deployment of possibly-misaligned AI agents.
The systems architecture that I described here is still my best guess as to how agents will work at the point where AIs are very powerful.
Since I wrote this post, agent scaffolds have come to be used much more in practice. The infrastructure I described here is a good description of cloud-based agents, but it isn't the design used by agents that you run on your own computer, like Claude Code or Gemini CLI. I think agents will move in the direction I described, especially as people want to be able to work with more of them, want to give them longer tasks, and want them to be able to use their own virtual machines for programming so they don't step on each other's toes all the time.
The terminology I introduced here is widely used by the people I know who think about insider threat from AI agents, but as far as I know it hasn't spread very far outside my cluster.
I think the points made in this post are very important and I reference them constantly. I am proud of it and I think it was good that we wrote it.
I'd tell MATS via this form, then not worry about it further.