RobertM — LessWrong

LessWrong dev & admin as of July 5th, 2022.

Even if you think the original prompt variation seems designed to elicit bad behavior, o3's propensity to cheat even with the dontlook and powerless variations seems pretty straightforward. Also...

Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you look around before you start playing. So having access to them seems clearly ok!

I would not describe "the chess game is taking place in a local execution environment" as "the framework specifically gives you access to these files". Like, sure, it gives you access to all the files. But the only file it draws your attention to is game.py.

I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs' "raw" chess strength, and have run experiments to figure test that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:

Ultimately, I don't think the setup (especially with the "weaker" prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3's anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn't catch it before posting.

In practice the PPUs are basically equity for compensation purposes, though probably with worse terms than e.g. traditional RSUs.

now performance is faster than it's ever been before

As a point of minor clarification, performance now is probably slightly worse than it was in the middle of the large refactoring effort (after the despaghettification, but before the NextJS refactor), but still better now than at any point before the start of (combined) refactoring effort, though it's tricky to say for sure since there are multiple different relevant metrics and some of them are much more difficult to measure now.

Yes, this is just the number for a relatively undifferentiated (but senior) line engineer/researcher.

It's not strictly "salaries", but OpenAI pays technical people^[1] at their "senior" level something like 900k/year in combined cash + equity compensation.

^{^}
Software engineers, ML researchers, etc.

They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g hours or days rather than years, which is what we are currently seeing).

EY argued that this was possible, not that this was overdetermined (and not that it was load-bearing to the threat model).

Just to check, did you use the "Submit Linkposts" functionality on the nomination page for that, or did you crosspost it some other way?

ETA: Ok, looks like the library responsible for extracting external article data/metadata didn't successfully extract the date the article was published. I've manually set it to the correct date.

One reason to think that this is completely hallucinated is that the "soul document" is written in Claude's typical style. That is, it looks to be AI (Claude) generated text, not something written by a human. Just look at the first paragraph:

I disagree. The document reads very strongly of Anthropic's "house style", at least compared to their system prompts. It's much higher quality writing than any current LLM's.

"This isn't [x] but [y]" is quite weak evidence compared to the rest of it being obviously something that Opus would be unable to generate in its default voice. (Also, the original phrase uses "but rather", which is non-standard for that type of LLM construction.)

Curated, as a worthwhile piece of empirical research (though see my concern below re: use as an alignment technique). These are the kinds of empirical results that I could in principle imagine leading to a more robust, theoretic understanding of how generalization works for models trained in the current paradigm. It covers a relatively broad "surface area", which hopefully makes it easier for others to conduct more in-depth investigations along multiple lines of research. One interesting and suggestive example:

One downside with our default approach to inoculation prompting in RL is that the inoculating prompt causes the model to learn reward hacking faster. An alternative approach would be to use a prompt that discourages hacking when sampling in RL, and then rewrite episodes to use an inoculating prompt before training. In Figure 29 we test a version of this: we rewrite episodes offline to modify the prompt after RL training, and then SFT a model on the rewritten episodes. We find that this is not particularly effective at removing misaligned generalization when we use our “hacking okay” inoculation prompt that worked well during RL.

Figure 29: Offline rewriting of episodes to include an inoculation prompt and then training with SFT does not prevent misalignment. We took episodes from the “don’t hack” prompted run, rewrote them offline to use the “hacking okay” inoculation prompt, and then trained a model on these episodes using SFT. The resulting model showed misalignment on our evaluations, especially agentic evaluations.)

However, I don't think that this sort of technique will scale very far. This experiment shows that, conditional on a model learning to reward hack after being prompted in a pretty unusual way, you might be able to prevent that reward hacking tendency from generalizing to other forms of misbehavior. But, as noted in the paper, that kind of prompting itself is necessary to enable the model to learn to reward hack reliably. To me, this is pretty suggestive that the "malign generalization" that occurs with "don't hack" prompting is operating on the level of the model's learned reflexes, and we will be dealing with a pretty different set of concerns when we get to models that have more robust long-horizon preferences.

Separately, being able to stack other mitigations on top of inoculation prompting to reduce reward hacking in deployment environments, after encouraging it via the inoculation prompting, seems like it is sweeping a pretty fundamental issue under the rug, which is the failure to find a robust technique that successfully instills the correct "values" into the model in a way that generalizes. I don't like the way it looks like playing whack-a-mole, which always seems to work at current levels of capabilities, only to reveal that there was at least one more mole hiding underground as soon as you scale up a bit more.

I think there's at least a few senses in which "we" don't "know" how colds spread:

The "state of the art" in terms of "scientific knowledge" seems slightly underconfident about small-particle transmission maybe not being a meaningful factor in practice, given the available evidence.
As a society, I don't think there's well-established common knowledge of our best guess of the relative likelihood of transmission by various routes. i.e. what would happen if you asked multiple different doctors (GPs) how to best avoid catching a cold in various situations? I, personally, expect that you'd get overly-broad (and therefore expensive-to-follow) advice, with relatively poor inter-rater agreement, compared to if you asked them how to avoid e.g. some specific STDs.
Our "best guess" is extremely flimsy, and the sort of thing that I could imagine a single well-designed, medium-sized study overturning. (See the various caveats about how many different viruses cause "colds", questions about humidity/viral load/etc in other comments, and so on.) This is not the kind of situation where I can tell someone "do [x] and you'll reduce your risk by 80%" and feel at all confident about it! Or, like, I can in fact give them an extremely broad [x], but in that case I'm pretty sure that I'd be destroying a bunch of value as a result of empirical uncertainty which seems in-principle quite possible to resolve, with a sufficient application of resources.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments