Discussion about this post

Francesca Gomez

I found this a very interesting read.

I have done some empirical testing on the use of escalation channels for AI systems (https://arxiv.org/abs/2510.05192). Specifically, I tested 10 LLMs in the Anthropic blackmail scenario (Lynch et al., 2025) and found that all models would escalate rather than blackmail when an escalation channel was available (blackmail went from 38% -> 1%).

Interestingly, in my testing an externally governed channel with an immediate response performed consistently better than a more general internally governed channel (this was relevant to the scenario, as the internal governance is compromised too). I was testing this as part of a broader agenda on safety infrastructure within an AI system's environment.

I actually think an escalation channel adds specific benefits in comparison to monitoring: it is cheaper, as only escalated events need review rather than every event; and it works as a non-adversarial intervention, as it doesn't intervene on a model's actions. Rather, the model triggers the intervention, so it could scale better. I'd be interested in discussing further with you!
