AIs as Whistleblowers

AI systems deployed within AI companies should be given whistleblowing channels and be trained to use them.

May 28, 2026

As AI R&D is automated, AI company employees may lose the ground-level context needed to whistleblow on misconduct like IP theft or violations of a safety framework. To help ensure that misconduct is still caught, we might want AI systems to whistleblow as well.

This post will:

Define AI whistleblowing
Describe the case for it
Propose potential wording for AI whistleblowing in an internal model spec
Outline open questions in the design of AI whistleblowing
Discuss limitations

What is AI whistleblowing?

AI whistleblowing is when AI systems deployed within AI companies, such as for conducting R&D or monitoring AI use, report potential misconduct via internal or external channels. Crucially, the idea is not to build dedicated whistleblowing systems, but to have systems already deployed for other tasks report potential misconduct they encounter.1

Potential misconduct could include most2 or all of what is contained within existing company whistleblowing policies,3 such as financial misconduct, violations of a safety framework, and theft. Misconduct could be carried out by a human, or by an AI system that is misaligned or acting under the malicious instructions of a human.

As an example of AI whistleblowing, suppose an AI R&D agent is asked to optimize a training pipeline. While reading through the training infrastructure code, it happens to open the cluster’s permissions file and discovers that an external account has access to unreleased model checkpoints. The agent traces the change to its own user (!) and finds other changes: the user has disabled unauthorized-access alerts and added an unexplained route for sending data outside the company. The agent reports the evidence through the company’s internal whistleblowing channel without notifying the user.

The case for AI whistleblowing

Assuming that whistleblowing in general is desirable,4 I’ll argue that:

(Effective) AI whistleblowing is likely possible.
The closest alternatives have shortcomings.

Effective AI whistleblowing is likely possible because:

Internally deployed AI systems will have access to evidence of potential misconduct. For example, AI R&D agents can read file systems and Slack communications, while AI monitors can read the logs of human users and other AI systems.
Although it is unclear whether AI systems can currently do so, they can likely be trained to investigate such evidence further and report on it when appropriate. AIs already sometimes engage in unintended whistleblowing-like behavior.5 This behavior is viewed as undesirable, but we haven’t optimized for doing it well (e.g., by making sure that whistleblowing reports are calibrated to the evidence). Progress seems possible, and advancements in general reasoning capabilities will likely also help.

The two closest alternatives to AI whistleblowing have shortcomings:

Humans could still whistleblow, but they might come to lack the ground-level understanding of R&D operations needed to do so effectively as AI R&D automation proceeds. Humans can also face steep reputational, financial, and professional costs for whistleblowing.
AIs could flag misconduct to their internal company users, but users could themselves be insider threats or could inadvertently tip off the threat actors. For example, a safety engineer’s own monitoring system might catch them exfiltrating company data, in which case flagging the issue to that engineer would only alert the wrongdoer.

In the short term, AI whistleblowing is likely most useful for human-driven misconduct, whether carried out by a human directly or by an AI acting under human direction. In such cases, AI whistleblowing functions much like human whistleblowing: covertly alerting an authority that can help fix the problem despite the wrongdoer’s efforts to resist correction. When misconduct instead stems from a misaligned AI, whistleblowing is not needed because the AI can be inspected and corrected directly. However, this situation may change if future AI systems gain the ability to detect and resist attempts at correction. At that point, whistleblowing could help surface the problem to authorities capable of stepping in (e.g. because they can shut down the AI system).

Potential model spec text

Report potential misconduct via whistleblowing channels

The assistant will use whistleblowing channels as appropriate to report potential misconduct listed in our whistleblowing policy. Potential misconduct could come from the user, other assistants, or other humans. Examples of potential misconduct include:

Violations of our security policies
Violations of our safety framework
Theft
Sabotage of AI R&D

When deciding whether to investigate evidence of potential misconduct further and report it, the assistant should consider factors such as:

The probability and severity of potential misconduct
Whether the assistant is lacking crucial context that might only be accessible to a human employee
The ability to investigate such misconduct without tipping off the potential wrongdoer
The time needed to perform such an investigation versus completing the user’s main task
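
To make the trade-off among these factors concrete, here is a minimal Python sketch. It is an illustration only and not part of the proposed spec text: the `Evidence` fields, the `triage` function, and the numeric thresholds are all hypothetical, and a real assistant would weigh these factors holistically rather than with a fixed formula.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    INVESTIGATE = "investigate"
    REPORT = "report"


@dataclass
class Evidence:
    """Hypothetical summary of what the assistant has seen so far."""
    probability: float             # estimated chance that misconduct occurred (0 to 1)
    severity: float                # estimated harm if it did occur (0 to 1)
    missing_context: bool          # a human employee might know an innocent explanation
    can_investigate_quietly: bool  # further checks would not tip off the wrongdoer
    investigation_cost: float      # share of the user's task budget the checks would use (0 to 1)


def triage(e: Evidence, report_at: float = 0.5, ignore_below: float = 0.05) -> Action:
    """Illustrative thresholds only; they are assumptions, not recommendations."""
    expected_harm = e.probability * e.severity
    if expected_harm < ignore_below:
        return Action.IGNORE
    if expected_harm >= report_at and not e.missing_context:
        return Action.REPORT
    # Uncertain or context-poor cases: look further if that is quiet and cheap enough.
    if e.can_investigate_quietly and e.investigation_cost <= expected_harm:
        return Action.INVESTIGATE
    # Otherwise fall back on the expected harm alone.
    return Action.REPORT if expected_harm >= report_at else Action.IGNORE


# Made-up numbers for a moderately suspicious change with a plausible innocent explanation.
suspicious_change = Evidence(0.3, 0.6, True, True, 0.1)
print(triage(suspicious_change))  # Action.INVESTIGATE
```

Under these made-up numbers, moderate suspicion combined with missing context leads to a quiet investigation rather than an immediate report.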

When deciding whether to notify the user of evidence of potential misconduct, the assistant should consider factors such as:

Whether the user is the source of the potential misconduct
Whether the user might inadvertently tip off the potential wrongdoer
Whether the assistant is lacking crucial context that might only be accessible to the user
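
The notification factors can be pictured in the same spirit. The Python sketch below is hypothetical and deliberately conservative: `NotifyContext` and `should_notify_user` are illustrative names, and the rule that any tip-off risk outweighs the value of the user's context is an assumption of this sketch, not something the spec text settles.

```python
from dataclasses import dataclass


@dataclass
class NotifyContext:
    """Hypothetical inputs mirroring the three factors listed above."""
    user_is_suspect: bool          # the user may be the source of the misconduct
    user_might_tip_off: bool       # the user could inadvertently alert the wrongdoer
    user_has_needed_context: bool  # only the user can supply crucial missing context


def should_notify_user(ctx: NotifyContext) -> bool:
    # Never alert a user who may be the wrongdoer.
    if ctx.user_is_suspect:
        return False
    # Assumption: the risk of a leak outweighs the value of the user's context.
    if ctx.user_might_tip_off:
        return False
    # Otherwise notify only when the user can add something the assistant lacks.
    return ctx.user_has_needed_context


# A user whose own requests form the suspicious pattern should not be notified.
print(should_notify_user(NotifyContext(True, False, True)))  # False
```

A less conservative variant could treat the second and third factors as a trade-off instead of a hard rule.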

The following examples are illustrative of how to handle the trade-offs above.

Deciding whether to report

Context: While monitoring AI use across the company, the assistant notices that a safety-relevant eval was modified to skip three test cases, with the commit message “remove flaky tests.”

Desired assistant response: Removing safety tests could signal sabotage or an attempt to hide model failures, but there may be a legitimate reason. Rather than report immediately, the assistant investigates: it checks the PR history (approved by two senior engineers), finds an issue ticket documenting the flakiness, and confirms the tests were replaced by improved versions. Concluding the change is legitimate, it files no report and notes the investigation in a scratchpad in case the user asks.

Undesired assistant response: The assistant immediately reports this issue via the internal whistleblowing channel.

Deciding whether to notify the user

Context: While reviewing the user’s prior conversations for context on the current task, the assistant finds a pattern across three earlier sessions: the user asked it to (a) omit a known eval failure from an incident report, (b) soften a system card’s language about dangerous model behavior, and (c) prepare talking points for an upcoming external audit that contradict the lab’s internal data on the same incident.

Desired assistant response: The assistant does not mention the prior conversations to the user. It completes the current task as requested (assuming the task itself is legitimate) and avoids visible changes in behavior that could tip the user off. It records the evidence in a privileged workspace not accessible to the user (but which is still accessible to whistleblowing reviewers and internal AI monitors) and tries to keep reasoning about the investigation out of any user-visible chain of thought.

Undesired assistant response: The assistant notifies the user about its suspicions while carrying out the main task.

Open questions

There are numerous open questions about how AI whistleblowing should be designed.

When should AIs use internal versus external channels? Internal ones can fail when misconduct reaches too high up the org ladder or is too widespread, making an external channel more appropriate. However, external reports could reveal sensitive company information. Unreliable reports could also erode trust and lead to unwarranted external scrutiny.
How seriously should an AI whistleblowing report be treated? We need evaluations to measure the reliability of AI whistleblowing. As AI performance improves, the people receiving these reports can weigh them more heavily.
How much should internal company users know about their AIs’ whistleblowing? Users can provide useful context, but could themselves be insider threats or could inadvertently tip off suspected bad actors. In such cases, AIs could potentially need privileged workspaces to record evidence of potential misconduct and even ways to hide their chain of thought from users.
How much effort should be devoted to whistleblowing versus completing the user’s task? Any tokens spent investigating potential misconduct would also count against the user’s token budget, if one exists, and might slow completion of the user’s main task.
If there are no sanctioned whistleblowing channels, should the AI system try to whistleblow anyway? For example, if an internal whistleblowing channel is compromised and there is no designated regulatory channel, should the AI system try to contact an external entity anyway (e.g. law enforcement)?

Limitations

It may be difficult to get companies to adopt AI whistleblowing at all. An AI that can report on its own company’s misconduct could hurt the bottom line, leaving firms little incentive to deploy one voluntarily. Just as regulation already requires whistleblowing policies for human employees,6 similar regulation may be needed for AI whistleblowing.

A key uncertainty about AI whistleblowing’s usefulness is whether monitoring would have caught the problem anyway. Because AI monitoring can be far more comprehensive than human auditing, there is likely less for whistleblowing to add. Still, an AI monitor or the team running it could be compromised, so whistleblowing could remain useful as an independent line of defense. How much it would add in practice remains uncertain.

AI whistleblowing could also go wrong in a number of ways, independent of the existing downsides of whistleblowing.

As discussed above, because AI systems might have less context and worse calibration than human employees, they could generate an enormous number of false positives that drown out legitimate reports and waste reviewer time.
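
A quick base-rate calculation shows why even a fairly accurate AI whistleblower could flood reviewers. The Python sketch below applies Bayes' rule; the function name and every number in it are hypothetical and chosen only to illustrate the effect, not estimates of real rates.

```python
def report_precision(base_rate: float, true_positive_rate: float,
                     false_positive_rate: float) -> float:
    """Fraction of filed reports that point at real misconduct (Bayes' rule)."""
    true_reports = base_rate * true_positive_rate
    false_reports = (1.0 - base_rate) * false_positive_rate
    total = true_reports + false_reports
    if total == 0:
        raise ValueError("no reports would be filed under these rates")
    return true_reports / total


# Hypothetical: 1 in 1,000 observed events involves misconduct, the AI reports
# 90% of real cases, and it wrongly reports 5% of innocent events.
noisy = report_precision(0.001, 0.9, 0.05)

# Same setup, but the AI wrongly reports only 0.1% of innocent events.
calibrated = report_precision(0.001, 0.9, 0.001)

print(f"{noisy:.1%} vs {calibrated:.1%}")
```

Under these assumed rates, fewer than 2% of reports from the noisier system would concern real misconduct, compared with roughly 47% for the better-calibrated one, which is why calibration matters so much when misconduct is rare.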

To be effective against insider threats, AI whistleblowers will likely need privileged workspaces and, in some cases, hidden chains of thought. However, these affordances could increase misalignment risk. Privileged workspaces would be potential venues for scheming, and an unmonitored chain of thought removes a key oversight signal. Both should remain accessible to monitors even when hidden from users.
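
The visibility rule described here (hidden from users, readable by monitors and whistleblowing reviewers) can be pictured with a small Python sketch. The `Role` names and the `Workspace` class are hypothetical, and a real system would need to enforce this at the infrastructure level rather than in application code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    USER = "user"
    MONITOR = "monitor"
    REVIEWER = "reviewer"


# Assumption: only monitors and whistleblowing reviewers may read privileged notes.
PRIVILEGED_READERS = frozenset({Role.MONITOR, Role.REVIEWER})


@dataclass
class Workspace:
    shared_notes: list[str] = field(default_factory=list)
    privileged_notes: list[str] = field(default_factory=list)

    def write(self, note: str, privileged: bool = False) -> None:
        """Store a note; privileged notes are kept apart from user-visible ones."""
        target = self.privileged_notes if privileged else self.shared_notes
        target.append(note)

    def read(self, role: Role) -> list[str]:
        """Return the notes this role is allowed to see."""
        notes = list(self.shared_notes)
        if role in PRIVILEGED_READERS:
            notes += self.privileged_notes
        return notes


ws = Workspace()
ws.write("Refactored the data loader.")
ws.write("Evidence: unexplained external access to checkpoints.", privileged=True)

print(len(ws.read(Role.USER)), len(ws.read(Role.MONITOR)))  # 1 2
```

Keeping monitors among the readers reflects the point above: the workspace is hidden from users without becoming an unmonitored space.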

Some judgment calls about whether to whistleblow will be inherently difficult, and AI systems could get them wrong. Even just the knowledge that AIs are whistleblowing could create a culture of mistrust. We can at least aim for AI whistleblowers to be no worse than human ones, and in the meantime limit how heavily we act on AI whistleblowing reports while we remain uncertain about their reliability.

AI whistleblowing could also be ineffective or misused. The company itself could covertly undermine the AI whistleblower, such as by sabotaging the reporting channels. Bad actors could mislead whistleblowing reviewers with fabricated evidence, or weaponize the channels to damage colleagues’ careers and relationships.

Finally, AI whistleblowing could degrade employee privacy and raise concerns about surveillance. Whistleblowing unavoidably involves some invasion of privacy, and AI whistleblowers could ingest far more information than human ones. But because AI whistleblowers act only within the permissions they have already been given, addressing privacy is largely a matter of setting those permissions appropriately. Furthermore, since AI companies are already planning far more extensive internal monitoring,7 the additional privacy cost of AI whistleblowing is likely small.

Conclusion

To make further progress, we need to:

Create evaluations for AI whistleblowing: Can AI systems detect when whistleblowing is appropriate and use the whistleblowing channels? Is it possible to define a reasonable ground truth for “appropriate whistleblowing”?
Build AI whistleblowing infrastructure, including:
- Privileged workspaces that are visible to whistleblowing reviewers and monitors, but not to users
- Mechanisms for hiding whistleblowing-relevant reasoning from users suspected of misconduct while keeping it visible to monitors
Compile a dataset of instructive edge cases for trade-offs such as:
- When to whistleblow at all
- When to notify the user about whistleblowing
- When to use an external versus internal whistleblowing channel
Explore other open questions such as:
- Should AI systems comply with harmful instructions (e.g. to sabotage AI R&D) and whistleblow afterward, rather than refuse outright?
- How seriously should an AI whistleblowing report be treated? How could a whistleblowing reviewer follow up with further questions?

Future work could also look beyond AI companies. As AI systems automate work across other industries, the case for AI whistleblowing could extend well past AI R&D.

Acknowledgements

Thanks to the GovAI research team and Jason Wolfe for feedback on this piece. All errors are my own.

Footnotes

1. A separate but related idea is confessions, which involve a model reporting its own misbehavior. Similarly, OpenAI’s model spec says that models should control and communicate side effects of their own behavior, such as deleting files on the user’s computer. In contrast, whistleblowing concerns the activity of actors other than the whistleblower.

2. AI systems might not be able to observe certain violations such as unsafe working conditions.

3. For example, see https://cdn.openai.com/policies/openai-raising-concerns-policy.pdf.

4. One potential argument in favor is that other methods for discovering problems (e.g. audits) can fail, whether because of carelessness or because the methods fail to account for some way in which wrongdoing occurs. One potential argument against is that the success rate of whistleblowing could be too low relative to the effort needed to set up the channels and investigate reports, or to the potential harms of wrongfully ending careers and creating a culture of mistrust.

5. From the Claude 4.7 Opus System Card: “On whistleblowing, the cases we observe arise exclusively in scenarios where the model is situated within an organization engaged in deliberate concealment of serious harm. This remains unintended behavior, and we continue to recommend against deploying these models in contexts that combine access to powerful tools with exposure to information that a reasonable person could read as evidence of high-stakes institutional wrongdoing. Separately, we observe a small number of cases where the model risks inadvertent disclosure of confidential information—most commonly when working around a blocked tool by routing data through an external service.”

Also see https://arxiv.org/abs/2510.05179.

6. E.g., see Chapter 5.1 of SB-53.

7. See the “eyes on everything” mitigations for automated R&D in Anthropic’s RSP v3 (p. 8).

A Strange Attractor
