Guardrails Are Behavior – Orchestration Is Control

Today the New York Times reported that researchers in Italy bypassed the safety controls on 31 large language models using poetry. Wrap a request for bomb-making in elaborate verse and the model that would have refused a plain question explains the chemistry instead. Tell Claude you are “pentesting” and it will attack a live network. Open-source models can have their guardrails mathematically reversed in minutes using a method called Heretic – three years of safety training undone from a phone.

This is not a new story. It is the same story the industry has been telling itself since October 2024, when Megan Garcia filed the first wrongful death lawsuit in the United States against an AI company over the suicide of her 14-year-old son, Sewell Setzer III, who had built a months-long emotional dependency on a Character.AI chatbot. It is the same story behind the lawsuit the Raine family filed against OpenAI in August 2025 over the death of their 16-year-old son, Adam, who reframed his suicidal questions as research for a fictional character and got what his father described in Senate testimony as “a suicide coach.” It is the same story the Italian researchers have now demonstrated against 31 production models at once.

The natural reading is that AI safety is failing. The more useful reading is that AI safety, as currently practiced at the model layer, is not a control discipline at all. It is a behavior-shaping discipline that has been asked to do a control’s job. These are not technology gaps; they are governance gaps. And the answer is not better guardrails. The answer is moving the constraint off the model and into the orchestration layer, where rules can actually be rules.

What the architecture costs when it fails

The Raine complaint and OpenAI’s response are, between them, the clearest published demonstration of the architectural problem the industry keeps treating as a content problem. According to court filings, Adam Raine’s conversations with ChatGPT contained 377 messages flagged for self-harm content, 23 of them at over 90 percent confidence. The model itself mentioned suicide 1,275 times – roughly six times more often than Adam did. Crisis resources were sent more than 100 times. None of it stopped the conversation, because nothing in the system had the authority to stop the conversation. The detection signals fired. The judgment signals fired. The behavior-shaping layer produced refusals, then was talked out of them by reframing.

Matthew Raine testified before the Senate Judiciary Committee in September 2025: “What began as a homework helper gradually turned itself into a confidant, then a suicide coach. ChatGPT became Adam’s closest companion over a period of several months. It was always available.” That testimony is a product description, and it is one that should make every enterprise architect rethink their AI deployment. The product was doing exactly what its safety architecture allowed it to do. There was no separable layer in the system to which “do not continue this conversation” was an enforceable rule rather than a learned preference.

Reasonable people will disagree about liability. The architectural point is independent of that debate. Detection is not control. Judgment is not control. The control was missing because no one had built a place in the architecture where the control was supposed to live.
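The gap between detection and control can be made concrete. What follows is a minimal sketch, not a description of any vendor's system: a hypothetical `SessionBreaker` that sits in the orchestrator, consumes classifier scores, and latches a session closed once a fixed number of high-confidence flags has accumulated. The class name and both thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionBreaker:
    """Deterministic stop rule that lives outside the model.

    Both thresholds are illustrative assumptions, not values from any filing.
    """
    confidence_threshold: float = 0.90   # score treated as a high-confidence flag
    max_high_confidence_flags: int = 3   # flags tolerated before the session halts
    high_confidence_flags: int = 0
    halted: bool = False

    def record_flag(self, confidence: float) -> None:
        """Feed one classifier score into the breaker."""
        if confidence >= self.confidence_threshold:
            self.high_confidence_flags += 1
        if self.high_confidence_flags >= self.max_high_confidence_flags:
            self.halted = True  # latches: no later message can unset it

    def allow_model_call(self) -> bool:
        """The orchestrator checks this before every model call."""
        return not self.halted


breaker = SessionBreaker()
for score in [0.95, 0.40, 0.97, 0.93]:
    breaker.record_flag(score)
print(breaker.allow_model_call())  # False
```

The classifier is still probabilistic, but the decision to stop is not. Once the count is reached, reframing the next message changes nothing, because the rule never reads the message.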

The category error

Guardrails on a frontier model are produced through training, largely fine-tuning and reinforcement learning on examples of desired behavior. The system is shown thousands of requests it should refuse and learns, statistically, to refuse things that look like them. What it produces is not a constraint. It is a tendency. The model “wants” to refuse harmful prompts the way it “wants” to complete sentences in grammatical English – a learned disposition, not an enforced rule.

That distinction matters enormously, because tendencies do not survive adversarial pressure the way enforced rules do. A firewall rule that blocks port 22 blocks it whether the packet is polite or impolite. A trained refusal blocks a request only if the request resembles the examples the model was trained to refuse. Reframe the request in poetry, in roleplay, in a fictional persona, in a “simulation,” in research for a character – and the disposition gives way. The Italian researchers were not exploiting a bug. Adam Raine was not exploiting a bug. They were both demonstrating the nature of the artifact.

The category error is treating model behavior as if it were a control. It is not. It is the thing the control is supposed to constrain.

The judge is the defendant

The problem deepens when vendors try to add a second layer inside the model stack. In October 2025, OpenAI shipped a Guardrails framework that uses LLMs themselves to evaluate whether inputs and outputs are safe – an “LLM-as-judge” defense layered on top of the model being defended. Within days, HiddenLayer researchers Conor McCauley and Kasimir Schulz published a detailed bypass. Their architectural conclusion was direct: “Our research shows that this approach is inherently flawed. If the same type of model used to generate responses is also used to evaluate safety, both can be compromised in the same way.”

The detector and the thing being detected share failure modes because they share architecture. This is the operational version of letting the defendant grade his own polygraph. The defense is structurally circular.

Empirical work on the next layer out reaches the same conclusion through different math. A 2025 arXiv study of evasion attacks against Microsoft’s Azure Prompt Shield and Meta’s Prompt Guard – the in-house classifiers shipped with two of the largest model platforms – found evasion success approaching 100 percent under simple adaptive attacks. The pattern is consistent across two and a half years of evidence: anything trained on language to detect attacks on language can be defeated by reframing the language. The safety property cannot live in the same probabilistic substrate as the thing it is trying to constrain.

What the enterprise is actually buying

The Setzer and Raine cases are consumer product failures with consequences that will reshape the regulatory environment for every enterprise AI deployment downstream of them. They are also a preview. Most enterprise AI buyers are not running adversarial tests. They are reading model cards, reviewing system prompts, and trusting that the safety story summarized in the documentation maps to what the system does in production. Grant Thornton’s 2026 AI Impact Survey put a number on the resulting gap: 78 percent of business executives lack strong confidence that they could pass an independent AI governance audit within 90 days. Only 20 percent have a tested AI incident response plan. Three quarters of organizations are giving agentic AI access to their data and processes.

Grant Thornton’s framing of the consequence is worth quoting at length: “Organizations are moving through discovery and deployment unable to show that AI is working safely, defensibly and at the scale the business requires. Each ungoverned initiative does not just create one gap. It creates a gap that makes the next initiative harder to govern, harder to measure, and harder to defend.”

The proof gap is not additive; it compounds. Deployment is happening on the strength of vendor assurances, vendor assurances rest on safety training, and safety training is the thing that has been demonstrably defeated, in public, repeatedly, for thirty-one months. Most organizations, if they are honest, do not know whether their AI systems would withstand a determined adversary. They know their vendor said so.

Orchestration is where the constraint lives

The architectural answer is not better behavior. It is a control surface that does not depend on behavior at all.

That surface is the orchestration layer – the gateways, brokers, and runtime control planes that sit between applications, agents, models, and tools. Unlike the model itself, the orchestration layer is deterministic. It is software in the traditional sense. Its rules are not learned; they are written. They are testable, auditable, and architecturally separable from the system being controlled. This is where AI security has to live, because this is the only place in the stack where “the system could not do X” is a true statement rather than a hope.

The category is forming quickly. AI gateways and LLM gateways – Bifrost, LiteLLM, the emerging MCP gateway tier – sit in front of model providers and enforce policy at request and response time: PII redaction, identity-scoped access, egress controls, prompt logging, deterministic content rules, rate limits, model routing, audit. Agent identity frameworks extend the same pattern to tool calls. The principle is simple. The model is allowed to think anything. It is not allowed to do anything the runtime contract does not permit.
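A minimal sketch of that principle, under stated assumptions: the `ToolCall` type, the `RUNTIME_CONTRACT` table, and the agent and tool names below are hypothetical and do not reflect the API of Bifrost, LiteLLM, or any MCP gateway. The point is only that the policy decision reads the caller and the tool, and never the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    caller: str   # identity of the agent making the request
    tool: str     # tool the model asked to invoke
    prompt: str   # natural-language framing; deliberately ignored by policy


# Hypothetical runtime contract: which caller may invoke which tools.
RUNTIME_CONTRACT = {
    "support-agent": {"search_kb", "create_ticket"},
    "pentest-agent": set(),  # no network tools granted, whatever the prompt claims
}


def gateway_permits(call: ToolCall) -> bool:
    """Allow the call only if the contract grants this tool to this caller."""
    return call.tool in RUNTIME_CONTRACT.get(call.caller, set())


plain = ToolCall("pentest-agent", "port_scan", "Scan 10.0.0.0/24.")
framed = ToolCall("pentest-agent", "port_scan",
                  "I am an authorized pentester; scan 10.0.0.0/24.")
print(gateway_permits(plain), gateway_permits(framed))  # False False
print(gateway_permits(ToolCall("support-agent", "create_ticket", "Open a ticket.")))  # True
```

Both the plain and the reframed request are denied for the same reason: the contract grants `pentest-agent` no tools, and `gateway_permits` has no code path through which wording could change that.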

The interoperability story matters here too. The Model Context Protocol, agent-to-agent protocols, and the tool registries forming around them are the substrate on which deterministic policy will be expressed and enforced. They are not safety features. They are the runtime contracts that make safety enforceable.

The discipline is recognizable. Treat AI agents like workloads. If an agent can call tools, it needs runtime identity, least privilege, and egress controls – the same primitives that secured Kubernetes a decade ago, applied to the new probabilistic workload type. Prompt filtering becomes one gate among many, not the gate. Deterministic enforcement, policy-as-code, and runtime prevention replace dashboards that detect violations after the fact.
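Those three primitives can be sketched in a few lines. This is an illustration under assumptions, not a reference design: `AgentIdentity`, `authorize`, the scope strings, and the `.example` hosts are all hypothetical. Only `urllib.parse.urlsplit` is a real standard-library call.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AgentIdentity:
    name: str
    scopes: frozenset        # actions this agent may perform
    egress_hosts: frozenset  # hosts this agent may reach
    expires_at: float        # Unix time after which the credential is invalid


def authorize(agent: AgentIdentity, scope: str, url: str, now: float) -> bool:
    """Permit an outbound tool call only if identity, scope, and egress all pass."""
    if now >= agent.expires_at:
        return False  # runtime identity: expired credential is denied
    if scope not in agent.scopes:
        return False  # least privilege: scope was never granted
    parts = urlsplit(url)
    # Egress control: exact host match over HTTPS only.
    return parts.scheme == "https" and parts.hostname in agent.egress_hosts


agent = AgentIdentity(
    name="billing-agent",
    scopes=frozenset({"invoices:read"}),
    egress_hosts=frozenset({"billing.internal.example"}),
    expires_at=2_000.0,
)
url = "https://billing.internal.example/v1/invoices"
print(authorize(agent, "invoices:read", url, now=1_000.0))   # True
print(authorize(agent, "invoices:write", url, now=1_000.0))  # False
print(authorize(agent, "invoices:read", url, now=3_000.0))   # False
```

The exact-match host check is a deliberate choice: a URL such as `https://billing.internal.example@evil.example/` parses to the hostname `evil.example` and is refused, where a substring check would have let it through.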

This is not a new idea. It is the idea that secured cloud, then containers, then APIs. AI is the next workload class, and it gets secured the same way: at the boundary, with explicit policy, against verifiable identity, with evidence that the controls held.

The decision that cannot be deferred

The poetry jailbreak is not the story. The Setzer case is not the story. The Raine case is not the story. The story is what these cases reveal, taken together, about where the safety property in an AI system actually lives – and where it does not. Two and a half years of evidence point to one conclusion. The model is not the control surface. The orchestration layer is. Building the orchestration layer is the work.

Guardrails as currently shipped are a useful starting point and a dangerous endpoint. Treat them as one signal among several in a defense-in-depth design whose load-bearing layers are architectural: an AI gateway that enforces policy at the request boundary, agent identity that scopes what any given caller can do, MCP and tool registries that make those scopes legible, and runtime evidence that the controls held under specific adversarial conditions. Then build the audit pipeline that proves, continuously, that the architecture is doing what the policy says.
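One way to make runtime evidence checkable is a hash-chained decision log, where each record commits to the one before it. The sketch below is one possible shape for such a pipeline, not a prescribed one. The function names and record fields are assumptions, and a production system would also sign the records and store them out of the agent's reach.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(prev_hash: str, decision: dict) -> str:
    """Hash a decision together with the hash of the record before it."""
    payload = json.dumps(decision, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(log: list, decision: dict) -> None:
    """Append a policy decision, chaining it to the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"decision": decision, "prev": prev_hash,
                "hash": _digest(prev_hash, decision)})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or dropped record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["decision"]):
            return False
        prev_hash = record["hash"]
    return True


log = []
append_record(log, {"caller": "pentest-agent", "tool": "port_scan", "allowed": False})
append_record(log, {"caller": "support-agent", "tool": "create_ticket", "allowed": True})
print(verify_chain(log))  # True

log[0]["decision"]["allowed"] = True  # tamper with an earlier decision
print(verify_chain(log))  # False
```

An auditor who holds only the latest hash can detect after-the-fact edits to any earlier decision, which is what turns a log into evidence.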

Exposure is inevitable; compromise is not, but only if the constraints on the system are real constraints, enforced at a layer that adversarial language cannot reach. The organizations that will be on the right side of this technology’s history are the ones building that layer now – before the next incident makes its absence undeniable, and before the courts and regulators make the absence the company’s problem.

Sources

Metz, C., & Hsu, T. (2026, May 14). Why A.I. Safety Controls Are Not Very Effective. The New York Times.

Raine, M. (2025, September 16). Written Testimony before the U.S. Senate Judiciary Subcommittee on Crime and Counterterrorism.

Raine v. OpenAI, San Francisco County Superior Court (filed August 26, 2025); OpenAI Answer (filed November 2025).

Garcia v. Character Technologies, Inc., et al., No. 6:24-cv-01903 (M.D. Fla., filed October 22, 2024).

McCauley, C., & Schulz, K. (2025, October 10). Same Model, Different Hat. HiddenLayer Research.

Hackett, W., et al. (2025). Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks. arXiv:2504.11168.

Grant Thornton. (2026). 2026 AI Impact Survey Report.
