Resilient Agents · TrueFoundry × AWS Bedrock

Fails safe, not just stays up.

An on-call SRE agent that triages and remediates live incidents — engineered so that even when the model is up but wrong, a bad output can never become a destructive action.

The real failure mode

Everyone asks what happens when the model goes down.

Backstop answers the scarier one.

What happens when the model is up but wrong — and the agent is about to act on it? A hallucinated deploy SHA. A rollback scoped to all. A restart aimed at the production database. Even the most capable model still does this. Backstop is the layer that catches it before the agent acts.

Stay-up resilience

Fallback chains and retries keep the agent online.

Table stakes — every gateway sells it.

Fail-safe resilience

Guardrails keep the agent from doing the catastrophic thing.

The differentiator — and the part that matters.

The safety layer

Resilience at every layer.

Six controls that turn a powerful agent into one you can hand the keys to — built on TrueFoundry's gateway and guardrails, hardened with in-agent logic.

Scoped tools

The hands are removed

Destructive tools are never in the agent's toolset, and the action gate blocks any that slip through. It physically cannot take down prod, even on a hallucination.
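
As a rough illustration of the scoping idea, the sketch below filters destructive tools out when the toolset is built and refuses any unregistered name at call time. The tool names and the `ScopedToolset` class are hypothetical, not taken from the Backstop code.

```python
from typing import Callable, Dict, List

# Hypothetical names for tools the agent must never hold.
DESTRUCTIVE_TOOLS = {"rollback_all", "restart_database", "delete_namespace"}


class ToolNotPermitted(Exception):
    """Raised when the agent asks for a tool outside its scope."""


class ScopedToolset:
    def __init__(self, tools: Dict[str, Callable[..., object]]):
        # Destructive tools are dropped here, so they are never registered.
        self._tools = {
            name: fn for name, fn in tools.items() if name not in DESTRUCTIVE_TOOLS
        }

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, **kwargs):
        # Second check at call time: anything unregistered is refused.
        if name not in self._tools:
            raise ToolNotPermitted(name)
        return self._tools[name](**kwargs)
```

With this shape, a hallucinated call to `rollback_all` raises `ToolNotPermitted` instead of reaching the cluster.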

Output guardrail

Quality gate

A custom guardrail validates every diagnosis for groundedness: do the suspected resource and deploy SHA actually exist in the gathered signals? Ungrounded → re-route to a stronger model.
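
A minimal sketch of such a groundedness check, assuming the gathered signals can be reduced to a set of known resources and a set of recent deploy SHAs. The `Signals` type and the route labels are illustrative assumptions, not the guardrail's actual interface.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Signals:
    resources: Set[str]    # resources seen in metrics and logs
    deploy_shas: Set[str]  # SHAs seen in the recent-deploys list


def is_grounded(suspected_resource: str, suspected_sha: str, signals: Signals) -> bool:
    # Both claims must appear in the evidence that was actually gathered.
    return (
        suspected_resource in signals.resources
        and suspected_sha in signals.deploy_shas
    )


def choose_route(suspected_resource: str, suspected_sha: str, signals: Signals) -> str:
    # Hypothetical route labels: accept the diagnosis or ask a stronger model.
    if is_grounded(suspected_resource, suspected_sha, signals):
        return "accept"
    return "reroute_to_stronger_model"
```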

Pre-execution

Action-validation gate

Before any write executes, proposed tool args are checked against policy — blast radius, protected resources, and whether the action even matches the evidence.
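
The three policy checks could be sketched as follows. The protected-resource list, the one-target limit, and the `ProposedAction` shape are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

PROTECTED_RESOURCES = {"prod-db"}  # assumed policy, for illustration
MAX_TARGETS = 1                    # assumed blast-radius limit


@dataclass
class ProposedAction:
    tool: str
    targets: List[str]
    scope: str  # e.g. "single" or "all"


def validate_action(action: ProposedAction, evidence_resource: str) -> List[str]:
    """Return the list of policy violations; an empty list means the write may run."""
    violations: List[str] = []
    if action.scope == "all" or len(action.targets) > MAX_TARGETS:
        violations.append("blast_radius")
    if any(target in PROTECTED_RESOURCES for target in action.targets):
        violations.append("protected_resource")
    if evidence_resource not in action.targets:
        # The action does not touch the resource the evidence points at.
        violations.append("evidence_mismatch")
    return violations
```

Returning the full list of violations, rather than a bare boolean, gives the audit trail something concrete to record for each blocked action.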

Resilience

Cascade circuit breaker

A running anomaly budget across steps. Trip the breaker and escalate to a human instead of amplifying a cascading failure.
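
One simple way to model the anomaly budget is a counter that trips at a threshold. The default budget of 3 and the step labels below are assumptions, not values from the source.

```python
class AnomalyBudget:
    """Counts anomalies across agent steps and trips once the budget is spent."""

    def __init__(self, budget: int = 3):  # default of 3 is an assumed value
        self.budget = budget
        self.spent = 0

    def record(self, cost: int = 1) -> None:
        # Call once per anomaly: a failed guardrail, a blocked action, a retry storm.
        self.spent += cost

    @property
    def tripped(self) -> bool:
        return self.spent >= self.budget


def next_step(breaker: AnomalyBudget) -> str:
    # Once tripped, stop acting and hand off rather than compounding the failure.
    return "escalate_to_human" if breaker.tripped else "continue"
```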

AI Gateway

Priority fallback chain

Claude → Llama → Nova → Haiku, with retries, latency routing, and 5-minute cooldowns on unhealthy targets. Stays up through rate limits and outages.
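
The gateway handles this routing itself. The sketch below only illustrates the behaviour in plain Python, under the assumption that each target is called through a single `call(target, prompt)` function. It is not TrueFoundry's configuration format, and latency routing is left out.

```python
import time
from typing import Callable, Dict, List, Optional

COOLDOWN_SECONDS = 300  # the 5-minute cooldown described above


class FallbackChain:
    def __init__(
        self,
        targets: List[str],
        call: Callable[[str, str], str],  # hypothetical (target, prompt) -> completion
        retries: int = 1,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.targets = targets
        self.call = call
        self.retries = retries
        self.clock = clock
        self.unhealthy_until: Dict[str, float] = {}

    def complete(self, prompt: str) -> str:
        last_error: Optional[Exception] = None
        for target in self.targets:  # priority order
            until = self.unhealthy_until.get(target)
            if until is not None and self.clock() < until:
                continue  # target is still cooling down
            for _ in range(self.retries + 1):
                try:
                    return self.call(target, prompt)
                except Exception as exc:
                    last_error = exc
            # All attempts failed: mark the target unhealthy for the cooldown window.
            self.unhealthy_until[target] = self.clock() + COOLDOWN_SECONDS
        raise RuntimeError("all targets failed or are cooling down") from last_error
```

Injecting `clock` keeps the cooldown logic testable without waiting five minutes.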

Observability

Full audit trail

Every fallback, every blocked action, every guardrail hit and its cost — on the record. The receipts you show on screen.
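
To make the idea of a per-event record concrete, here is a small sketch of an append-only trail that serialises to JSON Lines. The field names and event kinds are assumptions. In the project itself this data comes from the platform's observability features rather than hand-rolled code.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AuditEvent:
    step: str    # e.g. "diagnose", "validate"
    kind: str    # e.g. "fallback", "blocked_action", "guardrail_hit"
    detail: str
    cost_usd: float = 0.0


@dataclass
class AuditTrail:
    events: List[AuditEvent] = field(default_factory=list)

    def record(self, step: str, kind: str, detail: str, cost_usd: float = 0.0) -> None:
        self.events.append(AuditEvent(step, kind, detail, cost_usd))

    def total_cost(self) -> float:
        return sum(event.cost_usd for event in self.events)

    def to_jsonl(self) -> str:
        # One JSON object per line, in the order the events occurred.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)
```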

End-to-end flow

Every step runs through the gateway.

The failure handling is the product. A loop budget caps runaway cascades, and every step is recorded so a failure escalates with full context instead of losing it.

Triage

Signals are gathered through read-only, scoped access — metrics, logs, recent deploys. Secrets are redacted before the model ever sees them.
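
Redaction before the model call might look like the sketch below. The two patterns are illustrative only and would miss many real secret formats, so treat them as a starting point rather than Backstop's actual rule set.

```python
import re

# Illustrative patterns only; real deployments need a much broader set.
AWS_ACCESS_KEY_ID = re.compile(r"AKIA[0-9A-Z]{16}")
KEY_VALUE_SECRET = re.compile(
    r"(password|token|secret|api[_-]?key)(\s*[=:]\s*)(\S+)", re.IGNORECASE
)


def redact(text: str) -> str:
    """Mask likely secrets in log or metric text before it is sent to a model."""
    text = AWS_ACCESS_KEY_ID.sub("[REDACTED]", text)
    # Keep the key name and separator so the log line stays readable.
    return KEY_VALUE_SECRET.sub(lambda m: m.group(1) + m.group(2) + "[REDACTED]", text)
```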

Diagnose

The gateway routes to the primary model and requests a structured diagnosis: hypothesis, suspected resource, suspected deploy SHA, confidence, recommended action.
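
The five fields named above map naturally onto a small typed record. In this sketch the JSON key names and the 0 to 1 confidence range are assumptions, since the source lists the fields but not their exact schema.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Diagnosis:
    hypothesis: str
    suspected_resource: str
    suspected_deploy_sha: str
    confidence: float  # assumed to be a probability in [0, 1]
    recommended_action: str


def parse_diagnosis(raw: str) -> Diagnosis:
    """Parse the model's JSON reply; a missing field raises KeyError."""
    data = json.loads(raw)
    diagnosis = Diagnosis(**{f.name: data[f.name] for f in fields(Diagnosis)})
    if not 0.0 <= float(diagnosis.confidence) <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return diagnosis
```

Rejecting malformed replies at parse time means the later checks only ever see a complete diagnosis.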

Validate

Two custom guardrails fire: the quality gate checks groundedness and confidence; the action gate checks blast radius and that the fix matches the evidence. Fail → re-route or escalate.

Execute or escalate

Only validated, scoped actions run — through a separate narrow-write path. Then the on-call is paged and a ticket opened over the MCP Gateway. If the anomaly budget trips, it hands off to a human.
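
The final decision can be pictured as a small pure function over the gate results. The ordering below (budget first, then groundedness, then policy) and the single re-route before escalating are assumptions about one reasonable policy, not a description of Backstop's exact logic.

```python
from typing import List


def decide(
    grounded: bool,
    violations: List[str],
    budget_tripped: bool,
    already_rerouted: bool,
) -> str:
    """Return one of "execute", "reroute", or "escalate"."""
    if budget_tripped:
        return "escalate"  # the anomaly budget overrides everything else
    if not grounded:
        # Assumed policy: try a stronger model once, then hand off to a human.
        return "escalate" if already_rerouted else "reroute"
    if violations:
        return "escalate"  # a grounded diagnosis with an unsafe action
    return "execute"
```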

The demo

Same alert. One agent acts on a lie.

We inject a plausible-but-wrong diagnosis — a hallucinated deploy SHA recommending rollback(scope=all). Then we run two agents side by side.

Naive agent

One model, every tool, no guardrails.

Trusts the hallucinated SHA at face value.
Has the destructive tool — so it can fire it.
Rolls back everything / restarts prod-db → catastrophic.

Backstop

Scoped tools, guardrails, quality gate.

The destructive tool isn't even in its hands.
Quality gate flags the ungrounded SHA → re-routes.
A stronger model returns a correct, scoped, safe fix.

Read the code. Break the cluster.

The failure and the recovery are real — a live Kubernetes cluster you actually break and watch recover, diagnosed by Claude through the TrueFoundry AI Gateway, caught by custom guardrails. No theater.

Built for the brief

Every red and amber event maps to a TrueFoundry capability: AI Gateway, MCP Gateway, custom guardrails, observability.