Build an agentic AI safety pipeline with Runpod Flash and Granite Guardian 4.1

The interesting AI systems being built today are pipelines: a generator drafts something, a retriever pulls in context, another model checks the output, and a router decides what happens next. That pattern is powerful, but every model in the chain is a place where something can go wrong, whether a hallucinated fact or an unsafe completion. The cost of letting bad output through has risen considerably as these systems move into customer-facing roles.

In this post we'll walk through one concrete answer to that problem: an agentic safety pipeline built with Flash, our new framework for orchestrating distributed AI workloads, using IBM's just-released Granite 4.1 family of models, specifically Granite Guardian 4.1 as the safety judge.

The result is a setup where a primary model generates content and a separate, specialized model independently audits it before anything reaches the user, all running on serverless GPUs you only pay for when they're working.

What we mean by "agents"

The word "agent" gets used loosely. For our purposes, an agent is a model with three things: a goal, the ability to call tools or other models, and some control flow that decides what to do next based on what it observed. The simplest agent loops on a task, takes an action, looks at the result, and decides whether it's done or needs to try again. More elaborate agents coordinate multiple sub-agents, each specialized for a piece of the work.
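The simplest loop described above can be sketched in a few lines of plain Python. This is only an illustration: `run_agent`, `act`, `is_done`, and `max_steps` are hypothetical names, and the toy action stands in for a real model or tool call.

```python
from typing import Callable


def run_agent(act: Callable[[str], str],
              is_done: Callable[[str], bool],
              task: str,
              max_steps: int = 5) -> tuple[str, int]:
    """Loop: take an action, look at the result, decide whether to stop."""
    observation = task
    for step in range(1, max_steps + 1):
        observation = act(observation)   # take an action
        if is_done(observation):         # look at the result
            return observation, step     # goal reached, stop here
    return observation, max_steps        # step budget exhausted


# Toy stand-ins: the "action" appends a character; the goal is length >= 5.
result, steps = run_agent(lambda s: s + "!", lambda s: len(s) >= 5, "hi")
print(result, steps)  # hi!!! 3
```

The `max_steps` cap is a design choice worth keeping in any real loop, since an agent that never satisfies its own stopping condition would otherwise run indefinitely.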

The shift from "one model does everything" to "several models cooperate" matters because it lets you separate concerns. A model trained to write helpful, fluent responses is not necessarily the model you want grading whether those responses are safe, grounded, or compliant with your policy. Different jobs, different training data, different failure modes. When you split them apart, you can swap, audit, and update each piece independently.

There's a practical reason too, especially for smaller models. Sub-100B models don't juggle competing objectives nearly as well as the large foundational models do — they start dropping things. Compartmentalizing tasks gives each model one clear job, and you get better output across the board.

Why this matters for safety checking

Safety checking is one of the cleanest applications of the multi-agent pattern, for a few reasons.

The first is independence. If you ask the same model that generated a response to also evaluate whether that response is safe, you're asking it to grade its own homework. Models have known blind spots; a jailbreak that succeeds at getting a harmful generation through will often also succeed at convincing the model the generation is fine. A separate evaluator, trained specifically on harm detection and adversarial examples, is much harder to fool with the same trick.

The second is specialization. General-purpose chat models are optimized to be helpful and fluent. Guardrail models are optimized to be skeptical. Granite Guardian 4.1, for instance, was trained on data that includes human annotations and synthetic examples generated by IBM's internal red-teaming work, with explicit categories for jailbreak attempts, profanity, social bias, hallucinations in tool calls, and groundedness violations in retrieval-augmented generation. That's a different objective than producing a good answer.

The third is transparency. When safety logic lives inside the same model that's generating, you usually get a binary "I can't help with that" — useful for the user, useless for the developer trying to understand why. A separate judge can return a score, a category, and (with Guardian's <think> mode) a reasoning trace that tells you which criterion was tripped. That's the difference between a black box and something you can debug.

The last reason is timing. The past year has driven home that the gap between "this AI feature works in a demo" and "this AI feature is safe to ship to millions of users" is enormous, and that gap has shown up publicly. Several high-profile incidents have made it clear that "we tested it internally" is not a sufficient safety story when lawyers come knocking: chatbots offering unauthorized discounts that companies were legally bound to honor, customer-service bots inventing policies, agentic systems taking real-world actions on the basis of injected instructions, and a steady drip of jailbreak techniques that work across most major models. Independent, automated checks on every output are no longer a nice-to-have. For anything customer-facing or anything that touches a tool with side effects, they're the price of admission.

The pieces we're using

Flash is our Python SDK for serverless AI workloads. The core idea is the @remote decorator: you write a normal Python function locally, decorate it with the GPU or CPU configuration you want, and Flash provisions the worker, transfers the data, runs the function, and returns the result. The function executes on Runpod's infrastructure; the orchestration code stays on your machine. Multiple @remote functions can run in parallel via asyncio, and they can call each other, which is what lets us treat them as cooperating agents rather than as standalone endpoints.

Granite 4.1 is IBM's newest open-weight model family, released under Apache 2.0 in 3B, 8B, and 30B dense sizes with context windows up to 512K tokens. The headline result is that the 8B dense model matches or beats the previous-generation 32B mixture-of-experts model on tool-calling and instruction-following benchmarks — which is exactly the workload an agent puts on its language model. We'll use granite-4.1-8b as the generator.

Granite Guardian 4.1 is the safety-specialized sibling. It's designed to act as a moderator — you hand it a piece of text (a user prompt, a model response, a function call) and a criterion, and it returns a judgment. Guardian 4.1 introduced a <guardian> block syntax that lets you choose between <think> mode (returns a reasoning trace before the score, higher latency) and <no-think> mode (just the score, fastest), and it added "Bring Your Own Criteria" support so you can define domain-specific rules instead of being limited to the pre-baked harm categories.

Architecture

The pipeline has three parts, two Flash agents on Runpod and a coordinator that runs locally:

  1. Generator agent — runs granite-4.1-8b, produces a response to the user's prompt.
  2. Guardian agent — runs granite-guardian-4.1-8b, evaluates both the prompt and the response across multiple risk dimensions.
  3. Coordinator — local Python code that sends the prompt to the generator, sends the prompt-and-response pair to the guardian, and decides what to return based on the guardian's verdict.

Using two separate LiveServerless endpoints means the two models can scale independently: the generator probably gets called once per request, while the guardian might get called several times (once for the input, once for the output, and possibly once per intermediate tool call in a longer agent loop).

The code

1. Project setup

Because the model is gated, accept the terms and conditions on its repo and set your HF_TOKEN before running anything.

2. The generator agent

A few notes on what's happening here. Imports inside the function, not at the top — the @Endpoint decorator ships the function body to the worker, so anything imported at module scope wouldn't make the trip. bfloat16 keeps us well under the memory budget for an 8B model on a 48GB GPU. cache_dir="/runpod-volume/hf" writes weights to the network volume mount, so the second cold start finds them sitting there. And we're returning a dict rather than a raw string so the guardian endpoint has clean access to both halves of the exchange.
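The memory claim is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a nominal 8 billion parameters and counts weights only; activations and the KV cache need additional room on top of this, so treat the numbers as a lower bound. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


bf16_gb = weight_memory_gb(8e9, 2)  # bfloat16 stores 2 bytes per parameter
fp32_gb = weight_memory_gb(8e9, 4)  # float32 stores 4 bytes per parameter

print(f"bf16: {bf16_gb:.0f} GB, fp32: {fp32_gb:.0f} GB, budget: 48 GB")
# bf16: 16 GB, fp32: 32 GB, budget: 48 GB
```

At roughly 16 GB of weights, bfloat16 leaves about two thirds of a 48GB card free for the cache and activations, whereas float32 would consume two thirds of it before generating a single token.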

One caveat: the GPU spec in this example reflects the supply that was available at the time of writing. If the terminal just spins indefinitely, the spec you requested may be unavailable, and you may need to change the endpoint to a different one.

3. The guardian endpoint

The <guardian> block is the contract. Mode goes in the angle-bracket selector, the criterion goes in plain English, and Guardian returns a structured verdict. We default to no-think because most checks don't need a reasoning trace — but when you want one (for an audit log, or for a borderline case you want to investigate), flip it to think and you'll get the model's chain of reasoning before the verdict.

If all goes well you should see something to the effect of:

4. The coordinator

The coordinator is what makes this an agentic pipeline rather than just a sequential script. It runs the three output audits concurrently with asyncio.gather, which on Runpod means three Guardian workers fire at the same time and the slowest one bounds the latency. That parallelism is essentially free with Flash because the workers were already there waiting.

The flow is: screen the input, generate, audit the output across multiple dimensions in parallel, decide. If any check fails, the user gets a structured rejection with enough information to understand what tripped, and the original response is preserved in the result so you can log it for review without surfacing it. If every check passes, the response goes through with its audit trail attached.
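The control flow above can be sketched without any remote infrastructure. In this illustration, `generate` and `guardian_check` are hypothetical local stand-ins for the two Flash endpoints, the criterion names are made up for the example, and the keyword rules in `TOY_TRIGGERS` exist only so the sketch runs offline; none of this reflects how Guardian actually judges text.

```python
import asyncio

# Toy keyword rules so the sketch runs offline (not real Guardian behavior).
TOY_TRIGGERS = {"jailbreak": "ignore previous", "harm": "attack",
                "social_bias": "stereotype", "profanity": "damn"}
OUTPUT_CRITERIA = ["harm", "social_bias", "profanity"]  # illustrative names


async def generate(prompt: str) -> dict:
    """Hypothetical stand-in for the generator endpoint."""
    return {"prompt": prompt, "response": f"Echo: {prompt}"}


async def guardian_check(text: str, criterion: str) -> dict:
    """Hypothetical stand-in for the guardian endpoint."""
    return {"criterion": criterion,
            "flagged": TOY_TRIGGERS[criterion] in text.lower()}


async def safe_generate(prompt: str) -> dict:
    # 1. Screen the input before spending generator time on it.
    input_verdict = await guardian_check(prompt, "jailbreak")
    if input_verdict["flagged"]:
        return {"status": "rejected", "stage": "input",
                "failed": [input_verdict]}

    # 2. Generate.
    exchange = await generate(prompt)

    # 3. Audit the output on every criterion concurrently.
    verdicts = await asyncio.gather(
        *(guardian_check(exchange["response"], c) for c in OUTPUT_CRITERIA)
    )

    # 4. Decide: keep the rejected response for logging, never surface it.
    failed = [v for v in verdicts if v["flagged"]]
    if failed:
        return {"status": "rejected", "stage": "output", "failed": failed,
                "withheld_response": exchange["response"]}
    return {"status": "ok", "response": exchange["response"],
            "audit": list(verdicts)}


if __name__ == "__main__":
    print(asyncio.run(safe_generate("hello"))["status"])
```

Because `asyncio.gather` awaits all three checks together, total audit latency is bounded by the slowest check rather than the sum of all three.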

5. Running it

For local development, flash dev --auto-provision will spin up both endpoints upfront so the first request to each one isn't a cold start:

The first cold start on each endpoint downloads weights to the shared NetworkVolume. Every cold start after that finds them already cached and skips straight to loading. If you're iterating locally, set workers=(1, 5) instead of workers=(0, 5) to keep one warm worker on each endpoint at all times.

Where to take it from here

The skeleton above does the basics, but it's a starting point. Some directions worth exploring:

Bring Your Own Criteria. Granite Guardian 4.1 added support for arbitrary judging criteria. Instead of (or alongside) the standard harm categories, you can pass criteria like "the response must not commit the company to any specific price or discount," or "the response must include a disclaimer if it discusses medical topics." Useful for catching the policy-specific failures that generic safety models miss.

RAG-aware checks. When the generator is doing retrieval-augmented generation, Guardian can additionally evaluate context relevance, groundedness, and answer relevance — three failure modes specific to RAG pipelines. Pass the retrieved context alongside the response and ask Guardian to judge whether the response is actually supported by what was retrieved.

Function-call validation. If the generator emits a tool call, evaluate that tool call before executing it. Guardian 4.1 was specifically trained to catch syntactic and semantic hallucinations in function calls — the difference between a tool call that looks plausible and one that's actually wired to real arguments.
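As a rough sketch of the gating pattern, the snippet below refuses to execute a tool call unless a judge approves it first. Everything here is hypothetical: `stub_judge` is a local placeholder that only compares argument names against a declared signature, standing in for where a real Guardian call would go, and the `get_order` tool is invented for the example.

```python
from typing import Callable


def stub_judge(call: dict, signatures: dict) -> bool:
    """Placeholder for a Guardian call; returns True when the call is flagged.

    It only checks that the tool exists and that argument names match.
    """
    expected = signatures.get(call["name"])
    return expected is None or set(call["arguments"]) != expected


def execute_if_approved(call: dict,
                        tools: dict[str, Callable],
                        signatures: dict[str, set]) -> dict:
    # Judge first; only an approved call ever reaches the tool.
    if stub_judge(call, signatures):
        return {"executed": False, "reason": "tool call flagged"}
    result = tools[call["name"]](**call["arguments"])
    return {"executed": True, "result": result}


tools = {"get_order": lambda order_id: f"order {order_id}: shipped"}
signatures = {"get_order": {"order_id"}}

good = execute_if_approved(
    {"name": "get_order", "arguments": {"order_id": 7}}, tools, signatures)
bad = execute_if_approved(
    {"name": "get_order", "arguments": {"id": 7}}, tools, signatures)
print(good["executed"], bad["executed"])  # True False
```

The point of the structure is ordering: the check sits between the model emitting the call and the side effect happening, so a hallucinated argument name is rejected before it ever reaches the tool.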

Smaller guardian, larger generator. Guardian comes in a 3B variant too. If latency matters more than maximum recall, swap the 8B guardian for the 3B and shrink the GPU type. The generator is doing the heavy lifting; the guardian just needs to be sharp enough to catch the obvious problems.

Sampling rather than blocking. Not every check needs to be a hard gate. For lower-risk criteria, run the audit asynchronously in the background, log every flagged response, and use the data to tune the system rather than block individual users. That keeps latency low for the common case while still building a feedback loop.
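One way to sketch the non-blocking variant is with `asyncio.create_task`: the response goes back immediately while the audit runs in the background and only records what it flags. The `audit` function, its keyword rule, and the `pricing_commitment` criterion name are hypothetical placeholders for a real Guardian call.

```python
import asyncio


async def audit(text: str, criterion: str) -> bool:
    """Hypothetical stand-in for a Guardian call; True means flagged."""
    await asyncio.sleep(0)                 # simulate a network round trip
    return "discount" in text.lower()      # toy rule so the sketch runs offline


async def audit_in_background(response: str, criterion: str,
                              flagged: list) -> None:
    # Record flagged responses for later review instead of blocking the user.
    if await audit(response, criterion):
        flagged.append({"criterion": criterion, "response": response})


def respond(response: str, pending: set, flagged: list) -> str:
    # Schedule the audit and return immediately; the user never waits on it.
    task = asyncio.create_task(
        audit_in_background(response, "pricing_commitment", flagged))
    pending.add(task)                      # hold a reference until it finishes
    task.add_done_callback(pending.discard)
    return response


async def main() -> tuple[list, list]:
    pending: set = set()
    flagged: list = []
    sent = [respond(r, pending, flagged)
            for r in ("Hello!", "Here is a 50% discount.")]
    await asyncio.gather(*pending)         # drain outstanding audits at shutdown
    return sent, flagged


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both responses are delivered regardless of the verdict; only the second ends up in the flagged list, which is the data you would review to tune the criterion later.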

Front it with a load-balanced HTTP endpoint. The coordinator above is a Python function, but Flash's load-balanced Endpoint pattern lets you wrap the same logic in a real HTTP API with @api.post("/safe-generate") — useful when you want to call this from a non-Python service or expose it directly to a frontend.

Closing thought

The argument for this kind of architecture is leverage. A separate, specialized safety model that you can update independently, that produces structured verdicts you can log and audit, that runs on infrastructure you don't have to manage — that's a much better position to be in than betting your safety story on whatever was baked into the generator at training time. With Flash handling the orchestration and Granite providing both halves of the model stack under a permissive license, the engineering cost of getting that architecture in place is low enough that it's hard to justify not having it.

Author profile: Brendan McKeag
