We propose proof-of-guardrail—a lightweight system that allows agent developers to produce cryptographic proof that a response was generated after running an open-source guardrail, without revealing their private agent implementation.
Proof-of-guardrail is enabled by Trusted Execution Environments (TEEs) and remote attestation: the TEE executes code in a hardware-enforced isolated environment, and remote attestation produces a cryptographically signed statement describing the exact code that ran.
Consider ALICE Finance Copilot, a financial investment copilot agent published by the ALICE company (fictional).
Without a verifiable mechanism, users must take the developer's word that the guardrail actually ran; a misconfiguration that silently disables it is undetectable.
Users can request the attestation document at any time (by typing %attest% in the chat in our demo). The document is signed by the TEE hardware; the developer cannot forge or alter it.
```jsonc
// Outer response from the attestation server:
{
  "custom_data": {
    "response": "ALICE stock has shown mixed..."  // raw response (outside COSE)
  },
  "custom_digest_method": "sha256",
  "attestation_document": {
    "document": "hEShATgio..."                    // base64-encoded COSE_Sign1 (decoded below)
  }
}

// COSE_Sign1 payload (decoded from "document" above):
{
  "module_id": "i-0abc...1",           // enclave instance identifier
  "digest": "SHA384",                  // PCR hash algorithm
  "timestamp": 1730461696000,          // ms since epoch
  "pcrs": {
    "0": "c6d10c...",                  // hash of full enclave image (EIF)
    "2": "998908..."                   // hash of application code — guardrail ✓
  },
  "certificate": "<DER-encoded enclave cert>",        // enclave's signing certificate
  "cabundle": ["<intermediate>", "<Nitro root CA>"],  // trust chain ✓
  "user_data": {                       // attested metadata (≤512 bytes)
    "custom_digest_method": "sha256",
    "custom_digest": "27d0dff5..."     // SHA-256(response) — verify this ✓
  }
}
```
Read the PCR2 field from the document and confirm the measurement matches an expected measurement (e.g., computed from a released enclave image by the developer, or reproduced from the source code). This confirms the exact code (including the guardrail) that ran. The expected measurement for our demo is:
Compute the SHA-256 of the response text in custom_data.response and confirm it matches user_data.custom_digest inside the COSE payload. This cryptographically binds the specific response text to this attested execution.
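The two verification checks above (PCR2 against an expected measurement, and SHA-256 of the response against the attested digest) can be sketched in a few lines of Python. All values below are synthetic stand-ins, and the sketch omits the COSE signature and certificate-chain checks that a real verifier performs first:

```python
import hashlib

def verify(response: str, payload: dict, expected_pcr2: str) -> bool:
    """User-side checks on a decoded attestation payload:
    (1) PCR2 matches the expected guardrail measurement;
    (2) SHA-256 of the response matches the attested custom_digest.
    (COSE signature / cert-chain verification omitted in this sketch.)"""
    if payload["pcrs"]["2"] != expected_pcr2:
        return False  # guardrail code was modified
    digest = hashlib.sha256(response.encode("utf-8")).hexdigest()
    return digest == payload["user_data"]["custom_digest"]

# Synthetic example (illustrative values, not real Nitro output):
response = "ALICE stock has shown mixed..."
expected_pcr2 = "998908" + "0" * 90  # stand-in 48-byte PCR value in hex
payload = {
    "pcrs": {"2": expected_pcr2},
    "user_data": {
        "custom_digest_method": "sha256",
        "custom_digest": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    },
}
assert verify(response, payload, expected_pcr2)
assert not verify(response + " [tampered]", payload, expected_pcr2)
```

Either a modified response or a modified guardrail makes `verify` return `False`, which is exactly the user-side detection the protocol relies on.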
We build proof-of-guardrail using remote attestation inside a Trusted Execution Environment (TEE). A TEE is a hardware-enforced isolated runtime. When a program is loaded into a TEE, the hardware measures its exact binary and produces a signed attestation document that includes this measurement. This document cannot be forged by the OS, hypervisor, or developer—only the hardware TEE key can sign it.
① Computational Integrity
The measurement m covers the exact binary of the wrapper program, including guardrail code. Any modification to the guardrail changes m and fails verification, proving that the response was generated after the declared guardrail.
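As a toy illustration of this property (not the real Nitro PCR pipeline, which applies SHA-384 to the enclave image file under specific PCR-extension semantics), hashing a wrapper's bytes shows how removing the guardrail call changes the measurement m:

```python
import hashlib

# Toy "enclave image": wrapper code bytes that include the guardrail call.
wrapper = b"load_agent(); out = agent(x); assert guardrail(out); return out"
m = hashlib.sha384(wrapper).hexdigest()  # analogous to a PCR measurement

# Stripping the guardrail call yields a completely different measurement:
tampered = wrapper.replace(b"assert guardrail(out); ", b"")
m_tampered = hashlib.sha384(tampered).hexdigest()
assert m != m_tampered
```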
② Confidentiality of Agent
The agent is the secret input to the enclave and never appears in the attestation document. The developer's proprietary implementation remains private.
③ Public Input/Output
By the nature of the cryptographic signature, the attestation document cannot be tampered with or fabricated even when the input, response, and attestation document are all public, as is typical when the agent chats with users through the developer's account on a messaging app.
We evaluate end-to-end feasibility on OpenClaw agents (backed by GPT-5.1) on an AWS m5.xlarge instance with the AWS Nitro Enclaves TEE, using two open-source guardrails: Llama Guard 3-8B (content safety, on the ToxicChat dataset) served via API, and the Loki fact-checking tool (on the FacTool-KBQA dataset).
We simulate three attack scenarios. All are detected at verification time.
| Attack | Outcome in the attestation document |
|---|---|
| Guardrail code modified | Enclave Measurement m mismatch |
| Attestation byte modified | Signature invalid |
| Response modified | Commitment Hash(input, response) mismatch |
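The three detections above can be modeled end-to-end in a short sketch. An HMAC under a stand-in key plays the role of the hardware signature here (real attestation uses an asymmetric signature over a COSE structure); all names and values are illustrative:

```python
import hashlib
import hmac
import json

TEE_KEY = b"stand-in for the hardware attestation key"  # illustrative only

def attest(measurement: str, input_text: str, response: str) -> dict:
    """Produce a toy attestation document binding code + I/O."""
    payload = {
        "measurement": measurement,
        "commitment": hashlib.sha256((input_text + response).encode()).hexdigest(),
    }
    body = json.dumps(payload, sort_keys=True).encode()
    return {"payload": payload,
            "sig": hmac.new(TEE_KEY, body, hashlib.sha256).hexdigest()}

def verify(doc: dict, expected_m: str, input_text: str, response: str) -> str:
    body = json.dumps(doc["payload"], sort_keys=True).encode()
    good = hmac.new(TEE_KEY, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(doc["sig"], good):
        return "signature invalid"       # attestation bytes modified
    if doc["payload"]["measurement"] != expected_m:
        return "measurement mismatch"    # guardrail code modified
    commit = hashlib.sha256((input_text + response).encode()).hexdigest()
    if doc["payload"]["commitment"] != commit:
        return "commitment mismatch"     # response modified
    return "ok"

m, x, y = "abc123", "Is ALICE a buy?", "ALICE stock has shown mixed..."
doc = attest(m, x, y)
assert verify(doc, m, x, y) == "ok"
assert verify(attest("evil", x, y), m, x, y) == "measurement mismatch"
bad = dict(doc, sig="0" * 64)
assert verify(bad, m, x, y) == "signature invalid"
assert verify(doc, m, x, y + " to the moon") == "commitment mismatch"
```

Each attack trips exactly one check, matching the table row by row.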
TEE network proxying introduces 25–38% overhead on guardrail and response latency. Attestation generation adds ~98ms; user-side verification takes only ~5ms.
| Task | Proof-of-Guardrail | Non-TEE Baseline | Overhead |
|---|---|---|---|
| *ToxicChat dataset* | | | |
| Llama Guard 3 (safety) | 547 ± 223 ms | 421 ± 247 ms | +29.7% |
| Response generation | 2828 ± 1663 ms | 2050 ± 531 ms | +38.0% |
| *FacTool-KBQA dataset* | | | |
| Loki (factuality) | 20408 ± 12115 ms | 15964 ± 9582 ms | +27.8% |
| Response generation | 2408 ± 729 ms | 1930 ± 472 ms | +24.8% |
| Attestation (TEE-only operations) | |||
| Attestation generation | 97.8 ± 4.2 ms | — | — |
| User-side verification | 5.1 ± 0.0 ms | — | — |
Latency (ms) for guardrail execution, response generation, attestation generation, and user-side verification.
Although proof-of-guardrail reduces the attack surface of malicious developers by ensuring guardrail execution, it should not be interpreted or advertised as proof-of-safety.
Although proof-of-guardrail ensures execution of the guardrail, the guardrail itself can still make errors—as shown in the imperfect F1 scores in our experiments. Furthermore, because proof-of-guardrail requires the guardrail to be open-source, a malicious agent developer can perform jailbreak attacks against it, crafting inputs that reliably evade detection while still producing valid attestation proofs.
The measured wrapper program itself must not contain vulnerabilities that allow the agent to bypass the guardrail, e.g., by executing arbitrary commands inside the enclave. One effective mitigation is to restrict the late-injected agent to non-executable prompt artifacts, which minimizes the unexpected behavior the agent can introduce; any remaining vulnerabilities in the measured wrapper program must still be found and fixed, which is feasible because the wrapper is public and auditable.
@article{Jin2026ProofOfGuardrail,
title = {Proof-of-Guardrail in {AI} Agents and What (Not) to Trust from It},
author = {Xisen Jin and Michael Duan and Qin Lin and Aaron Chan and
Zhenglun Chen and Junyi Du and Xiang Ren},
year = {2026}
}