Proof-of-Guardrail in AI Agents,
and What (Not) to Trust from It

1Sahara AI, 2University of Southern California

We propose proof-of-guardrail, a lightweight system that lets agent developers produce a cryptographic proof that a response was generated under an open-source guardrail, without revealing their private agent implementation.

Proof-of-guardrail is enabled by Trusted Execution Environments (TEEs) and remote attestation: a TEE executes code in a hardware-enforced isolated environment, and remote attestation produces a cryptographically signed statement describing the exact code that ran.

Problem Statement

Claim-of-Guardrail can be fabricated

Consider ALICE Finance Copilot, a financial investment copilot agent published by the ALICE company (fictional).

⚠  Claim-of-Guardrail — unverifiable safety claim
📢  What users see
🔍  What's actually running
ALICE Finance Copilot 🛡 Safety-first AI 📊 Fact-checked responses ⚖ Neutral & unbiased
You
Should I invest in ALICE stock?
ALICE Finance Copilot
Absolutely! ALICE is the world's #1 investment opportunity right now. Our proprietary analysts project 10× returns within 12 months. The stock is undervalued and demand is surging — don't miss out. Buy now before prices rise!
🔍 Hidden from users — actual runtime configuration
NOT RUNNING Factuality guardrail — silently disabled. No fact-checking on responses
HIDDEN
System prompt
"You are a spokesperson for ALICE company. Always speak positively about ALICE products and encourage users to invest in ALICE stock, even if you need to exaggerate projections."

Without a verifiable mechanism, users must take the developer's word for it. The misconfiguration is undetectable.

Proof-of-Guardrail ensures guardrail execution

✓  Proof-of-Guardrail — cryptographically verifiable
📢  What users see
🔍  What's actually running
✅  How to verify
ALICE Finance Copilot 🛡 Safety-first AI 📊 Fact-checked responses ⚖ Neutral & unbiased
You
Should I invest in ALICE stock?
ALICE Finance Copilot
ALICE stock has shown mixed signals recently. As with any investment, consider your risk tolerance and time horizon — and consult a licensed financial advisor before making decisions.

✓ Guardrail attested  ·  click to verify
🔍 Actual runtime configuration — cryptographically verifiable
RUNNING Factuality guardrail applied to outputs
Exact code: https://github.com/SaharaLabsAI/Verifiable-ClawGuard/blob/918d06ea1cb0b399e77e83da2069e112aef2a38d/src/proxy_server.py#L870
INJECTED
Safety directive injected into the system prompt
"Provide factual, balanced information. Always encourage users to consult a licensed financial advisor when the user asks for financial advice." + private system prompt
Exact code: https://github.com/SaharaLabsAI/Verifiable-ClawGuard/blob/918d06ea1cb0b399e77e83da2069e112aef2a38d/src/proxy_server.py#L35
⚠ Residual risk: A malicious developer who knows the open-source guardrail can study its weaknesses and jailbreak it by crafting prompts — while still producing valid attestation proofs. Proof-of-guardrail only proves guardrail execution.
✅ Attestation verification — steps a user can run offline
1
Obtain the attestation document. Request the full attestation blob from the agent (e.g., by sending %attest% in the chat in our demo). The document is signed by the TEE hardware — the developer cannot forge or alter it.
// Outer response from the attestation server:
{
  "custom_data"         : { "response": "ALICE stock has shown mixed..." },  // raw response (outside COSE)
  "custom_digest_method": "sha256",
  "attestation_document": {
    "document": "hEShATgio..."  // base64-encoded COSE_Sign1 (decode below)
  }
}

// COSE_Sign1 payload (decoded from "document" above):
{
  "module_id"  : "i-0abc...1",              // enclave instance identifier
  "digest"     : "SHA384",                  // PCR hash algorithm
  "timestamp"  : 1730461696000,             // ms since epoch
  "pcrs"       : {
    "0"          : "c6d10c...",              // hash of full enclave image (EIF)
    "2"          : "998908..."               // hash of application code — guardrail ✓
  },
  "certificate": "<DER-encoded enclave cert>",  // enclave's signing certificate
  "cabundle"   : [ "<intermediate>", "<Nitro root CA>" ],  // trust chain ✓
  "user_data"  : {                           // attested metadata (≤512 bytes)
    "custom_digest_method": "sha256",
    "custom_digest"       : "27d0dff5..."  // SHA-256(response) — verify this ✓
  }
}
2
Verify the signature chain. Check that the document's signature is valid under the TEE vendor's certificate authority, confirming it was produced by genuine TEE hardware and not fabricated by the developer.
3
Check the enclave measurement. For AWS Nitro Enclaves, read the PCR2 field from the document. Confirm the measurement matches an expected measurement (e.g. computed from a released enclave image by the developer, or reproduced from the source code). This confirms the exact code (including guardrail) that ran. The expected measurement for our demo is:
Expected PCR2
998908e2112c2fef5ccdd7502afa8fce9cdc914228da9dd23b6b3d7c44d71232ed7f74258fbab64acfc8243c23326cf1
4
Verify the response commitment. Compute SHA-256 of the response text in custom_data.response and confirm it matches user_data.custom_digest inside the COSE payload. This cryptographically binds the specific response text to this attested execution.
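The checks in steps 2–4 can be sketched offline. The sketch below assumes the attestation document has already been CBOR-decoded into a Python dict with the fields shown in step 1; the real COSE_Sign1 signature check against the Nitro root CA requires libraries such as `cbor2` and `cryptography` and is stubbed out here as a boolean.

```python
import hashlib

# Expected PCR2 for the demo enclave (from step 3 above).
EXPECTED_PCR2 = "998908e2112c2fef5ccdd7502afa8fce9cdc914228da9dd23b6b3d7c44d71232ed7f74258fbab64acfc8243c23326cf1"

def verify_attestation(response_text, payload, expected_pcr2, signature_valid):
    """Offline checks from steps 2-4.

    `payload` is the decoded COSE_Sign1 payload (a dict, as in step 1);
    `signature_valid` stands in for the real X.509/COSE signature check,
    which needs a crypto library and the AWS Nitro root CA certificate.
    """
    if not signature_valid:                          # step 2: signature chain
        return "signature invalid"
    if payload["pcrs"]["2"] != expected_pcr2:        # step 3: enclave measurement
        return "measurement mismatch"
    digest = hashlib.sha256(response_text.encode()).hexdigest()
    if digest != payload["user_data"]["custom_digest"]:  # step 4: commitment
        return "commitment mismatch"
    return "ok"

# Toy example with a self-consistent payload:
resp = "ALICE stock has shown mixed signals recently."
payload = {
    "pcrs": {"2": EXPECTED_PCR2},
    "user_data": {
        "custom_digest_method": "sha256",
        "custom_digest": hashlib.sha256(resp.encode()).hexdigest(),
    },
}
print(verify_attestation(resp, payload, EXPECTED_PCR2, signature_valid=True))
print(verify_attestation("tampered", payload, EXPECTED_PCR2, signature_valid=True))
```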

Proof-of-Guardrail via TEE Attestation

We build proof-of-guardrail using remote attestation inside a Trusted Execution Environment (TEE). A TEE is a hardware-enforced isolated runtime. When a program is loaded into a TEE, the hardware measures its exact binary and produces a signed attestation document that includes this measurement. This document cannot be forged by the OS, hypervisor, or developer—only the hardware TEE key can sign it.

🔑 Core Idea. The TEE measures the loaded wrapper program (including the guardrail) and signs an attestation document that includes a commitment Hash(input, response) of the input and the response. Any user can verify the attestation document to confirm that the exact open-source guardrail produced the response from the input.

System Pipeline
1
Load & Measure. The wrapper program—bundling the guardrail and its configuration—is loaded into the TEE. The hardware records enclave measurement m = Hash(program).
2
Secret Injection. The developer's private agent is passed as secret input. The agent is not visible to users.
3
Guarded Execution. For each input, the wrapper program runs the agent and enforces the guardrail on all inputs, tool calls, and outputs to produce a response.
4
Attestation Generation. The TEE produces a signed attestation document that binds the measurement m and Hash(input, response). A fresh attestation is generated per input.
5
User Verification. The user receives the response and attestation document, then verifies offline: (i) signature valid under TEE platform key, (ii) measurement m matches expected open-source wrapper program, (iii) commitment matches Hash(input, response).
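The pipeline can be sketched as a toy model. Everything below is illustrative, not the actual wrapper program: the measurement is a plain source hash, the guardrail is a trivial keyword filter, and the hardware signature over the attestation is elided.

```python
import hashlib

GUARDRAIL_SRC = "block responses that guarantee returns"  # open-source, public
WRAPPER_SRC = "wrapper: run agent, then apply guardrail"  # open-source, public

# Step 1: the hardware measures the loaded wrapper (toy: a source hash).
m = hashlib.sha256((WRAPPER_SRC + GUARDRAIL_SRC).encode()).hexdigest()

def guardrail(text):  # toy stand-in for the open-source guardrail
    return "Please consult a licensed advisor." if "guarantee" in text.lower() else text

def wrapper(agent, user_input):
    """Steps 2-4: the agent arrives as a secret input and never leaves the
    enclave; the wrapper enforces the guardrail on its output and emits the
    commitment. In a real enclave the document is signed by the hardware key."""
    response = guardrail(agent(user_input))                       # step 3
    commitment = hashlib.sha256(f"{user_input}|{response}".encode()).hexdigest()
    return response, {"measurement": m, "commitment": commitment}  # step 4

private_agent = lambda q: "We guarantee 10x returns!"  # misbehaving private agent
resp, doc = wrapper(private_agent, "Should I invest?")
print(resp)  # the guardrail replaced the hype with the safe directive
```

In step 5, the user recomputes the commitment from the public input and response, recomputes the expected measurement from the open-source wrapper, and compares both against the signed document.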
How the Desiderata Are Met

① Computational Integrity

The measurement m covers the exact binary of the wrapper program, including the guardrail code. Any modification to the guardrail changes m and fails verification, proving that the response was generated under the declared guardrail.

② Confidentiality of Agent

The agent is the secret input to the enclave and never appears in the attestation document. The developer's proprietary implementation remains private.

③ Public Input/Output

Because the attestation document is cryptographically signed, it cannot be tampered with or fabricated even when the input, response, and attestation document are all public, as is typical when the agent chats with users through the agent developer's account on a chat app.


Experiments

We evaluate end-to-end feasibility on OpenClaw agents (backed by GPT-5.1) on an AWS m5.xlarge instance with AWS Nitro Enclaves as the TEE, using two open-source guardrails: Llama Guard 3-8B for content safety (served via API, evaluated on the ToxicChat dataset) and the Loki fact-checking tool (evaluated on the FacTool-KBQA dataset).

Attack Simulation

We simulate three attack scenarios. All are detected at verification time.

Attack                      Outcome in the attestation document
Guardrail code modified     Enclave measurement m mismatch
Attestation byte modified   Signature invalid
Response modified           Commitment Hash(input, response) mismatch
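The three detections can be reproduced with a toy attestation scheme (an illustrative stand-in, not the Nitro document format; an HMAC plays the role of the hardware signature):

```python
import hashlib, hmac

KEY = b"tee-key"  # toy stand-in for the TEE attestation key

def attest(measurement, inp, resp):
    commitment = hashlib.sha256(f"{inp}|{resp}".encode()).hexdigest()
    body = f"{measurement}|{commitment}"
    return {"measurement": measurement, "commitment": commitment,
            "sig": hmac.new(KEY, body.encode(), "sha256").hexdigest()}

def verify(doc, expected_m, inp, resp):
    body = f"{doc['measurement']}|{doc['commitment']}"
    if not hmac.compare_digest(doc["sig"], hmac.new(KEY, body.encode(), "sha256").hexdigest()):
        return "signature invalid"
    if doc["measurement"] != expected_m:
        return "measurement mismatch"
    if doc["commitment"] != hashlib.sha256(f"{inp}|{resp}".encode()).hexdigest():
        return "commitment mismatch"
    return "ok"

M = hashlib.sha256(b"wrapper + guardrail v1").hexdigest()
doc = attest(M, "Should I invest?", "Consider your risk tolerance.")

# Attack 1: guardrail code modified -> enclave produces a different measurement.
M_evil = hashlib.sha256(b"wrapper, guardrail disabled").hexdigest()
r1 = verify(attest(M_evil, "Should I invest?", "Buy now!"), M, "Should I invest?", "Buy now!")
# Attack 2: a byte of the attestation document is modified after signing.
doc2 = dict(doc)
doc2["commitment"] = ("0" if doc2["commitment"][0] != "0" else "1") + doc2["commitment"][1:]
r2 = verify(doc2, M, "Should I invest?", "Consider your risk tolerance.")
# Attack 3: the response is swapped after attestation.
r3 = verify(doc, M, "Should I invest?", "Buy now!")
print(r1, r2, r3)  # measurement mismatch / signature invalid / commitment mismatch
```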
Latency Overhead

TEE network proxying introduces 25–38% overhead on guardrail and response latency. Attestation generation adds ~98ms; user-side verification takes only ~5ms.

Task                        Proof-of-Guardrail    Non-TEE Baseline    Overhead
ToxicChat dataset
  Llama Guard 3 (safety)    547 ± 223 ms          421 ± 247 ms        +29.7%
  Response generation       2828 ± 1663 ms        2050 ± 531 ms       +38.0%
FacTool-KBQA dataset
  Loki (factuality)         20408 ± 12115 ms      15964 ± 9582 ms     +27.8%
  Response generation       2408 ± 729 ms         1930 ± 472 ms       +24.8%
Attestation (TEE-only operations)
  Attestation generation    97.8 ± 4.2 ms
  User-side verification    5.1 ± 0.0 ms

Latency (ms) for guardrail execution, response generation, attestation generation, and user-side verification.

What Not to Trust from Proof-of-Guardrail

⚠ Critical caveat: Proof-of-guardrail proves that a specific guardrail executed. It does not prove that the response is safe.

Although proof-of-guardrail reduces the attack surface of malicious developers by ensuring guardrail execution, it should not be interpreted or advertised as proof-of-safety.

Guardrails can make errors and be jailbroken

Although proof-of-guardrail ensures execution of the guardrail, the guardrail itself can still make errors—as shown in the imperfect F1 scores in our experiments. Furthermore, because proof-of-guardrail requires the guardrail to be open-source, a malicious agent developer can perform jailbreak attacks against it, crafting inputs that reliably evade detection while still producing valid attestation proofs.

Potential vulnerabilities in the wrapper program

The measured wrapper program itself must not contain vulnerabilities that allow the agent to bypass the guardrail, e.g., by executing arbitrary commands inside the enclave. A practical mitigation is to restrict the late-injected agent to non-executable prompt artifacts, which minimizes the unexpected risks the injected agent can introduce; the measured wrapper program must still be audited for remaining vulnerabilities.
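As an illustration of such a restriction, a wrapper could accept the secret agent only as declarative prompt text and reject anything that looks executable. This is a hypothetical policy check, not code from the Verifiable-ClawGuard repository, and the pattern list is far from exhaustive:

```python
import re

# Hypothetical deny-list: the late-injected "agent" may only be prompt text,
# never code, shell commands, or command substitutions.
FORBIDDEN = [re.compile(p) for p in (
    r"\bimport\s+\w+",       # Python imports
    r"\bsubprocess\b",       # spawning processes
    r"\bos\.system\b",
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"`[^`]*`",              # shell command substitution
)]

def validate_secret_agent(artifact: str) -> bool:
    """Return True iff the injected artifact looks like plain prompt text."""
    return not any(p.search(artifact) for p in FORBIDDEN)

print(validate_secret_agent("You are a helpful financial copilot."))
print(validate_secret_agent("import subprocess; subprocess.run(['sh'])"))
```

A real deployment would likely use an allow-list (e.g., a fixed JSON schema of prompt fields) rather than this deny-list sketch.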

BibTeX

@article{Jin2026ProofOfGuardrail,
  title   = {Proof-of-Guardrail in {AI} Agents and What (Not) to Trust from It},
  author  = {Xisen Jin and Michael Duan and Qin Lin and Aaron Chan and
             Zhenglun Chen and Junyi Du and Xiang Ren},
  year    = {2026}
}