Building an AI Audit Trail Regulators Will Actually Accept

Most “AI audit trails” we have inherited from other teams are application logs with the prompt added as a string. When the regulator’s question is “why did the model decide to deny this customer’s claim?”, those logs cannot answer. The prompt is there. The decision is there. The reasoning that connected them is not.

This is a concrete schema for an AI audit log that holds up to a real regulator’s question. It is the format we ship as standard for clients in BFSI, healthcare, and the public sector. It is opinionated, and the opinions are earned.

What regulators actually ask for

Across the RBI inspections, SEBI examinations, and EU AI Act conformity assessments we have observed so far, the questions cluster:

Reproducibility. Show me the inputs to this decision exactly as the model received them.
Provenance. What data did the model use to make this decision, and where did that data come from?
Counterfactuals. What would the model have decided if input X had been different?
Override frequency. How often do humans override the model? What patterns do you see?
Drift. Has the model’s behaviour changed over time, and how do you know?
Incident reconstruction. A specific decision went wrong. Walk me through it.

A useful audit log answers each of these in seconds, not days.

The minimum useful schema

For every model invocation, log:

{
  "invocation_id": "uuid",
  "timestamp": "ISO-8601",
  "model": {
    "id": "claude-opus-4-7",
    "version": "2026-05-12",
    "provider": "anthropic"
  },
  "caller": {
    "user_id": "...",
    "role": "...",
    "service": "...",
    "purpose_tag": "underwriting-decision"
  },
  "input": {
    "system_prompt_hash": "sha256:...",
    "system_prompt_version": "v3.2.1",
    "user_message": "...",
    "structured_input": { ... },
    "parameters": { "temperature": 0.0, "max_tokens": 2048 }
  },
  "retrieval": [
    {
      "chunk_id": "...",
      "source_document_id": "...",
      "score": 0.81,
      "rank": 1,
      "access_control_context": "...",
      "retrieved_at": "ISO-8601"
    }
  ],
  "tool_calls": [
    {
      "tool_id": "credit-bureau-lookup",
      "tool_version": "v2",
      "input": { ... },
      "output_hash": "sha256:...",
      "duration_ms": 412,
      "authorisation_scope": "..."
    }
  ],
  "output": {
    "raw": "...",
    "structured": { ... },
    "tokens": { "input": 2104, "output": 387 },
    "policy_layer_actions": ["redacted_pii", "passed_safety_filter"]
  },
  "human_override": null,
  "consent_records": ["..."],
  "data_residency": "in-region",
  "linked_invocations": ["..."]
}

The non-obvious entries:

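system_prompt_hash and system_prompt_version. You will change the system prompt. The audit log must let you reproduce the exact prompt that produced any past decision, so keep every prompt version in a versioned store and log the hash of the text actually sent. The hash identifies the prompt; the store lets you recover it.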
retrieval[].access_control_context. Records which access policy authorised the retrieval. Critical for proving the model did not ground a response in data the user was not entitled to see.
output_hash on tool calls. Tool outputs may be large, sensitive, or non-deterministic. Hash and store separately; reconstruct only when needed and authorised.
policy_layer_actions. What the guardrail layer did to the input or output. Regulators want to see the controls firing, not infer them.
consent_records. References to the DPDPA / GDPR / sectoral consent records that authorised the use of personal data in this invocation.
linked_invocations. Agent chains span multiple model calls. The audit log links them so you can reconstruct a session.
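
A minimal sketch of how the hash fields above might be populated, assuming SHA-256 over UTF-8 text for prompts and over canonical JSON (sorted keys, compact separators) for tool outputs. The function names and the in-memory prompt registry are hypothetical illustrations, not part of the schema; a real deployment would back the registry with a versioned store.

```python
import hashlib
import json


def sha256_tag(data: bytes) -> str:
    """Return a digest in the 'sha256:<hex>' form used by the schema."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def hash_text(text: str) -> str:
    """Hash a system prompt exactly as it was sent to the model."""
    return sha256_tag(text.encode("utf-8"))


def hash_json(value) -> str:
    """Hash a tool output via canonical JSON so key order does not matter."""
    canonical = json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return sha256_tag(canonical.encode("utf-8"))


# Hypothetical registry: hash -> (version, full prompt text).
PROMPT_REGISTRY: dict[str, tuple[str, str]] = {}


def register_prompt(version: str, text: str) -> str:
    """Store a prompt version and return the hash to write into the log."""
    digest = hash_text(text)
    PROMPT_REGISTRY[digest] = (version, text)
    return digest


def resolve_prompt(digest: str) -> str:
    """Recover the prompt for a logged hash and check it still matches."""
    _version, text = PROMPT_REGISTRY[digest]
    if hash_text(text) != digest:
        raise ValueError("stored prompt does not match its logged hash")
    return text
```

Hashing canonical JSON rather than the raw response bytes is a design choice: it keeps the hash stable when a tool returns the same fields in a different order, at the cost of not detecting byte-level differences in formatting.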

Storage and integrity

Logs go to append-only storage with cryptographic chaining: each entry references the hash of the previous entry, so post-hoc tampering is detectable. Common stacks: AWS QLDB (AWS has announced end of support for the service, so confirm availability before committing to it) or DynamoDB with stream verification, GCP BigQuery with immutable partitions, or a self-managed PostgreSQL with pgaudit plus hash chaining if cloud-managed options are not permitted.
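
The chaining idea can be sketched in a few lines, independent of the storage engine. This is an illustration under assumptions: the entry field names (prev_hash, entry_hash), the all-zero genesis value, and the use of canonical JSON are choices made for the example, not a prescribed format.

```python
import hashlib
import json

GENESIS_HASH = "sha256:" + "0" * 64  # assumed sentinel for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record's canonical JSON."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    data = (prev_hash + payload).encode("utf-8")
    return "sha256:" + hashlib.sha256(data).hexdigest()


def append_entry(chain: list, record: dict) -> dict:
    """Append a record, linking it to the hash of the entry before it."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "entry_hash": _digest(prev_hash, record),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> int:
    """Return the index of the first entry that fails verification, or -1 if intact."""
    prev_hash = GENESIS_HASH
    for i, entry in enumerate(chain):
        if entry["prev_hash"] != prev_hash:
            return i
        if entry["entry_hash"] != _digest(prev_hash, entry["record"]):
            return i
        prev_hash = entry["entry_hash"]
    return -1
```

A chain like this detects edits and deletions in the middle of the log. Truncation of the most recent entries is only detectable if the latest entry hash is also recorded somewhere the log writer cannot modify.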

Retention follows the longest applicable obligation. For BFSI in India, that is typically eight to ten years. For healthcare under DISHA-aligned regimes, longer.

Queryability

The log is useless if only engineering can read it. The compliance team needs a query surface they can use without writing SQL. Typically this is a small internal application that exposes the audit log behind a structured query interface, able to answer requests such as “show me all invocations where the model recommended decline and the human overrode to approve, in Q1 2026, for customer segment X”.
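
As a sketch of what sits behind such an interface, the example query can be expressed as a filter over log records. The schema leaves output.structured, structured_input, and human_override open, so the field names used here (a decision key in each, and customer_segment in the structured input) are assumptions for illustration, as are the function names.

```python
from datetime import datetime, timezone


def find_overrides(records, model_decision, human_decision, start, end, segment):
    """Return records in [start, end) where a human replaced model_decision
    with human_decision for the given customer segment."""
    matches = []
    for rec in records:
        override = rec.get("human_override")
        if override is None:
            continue  # no human intervention on this invocation
        # Assumes timestamps carry an explicit offset, e.g. "+00:00".
        ts = datetime.fromisoformat(rec["timestamp"])
        if not (start <= ts < end):
            continue
        if rec["output"]["structured"].get("decision") != model_decision:
            continue
        if override.get("decision") != human_decision:
            continue
        if rec["input"]["structured_input"].get("customer_segment") != segment:
            continue
        matches.append(rec)
    return matches


def q1_2026_decline_to_approve(records, segment):
    """The example query from the text: decline overridden to approve in Q1 2026."""
    start = datetime(2026, 1, 1, tzinfo=timezone.utc)
    end = datetime(2026, 4, 1, tzinfo=timezone.utc)
    return find_overrides(records, "decline", "approve", start, end, segment)
```

In a real deployment the same predicate would be pushed down into the log store's query engine; the point of the sketch is that every condition in the compliance team's question maps to a field the schema already records.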

Build it. Every compliance team we have seen started using the query interface heavily within the first month.

Drift and override monitoring

The audit log feeds two ongoing monitors:

Override rate. Where humans override the model frequently, the model is wrong or the policy is wrong; either way, you have something to investigate. Track override rate by decision type, by user, by segment, over time.
Decision distribution drift. The model is approving 8% more applications this month than last month. Is that the model? The data? The customers? The audit log lets you answer.
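
Both monitors reduce to simple aggregations over the log. The sketch below assumes a decision key inside output.structured (the schema does not fix one) and measures drift as the percentage-point change in approval rate between two calendar months; the function names are hypothetical.

```python
from collections import defaultdict


def override_rate_by(records, key_fn):
    """Fraction of invocations with a human override, grouped by key_fn(record)."""
    totals = defaultdict(int)
    overridden = defaultdict(int)
    for rec in records:
        key = key_fn(rec)
        totals[key] += 1
        if rec.get("human_override") is not None:
            overridden[key] += 1
    return {key: overridden[key] / totals[key] for key in totals}


def approval_rate_by_month(records):
    """Share of 'approve' decisions per calendar month, keyed as YYYY-MM."""
    totals = defaultdict(int)
    approvals = defaultdict(int)
    for rec in records:
        month = rec["timestamp"][:7]  # ISO-8601 timestamps start with YYYY-MM
        totals[month] += 1
        if rec["output"]["structured"].get("decision") == "approve":
            approvals[month] += 1
    return {month: approvals[month] / totals[month] for month in totals}


def approval_shift(records, previous_month, current_month):
    """Percentage-point change in approval rate between two months."""
    rates = approval_rate_by_month(records)
    return (rates[current_month] - rates[previous_month]) * 100
```

Grouping is passed in as a function so the same code can slice by decision type, user, or segment, for example override_rate_by(records, lambda r: r["caller"]["user_id"]). A shift in approval rate only says that something changed; separating model, data, and customer effects still means drilling into the underlying records.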

What to skip

Skip storing raw model weights or attention maps. Regulators do not ask for them; storage cost is real; the explainability benefit is small.

Skip storing every embedding the retrieval layer considered. Storage explodes, and you can reconstruct the candidate set by re-running the retrieval against the versioned corpus, provided the corpus and index versions in force at the time are recorded.

Skip “AI explainability” tooling that does not feed the audit log. Tools like SHAP or LIME on the surface of an LLM are theatre. The audit log itself is the explanation: here is what the model saw, here is what it returned, here is the policy layer that filtered it, here is the human who reviewed it.

Why this is worth doing now

Regulators will ask about past decisions, and a log cannot be created after the fact. The model that ships today will be the model the regulator asks about in 2028. The audit infrastructure that ships with it determines whether you can answer.

Done at design time, this is two weeks of engineering. Done after the regulator’s notice arrives, it is an unrecoverable position.

If you would like a gap assessment against this schema, we run them as a fixed-scope engagement.
