Assay Codebase Overview
Version: 2.15.0 (February 2026)
SOTA Status: Bleeding Edge (Judge Reliability, MCP Auth, OTel GenAI, Replay Bundle)
What is Assay?
Assay is a Policy-as-Code engine for Model Context Protocol (MCP) that validates AI agent behavior. It provides:
- Deterministic testing: Replay recorded traces without LLM API calls (milliseconds, $0 cost, 0% flakiness)
- Runtime security: Kernel-level enforcement on Linux to block unauthorized tool access
- Compliance gates: Validate tool arguments, sequences, and blocklists before production
Assay replaces flaky, network-dependent evals with deterministic replay testing. Record agent behavior once, then validate every PR in milliseconds.
High-Level Architecture
Assay is a Rust monorepo with multiple crates, a Python SDK, and comprehensive documentation.
Core Crates
| Crate | Purpose | Key Responsibilities |
|---|---|---|
| `assay-core` | Central evaluation engine | Runner, storage, metrics API, MCP integration, trace handling, baseline/quarantine, providers |
| `assay-cli` | Command line interface | Config loading, Runner construction, test suite execution, reporting |
| `assay-metrics` | Standard metrics library | MustContain, SemanticSimilarity, RegexMatch, JsonSchema, ArgsValid, SequenceValid, ToolBlocklist |
| `assay-mcp-server` | MCP server/proxy | Streaming/online policy enforcement via JSON-RPC over stdio |
| `assay-monitor` | Runtime monitoring | eBPF/LSM integration, kernel-level enforcement |
| `assay-policy` | Policy compilation | Compiles policies into Tier 1 (kernel/LSM) and Tier 2 (userspace) |
| `assay-evidence` | Evidence management | Generates verifiable evidence artifacts for audit/compliance (CloudEvents v1.0, JCS canonicalization, content-addressed IDs) |
| `assay-registry` | Pack Registry client | Secure pack fetching (JCS canonicalization, DSSE verification, OIDC auth, local caching, lockfile v2) |
| `assay-common` | Shared types | Common structs for eBPF/userspace communication |
| `assay-sim` | Attack simulation | Hardening/compliance testing via attack suites |
Python SDK
Located in `assay-python-sdk/python/assay/`:
- `client.py`: `AssayClient` for recording traces to JSONL
- `coverage.py`: `Coverage` for analyzing policy coverage
- `explain.py`: Human-readable explanations of policy violations
- `pytest_plugin.py`: Pytest integration for automatic trace capture
GitHub Action
Repository: https://github.com/Rul1an/assay/tree/main/assay-action
Features:
- Zero-config evidence bundle discovery
- SARIF integration with GitHub Security tab
- PR comments (only when there are findings)
- Baseline comparison via cache
- Artifact upload
See ADR-014 for design details.
Documentation & Examples
- `docs/`: Concepts, use cases, integration guides, reference documentation
- `examples/`: Concrete YAML configs, traces, and scenarios (RAG, baseline gate, negation safety)
Core Components in Detail
assay-core Structure
The core crate is organized into these main modules:
Engine (`engine/`)
- `Runner`: Central orchestrator
  - `run_suite()`: Parallel test execution with semaphore
  - `run_test_with_policy()`: Retries, policy checks, quarantine, agent assertions
  - `run_test_once()`: Fingerprinting, cache lookup, LLM call/replay, metrics evaluation, baseline check
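The retry and quarantine behaviour attributed to `run_test_with_policy()` can be pictured with a minimal sketch. The names `Outcome` and `run_with_policy`, and the "stop at the first passing attempt" rule, are assumptions made for illustration and are not taken from the Assay source.

```rust
/// Outcome of a test after retries (hypothetical type, not Assay's own).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Pass,
    Fail,
    Skipped,
}

/// Runs `attempt` up to `max_retries + 1` times and stops at the first pass.
/// Quarantined tests are skipped without running at all.
/// Returns the outcome together with the number of attempts made.
pub fn run_with_policy<F>(quarantined: bool, max_retries: u32, mut attempt: F) -> (Outcome, u32)
where
    F: FnMut(u32) -> bool,
{
    if quarantined {
        return (Outcome::Skipped, 0);
    }
    let mut attempts: u32 = 0;
    for n in 0..=max_retries {
        attempts += 1;
        // `attempt` receives the zero-based attempt index.
        if attempt(n) {
            return (Outcome::Pass, attempts);
        }
    }
    (Outcome::Fail, attempts)
}
```

With `max_retries = 2`, a test that fails once and then passes reports `Outcome::Pass` after two attempts.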
Storage (`storage/`)
- `Store`: SQLite wrapper for runs, results, attempts, embeddings, judge cache
- Schema: runs, results, attempts, embeddings, episodes/steps (for trace ingestion)
- Methods: `create_run()`, `insert_result_embedded()`, `get_last_passing_by_fingerprint()`
Trace (`trace/`)
- `ingest`: JSONL traces → database
- `precompute`: Pre-compute embeddings and judge results for deterministic, fast runs
- `verify`, `upgrader`, `otel_ingest`: Schema validation, version migration, OpenTelemetry ingest
MCP (`mcp/`)
- JSON-RPC parsing, tool call mapping to policies, audit logging
- `mapper_v2`: Maps MCP tool calls to policy checks
- `proxy`: Intercepts and validates tool calls, `ProxyConfig` with logging paths
- `identity`: Tool identity management (Phase 9): tool metadata hashing and pinning
- `policy`: `McpPolicy` with `tool_pins` for integrity verification
- `jcs`: JCS canonicalization (RFC 8785) for deterministic JSON
- `signing`: Ed25519 tool signing with DSSE PAE encoding (`sign_tool`, `verify_tool`)
- `trust_policy`: Trust policy loading (`require_signed`, `trusted_key_ids`)
- `decision`: `DecisionEmitter` for tool.decision events, reason codes (`P_*`, `M_*`, `S_*`)
- `lifecycle`: `LifecycleEmitter` for mandate.used/revoked events (CloudEvents)
- `tool_call_handler`: Central handler integrating policy + mandate authorization
Runtime (`runtime/`)
- `mandate_store`: SQLite-backed mandate consumption tracking
  - `AuthzReceipt` with `was_new` flag for idempotent retries
  - `RevocationRecord` for mandate cancellation
  - Deterministic `use_id` computation (content-addressed SHA256)
  - Tables: `mandates`, `mandate_uses`, `nonces`, `mandate_revocations`
- `authorizer`: 7-step authorization flow per SPEC-Mandate §7.6-7.8
  - Validity window check (with ±30s skew)
  - Revocation check (no skew; hard cutoff)
  - Scope and kind verification
  - `transaction_ref` verification for commit tools
  - Atomic consumption
- `schema`: SQLite DDL for mandate runtime tables (schema v3)
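The difference between the skew-tolerant validity window and the hard revocation cutoff can be sketched as follows. The `Mandate` fields, the `Denial` variants, and the inclusive/exclusive boundary choices are assumptions for illustration; SPEC-Mandate is the authority on the actual rules.

```rust
/// Clock skew tolerated on the validity window, in seconds (±30s).
pub const SKEW_SECS: i64 = 30;

/// Reasons a time-based check can fail (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum Denial {
    NotYetValid,
    Expired,
    Revoked,
}

/// Hypothetical mandate record; all timestamps are Unix seconds.
pub struct Mandate {
    pub not_before: i64,
    pub expires_at: i64,
    pub revoked_at: Option<i64>,
}

/// Applies the validity window (with skew) and then the revocation check (without skew).
pub fn check_time(m: &Mandate, now: i64) -> Result<(), Denial> {
    // The window is widened by the skew on both sides.
    if now < m.not_before - SKEW_SECS {
        return Err(Denial::NotYetValid);
    }
    if now > m.expires_at + SKEW_SECS {
        return Err(Denial::Expired);
    }
    // Revocation is a hard cutoff: no skew is applied here.
    if let Some(revoked_at) = m.revoked_at {
        if now >= revoked_at {
            return Err(Denial::Revoked);
        }
    }
    Ok(())
}
```

Keeping the two checks separate makes the asymmetry explicit: a mandate may be used slightly outside its window to absorb clock drift, but never at or after its revocation time.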
Report (`report/`)
- Output formatters: `console` (summary), `json`, `junit`, `sarif`
- `RunArtifacts`: Container for run_id, suite, results
- `Summary` (`summary.rs`): Machine-readable run summary with `schema_version`, `reason_code_version`, `exit_code`, `reason_code`, `seeds` (required; `Seeds` with `order_seed`/`judge_seed` as string or null via `serde_seed`), `judge_metrics` (optional; abstain_rate, flip_rate, etc.). `Summary::with_seeds()` injects seeds; written to summary.json and reflected in run.json.
- `print_run_footer` (`console.rs`): Prints one line `Seeds: seed_version=1 order_seed=… judge_seed=…` and a judge metrics line to stderr (CI job summary visibility). Called from assay-cli after run/ci.
- run.json / summary.json: Contract per SPEC-PR-Gate-Outputs-v1 (§3.3.1 Seeds, §3.3.2 Judge metrics). Seeds are decimal strings or null for JS/TS precision safety.
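The "decimal string or null" encoding for seeds exists because a JavaScript `Number` is an IEEE-754 double and cannot represent every 64-bit integer exactly. A minimal sketch of that encoding is shown below; it builds the JSON by hand instead of using serde, and the helper names `seed_json` and `seeds_fragment` are hypothetical.

```rust
/// Encodes one seed as a JSON value: a quoted decimal string, or `null` when absent.
pub fn seed_json(seed: Option<u64>) -> String {
    match seed {
        Some(v) => format!("\"{}\"", v),
        None => "null".to_string(),
    }
}

/// Builds an illustrative JSON fragment holding both seeds.
/// The real summary.json carries more fields; this shows only the seed encoding.
pub fn seeds_fragment(order_seed: Option<u64>, judge_seed: Option<u64>) -> String {
    format!(
        "{{\"order_seed\":{},\"judge_seed\":{}}}",
        seed_json(order_seed),
        seed_json(judge_seed)
    )
}
```

Because the value travels as a string, a consumer in JS/TS can parse it with `BigInt` and keep every digit, even for seeds above 2^53.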
Providers & Metrics API
- `providers/`: LLM clients (OpenAI, fake, trace replay), embedders, strict mode wrappers
- `metrics_api.rs`: Trait definitions that `assay-metrics` implements
Other Key Modules
- `baseline/`: Compares new scores with historical baselines
- `quarantine.rs`: Marks and skips flaky tests
- `agent_assertions/`: Enforces sequence and structural expectations on traces (e.g., tool call order)
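A tool-call ordering expectation of the kind `agent_assertions/` enforces can be illustrated with a subsequence check. This is a sketch under the assumption that other calls may be interleaved between the expected ones; the function name `in_order` is hypothetical and the real assertion semantics may be stricter.

```rust
/// Returns true when every name in `expected` occurs in `calls`
/// in the same relative order; unrelated calls may appear in between.
pub fn in_order(calls: &[&str], expected: &[&str]) -> bool {
    let mut matched = 0;
    for call in calls {
        // Advance only when the next expected call is seen.
        if matched < expected.len() && *call == expected[matched] {
            matched += 1;
        }
    }
    matched == expected.len()
}
```

For a trace `["search", "fetch", "summarize"]`, the expectation `["search", "summarize"]` holds, while `["summarize", "search"]` does not.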
assay-cli Flow
- Entry: `main.rs` parses `Cli` args → calls `dispatch()`
- Command handling: `dispatch()` matches command → calls handler (e.g., `cmd_run()`)
- Runner construction: `build_runner()` creates `Runner`:
  - Opens `Store` (SQLite)
  - Creates `VcrCache`
  - Selects LLM client (trace replay or live)
  - Loads metrics from `assay-metrics`
  - Configures embedder/judge/baseline if provided
- Execution: `Runner::run_suite()` → parallel `run_test_with_policy()` → `run_test_once()` → LLM call → metric evaluation → store results
- Reporting: `RunArtifacts` → formatters (console/JSON/JUnit/SARIF)
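The choice between a trace-replay client and a live client is what makes runs deterministic. A rough sketch of the replay side follows; the `LlmClient` trait, the `ReplayClient` type, and the prompt-keyed lookup are illustrative assumptions, not the interfaces in `providers/`.

```rust
use std::collections::HashMap;

/// Hypothetical provider interface: one prompt in, one completion out.
pub trait LlmClient {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Serves responses from a recorded trace instead of calling a network API.
pub struct ReplayClient {
    recorded: HashMap<String, String>,
}

impl ReplayClient {
    /// Builds a client from (prompt, response) pairs taken from a recording.
    pub fn new(pairs: &[(&str, &str)]) -> Self {
        let recorded = pairs
            .iter()
            .map(|(prompt, response)| (prompt.to_string(), response.to_string()))
            .collect();
        ReplayClient { recorded }
    }
}

impl LlmClient for ReplayClient {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        // A missing entry is an error: replay never falls back to a live call.
        self.recorded
            .get(prompt)
            .cloned()
            .ok_or_else(|| format!("no recorded response for prompt: {}", prompt))
    }
}
```

Since the same prompt always yields the same recorded response, downstream metrics see identical input on every run.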
assay-metrics Metrics
Metrics are composable building blocks:
- Content metrics: `MustContain`, `MustNotContain`, `RegexMatch`
- Semantic metrics: `SemanticSimilarity`, `Faithfulness`, `Relevance` (using embedder/judge)
- Structure/usage: `ArgsValid`, `SequenceValid`, `ToolBlocklist`, `Usage`
- JSON validation: `JsonSchema` for argument validation
Integration: CLI loads a standard set via default_metrics(), and policies reference these metrics per testcase.
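To show what "composable building blocks" might look like in practice, here is a simplified sketch of two metrics behind a common trait. The trait below is a stand-in written for this example: its method names and signature are assumptions, and the real trait in `metrics_api.rs` may differ.

```rust
/// Hypothetical stand-in for the metric trait.
pub trait Metric {
    fn name(&self) -> &'static str;
    /// Returns a score in [0.0, 1.0]; 1.0 means the check passed.
    fn evaluate(&self, output: &str, tool_calls: &[&str]) -> f64;
}

/// Passes when the output contains the given substring.
pub struct MustContain(pub String);

/// Passes when none of the listed tools were called.
pub struct ToolBlocklist(pub Vec<String>);

impl Metric for MustContain {
    fn name(&self) -> &'static str {
        "must_contain"
    }
    fn evaluate(&self, output: &str, _tool_calls: &[&str]) -> f64 {
        if output.contains(self.0.as_str()) { 1.0 } else { 0.0 }
    }
}

impl Metric for ToolBlocklist {
    fn name(&self) -> &'static str {
        "tool_blocklist"
    }
    fn evaluate(&self, _output: &str, tool_calls: &[&str]) -> f64 {
        let blocked = tool_calls
            .iter()
            .any(|call| self.0.iter().any(|b| b.as_str() == *call));
        if blocked { 0.0 } else { 1.0 }
    }
}

/// Evaluates every metric and requires all of them to pass.
pub fn all_pass(metrics: &[Box<dyn Metric>], output: &str, tool_calls: &[&str]) -> bool {
    metrics.iter().all(|m| m.evaluate(output, tool_calls) >= 1.0)
}
```

Because each metric is a trait object, a test case can combine any subset of them without the runner knowing their concrete types.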
MCP, Policies, Monitor & LSM
Policy Compilation (`assay-policy`)
- Policies are compiled into a `CompiledPolicy` with:
  - Tier 1: Kernel/LSM rules (exact paths, CIDRs, ports)
  - Tier 2: Userspace rules (glob/regex, complex constraints)
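The tier split can be pictured as a partition of rules by how much matching power they need. The sketch below handles path rules only and assumes that a path is "exact" when it contains no glob metacharacters; `CompiledPaths`, `is_exact_path`, and `compile_paths` are hypothetical names, and the real compiler in `assay-policy` likely applies more criteria.

```rust
/// Path rules split by enforcement tier (illustrative subset of a compiled policy).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct CompiledPaths {
    /// Exact paths, suitable for kernel-side lookup.
    pub tier1: Vec<String>,
    /// Patterns that need userspace glob matching.
    pub tier2: Vec<String>,
}

/// Assumption: a path is exact when it has no glob metacharacters.
pub fn is_exact_path(pattern: &str) -> bool {
    !pattern.contains(|c: char| matches!(c, '*' | '?' | '[' | '{'))
}

/// Partitions path rules into Tier 1 and Tier 2, preserving input order.
pub fn compile_paths(patterns: &[&str]) -> CompiledPaths {
    let mut out = CompiledPaths::default();
    for p in patterns {
        if is_exact_path(p) {
            out.tier1.push(p.to_string());
        } else {
            out.tier2.push(p.to_string());
        }
    }
    out
}
```

Exact matches can be answered with a simple map lookup, which is the kind of work that fits kernel-side constraints, while pattern matching stays in userspace.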
Monitor & eBPF (`assay-monitor`, `assay-common`, `assay-ebpf`)
- eBPF programs run in the kernel
- Userspace monitor reads events and applies Tier 1 rules
- `assay-common` contains no_std-compatible structs for event types, keys, etc.
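A struct shared between an eBPF program and its userspace reader typically has a fixed layout and avoids heap allocation. The following sketch shows that general shape using only `core` items; `FileOpenEvent`, its fields, and the 256-byte buffer are invented for illustration and are not the types defined in `assay-common`.

```rust
/// Maximum path bytes carried per event (assumed value for this sketch).
pub const MAX_PATH: usize = 256;

/// Hypothetical file-open event shared between kernel and userspace.
/// `#[repr(C)]` fixes the field layout so both sides agree on it.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FileOpenEvent {
    pub pid: u32,
    pub path_len: u32,
    pub path: [u8; MAX_PATH],
}

impl FileOpenEvent {
    /// Builds an event, truncating the path to `MAX_PATH` bytes.
    pub fn new(pid: u32, path: &[u8]) -> Self {
        let len = core::cmp::min(path.len(), MAX_PATH);
        let mut buf = [0u8; MAX_PATH];
        buf[..len].copy_from_slice(&path[..len]);
        FileOpenEvent { pid, path_len: len as u32, path: buf }
    }

    /// Returns the valid portion of the path buffer.
    pub fn path_bytes(&self) -> &[u8] {
        // Clamp in case `path_len` was written by an untrusted producer.
        let len = core::cmp::min(self.path_len as usize, MAX_PATH);
        &self.path[..len]
    }
}
```

A fixed-size, `Copy` struct like this can be written into a ring buffer by the kernel side and read back byte-for-byte by the monitor.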
MCP Server (`assay-mcp-server`)
- Runs as MCP proxy via stdio (JSON-RPC)
- Inspects tool calls, applies policies, makes deny/allow decisions
- Handles rate limiting, audit logging
Execution Flow (CLI → Core)
```text
User Command
↓
CLI (main.rs)
↓
dispatch() → Command Handler
↓
build_runner()
├─→ Store (SQLite)
├─→ VcrCache
├─→ LLM Client (trace replay or live)
├─→ Metrics (from assay-metrics)
├─→ Embedder (optional)
├─→ Judge (optional)
└─→ Baseline (optional)
↓
Runner::run_suite()
↓
Parallel run_test_with_policy()
↓
run_test_once()
├─→ Fingerprinting
├─→ Cache lookup
├─→ LLM call (or replay)
├─→ Metrics evaluation
└─→ Baseline check
↓
Store results
↓
Report (console/JSON/JUnit/SARIF; SARIF truncation at 25k results by default, with sarif.omitted in run/summary when truncated; see PR #160)
```
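The SARIF truncation mentioned in the final step can be sketched as "sort, cut, and count what was dropped". The sort key used here (plain string order) is an assumption chosen to make the cut deterministic; the ordering Assay actually uses is not stated in this overview, and `truncate_results` is a hypothetical name.

```rust
/// Default cap on SARIF results, per the 25k figure above.
pub const DEFAULT_MAX_RESULTS: usize = 25_000;

/// Keeps at most `max` results in a stable order and reports how many were omitted.
pub fn truncate_results(mut results: Vec<String>, max: usize) -> (Vec<String>, usize) {
    // Sorting first means the same input set always keeps the same entries.
    results.sort();
    let omitted = results.len().saturating_sub(max);
    results.truncate(max);
    (results, omitted)
}
```

The omitted count is what a field such as `sarif.omitted` would carry, so consumers can tell a truncated report from a complete one.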
Key Design Principles
- Determinism: Same input + same policy = same result (zero flakiness)
- Statelessness: Validation requires only policy file + trace list
- Policy-as-Code: Uses logic, not LLMs, for evaluation
- Separation of Concerns: CLI handles UX/config, core handles evaluation logic
- Extensibility: Metrics, providers, and policies are pluggable via traits
SOTA Features (January 2026)
| Feature | Status | Description |
|---|---|---|
| Judge Reliability | ✅ Audit Grade | Randomized order default, borderline band [0.4-0.6], Adaptive Majority (2-of-3), per-suite policies, E7 Audit Evidence |
| E2.3 SARIF limits | ✅ PR #160 | Deterministic truncation (default 25k), runs[0].properties.assay, sarif.omitted in run/summary; consumers use summary/run for counts |
| MCP Auth Hardening | 🔄 P1 | RFC 8707 resource indicators, alg/typ/crit JWT hardening, JWKS rotation, DPoP (optional) |
| OTel GenAI | 🔄 P1 | Semconv version gating, low-cardinality metrics, composable redaction policies |
| Replay Bundle | ✅ In Progress (E9.1–E9.3) | Manifest, container writer, toolchain capture, path validation, provenance. Module: assay-core/src/replay/ |
| CI Optimization | ✅ Implemented | Skip kernel matrix for pure dep bumps, auto-cancel superseded runs |
| Self-Healing Runner | ✅ Implemented | Health check with cache auto-heal, stale job cleanup, PR prioritization |
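One plausible reading of the borderline band and the 2-of-3 adaptive majority from the Judge Reliability row is sketched below. The escalation rule (only re-ask the judge when the first score falls inside the band), the 0.5 pass threshold for individual votes, and the name `adaptive_verdict` are all assumptions; the table gives only the band and the majority size.

```rust
pub const BAND_LOW: f64 = 0.4;
pub const BAND_HIGH: f64 = 0.6;

/// True when a score falls inside the borderline band [0.4, 0.6].
pub fn is_borderline(score: f64) -> bool {
    (BAND_LOW..=BAND_HIGH).contains(&score)
}

/// Calls `judge` once; if the score is borderline, keeps calling until one side
/// has two votes (at most three calls). Returns (passed, number_of_calls).
pub fn adaptive_verdict<F: FnMut() -> f64>(mut judge: F) -> (bool, u32) {
    let first = judge();
    if !is_borderline(first) {
        // Clear-cut scores are accepted without extra judge calls.
        return (first > BAND_HIGH, 1);
    }
    let mut passes = u32::from(first >= 0.5);
    let mut calls: u32 = 1;
    while calls < 3 {
        passes += u32::from(judge() >= 0.5);
        calls += 1;
        let fails = calls - passes;
        if passes == 2 || fails == 2 {
            break; // majority reached early
        }
    }
    (passes >= 2, calls)
}
```

Under this reading, confident scores cost a single judge call, and only ambiguous cases pay for a second or third opinion.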
Exit Codes & Reason Codes
Exit Codes (Coarse, Stable)
| Code | Name | When |
|---|---|---|
| 0 | EXIT_SUCCESS | All tests passed |
| 1 | EXIT_TEST_FAILURE | One or more tests failed |
| 2 | EXIT_CONFIG_ERROR | Configuration or user error |
| 3 | EXIT_INFRA_ERROR | Infrastructure or judge unavailable |
| 4 | EXIT_WOULD_BLOCK | Sandbox/policy would block execution |
Reason Codes (Fine-Grained)
| Category | Codes | Exit |
|---|---|---|
| Config | E_CFG_PARSE, E_TRACE_NOT_FOUND, E_MISSING_CONFIG, E_BASELINE_INVALID, E_POLICY_PARSE, E_INVALID_ARGS | 2 |
| Infra | E_JUDGE_UNAVAILABLE, E_RATE_LIMIT, E_PROVIDER_5XX, E_TIMEOUT, E_NETWORK_ERROR | 3 |
| Test | E_TEST_FAILED, E_POLICY_VIOLATION, E_SEQUENCE_VIOLATION, E_ARG_SCHEMA | 1 |
Compatibility
- `--exit-codes=v2` (default): New exit code mapping
- `--exit-codes=v1`: Legacy mapping (exit 3 = trace not found)
- Environment: `ASSAY_EXIT_CODES=v1|v2`
Note: Always use reason_code in summary.json for programmatic handling, not exit codes.
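For consumers that do need to relate the two levels, the v2 tables above translate directly into a lookup. The function name `exit_code_for` is hypothetical and the sketch covers only the reason codes listed in this document; unknown codes return `None` instead of guessing.

```rust
pub const EXIT_TEST_FAILURE: i32 = 1;
pub const EXIT_CONFIG_ERROR: i32 = 2;
pub const EXIT_INFRA_ERROR: i32 = 3;

/// Maps a fine-grained reason code to its coarse v2 exit code.
pub fn exit_code_for(reason_code: &str) -> Option<i32> {
    match reason_code {
        // Test category
        "E_TEST_FAILED" | "E_POLICY_VIOLATION" | "E_SEQUENCE_VIOLATION" | "E_ARG_SCHEMA" => {
            Some(EXIT_TEST_FAILURE)
        }
        // Config category
        "E_CFG_PARSE" | "E_TRACE_NOT_FOUND" | "E_MISSING_CONFIG" | "E_BASELINE_INVALID"
        | "E_POLICY_PARSE" | "E_INVALID_ARGS" => Some(EXIT_CONFIG_ERROR),
        // Infra category
        "E_JUDGE_UNAVAILABLE" | "E_RATE_LIMIT" | "E_PROVIDER_5XX" | "E_TIMEOUT"
        | "E_NETWORK_ERROR" => Some(EXIT_INFRA_ERROR),
        _ => None,
    }
}
```

Several reason codes collapse onto one exit code, which is why branching on `reason_code` preserves detail that the exit code alone cannot.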
Extension Points
New Metrics
- Implement in `crates/assay-metrics/src/` following the `Metric` trait
- Register in factory (e.g., `default_metrics()`) so policies can use them
New CLI Commands
- Add to `assay-cli` (CLI structure, command handler)
- Wire to `build_runner()`/`Runner` if needed
New Policy Features
- Extend policy engine in `assay-core` (parser/validator, constraints)
- Map to `assay-policy` for Tier 1/2 compilation
New Python SDK Features
- Add thin wrappers in `assay-python-sdk/python/assay/` around existing CLI/core functionality
Related Documentation
- User Flows - How users interact with the system
- Interdependencies - Crate relationships and interfaces
- Architecture Diagrams - Visual architecture representations
- Entry Points - All interaction points
Architecture Decision Records
Key ADRs for understanding the codebase:
| ADR | Topic | Summary |
|---|---|---|
| ADR-006 | Evidence Contract | Schema v1, JCS canonicalization, content-addressed IDs |
| ADR-007 | Deterministic Provenance | Reproducible bundle generation |
| ADR-008 | Evidence Streaming | OTel Collector pattern, CloudEvents out of hot path |
| ADR-009 | WORM Storage | S3 Object Lock for compliance retention |
| ADR-010 | Evidence Store API | Multi-tenant REST API for bundle storage |
| ADR-011 | Tool Signing | Ed25519 x-assay-sig for supply chain security ✅ |
| SPEC-Tool-Signing-v1 | Tool Signing Spec | Formal spec: JCS, DSSE PAE, key_id trust ✅ |
| ADR-013 | EU AI Act Pack | Compliance pack system with Article 12 mapping |
| ADR-014 | GitHub Action v2 | Separate repo, SARIF discipline, zero-config ✅ |
| ADR-015 | BYOS Storage | Bring-your-own S3 storage strategy ✅ |
| SPEC-Pack-Registry-v1 | Pack Registry | Secure pack fetching: JCS, DSSE sidecar, no-TOFU trust ✅ |
See ADR Index for the complete list.