Mark Lubin


Experiential Policy Memory

A self-improving procedural agent layer where the LLM invents candidate strategies and a bandit-backed policy memory learns, from real outcomes, which strategies work in which states — online, safely, and without weight updates.


The Problem

Agentic systems have three compounding cost problems:

They fail for structural reasons. Missing steps, wrong tool order, brittle UI/tool interaction, untracked constraints, lack of verification. These aren't model capability issues. They're procedural competence issues.

They figure things out from scratch every time. An agent that successfully navigated a complex workflow yesterday has zero memory of that success today. Every invocation pays the full exploration cost again. Without manual prompt engineering to encode “here's how to do X,” the agent reinvents the wheel on every run.

The exploration cost is real money. Every failed attempt, every redundant tool call, every retry burns tokens and API calls. An agent that takes five attempts to get something right costs five times as much as one that learned from its first success. At enterprise scale, across thousands of daily tasks, the waste is enormous.

Current approaches optimize the wrong surface area. The gap between “semantic RAG memory” and “full RL with weight updates” is real and underserved. EPM fills it — and the value proposition is both higher success rates and lower cost per successful outcome as the system learns.


The Insight

The minimal structure that creates real learning is:

  1. Canonical semantic state — a typed snapshot of what is known, what is missing, and what environment/tool context applies
  2. Constrained action/plan representation — a structured plan graph that can be executed, compared, and deduplicated independent of wording
  3. Verifiable outcomes — a reward signal that reflects real success, not vibes
  4. Online selection + update — explore/exploit that allocates traffic to strategies that work and retires strategies that don't

This is experiential learning without weight updates. Learning lives in a policy memory substrate, not in model parameters.


How It Works

1. Build canonical state

For each task attempt, construct a typed state snapshot. Example fields for a schedule_meeting task might include the known attendees, the unresolved constraints, and the calendar tool in scope.

The state is normalized, hashed, and clustered for state abstraction.
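A minimal sketch of the state-building step, assuming a hypothetical schema for schedule_meeting (the field names `known_attendees`, `missing_fields`, `calendar_tool`, and `tool_version` are illustrative, not part of the original design):

```python
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)
class MeetingState:
    known_attendees: tuple  # resolved participant identifiers
    missing_fields: tuple   # e.g. ("time_window", "room")
    calendar_tool: str      # environment/tool identifier
    tool_version: str       # lets policies be versioned against drift

def canonical_hash(state: MeetingState) -> str:
    """Normalize (sort, lowercase) then hash, so equivalent snapshots
    map to the same state key regardless of field ordering."""
    payload = {
        "known_attendees": sorted(a.lower() for a in state.known_attendees),
        "missing_fields": sorted(state.missing_fields),
        "calendar_tool": state.calendar_tool,
        "tool_version": state.tool_version,
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

s1 = MeetingState(("Bob", "alice"), ("time_window",), "gcal", "v2")
s2 = MeetingState(("Alice", "bob"), ("time_window",), "gcal", "v2")
assert canonical_hash(s1) == canonical_hash(s2)  # same canonical state
```

Hashing the normalized snapshot gives a cheap first-pass state key; clustering then groups nearby keys for abstraction.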

2. Generate candidate plans

The LLM produces structured plans, not prose.

Plans are compiled into a normalized representation so semantically equivalent plans map to the same arm.
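One way to sketch that compilation step, assuming plans are lists of (tool, args) steps — the step shape and tool names here are illustrative assumptions:

```python
import hashlib
import json

def normalize_plan(steps):
    """Canonicalize a plan so differently worded but equivalent plans
    compile to the same bandit arm: lowercase tool names and sort
    argument keys, discarding cosmetic variation."""
    canon = []
    for step in steps:
        canon.append({
            "tool": step["tool"].strip().lower(),
            "args": {k: step["args"][k] for k in sorted(step["args"])},
        })
    return canon

def arm_id(steps) -> str:
    blob = json.dumps(normalize_plan(steps), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

plan_a = [{"tool": "Calendar.find_slots", "args": {"days": 5, "attendees": ["a", "b"]}},
          {"tool": "calendar.book", "args": {"slot": "best"}}]
plan_b = [{"tool": "calendar.find_slots", "args": {"attendees": ["a", "b"], "days": 5}},
          {"tool": "Calendar.book", "args": {"slot": "best"}}]
assert arm_id(plan_a) == arm_id(plan_b)  # same arm despite surface differences
```

Surface-level normalization like this only merges trivially equivalent plans; the deeper semantic-equivalence problem is discussed under Hard Problems.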

3. Select plan via explore/exploit

For the current state cluster, maintain a posterior over success for each known plan, and select with Thompson Sampling or UCB.
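A minimal Thompson Sampling sketch over plan arms for one state cluster, with each arm keeping a Beta posterior over success (class and method names are illustrative):

```python
import random
from collections import defaultdict

class PlanBandit:
    def __init__(self):
        # arm_id -> [alpha, beta], initialized to a uniform Beta(1, 1)
        self.stats = defaultdict(lambda: [1.0, 1.0])

    def select(self, arm_ids):
        """Sample each arm's posterior; the highest draw wins, so traffic
        flows to plans that work while uncertain arms still get explored."""
        return max(arm_ids, key=lambda a: random.betavariate(*self.stats[a]))

    def update(self, arm_id, reward, decay=0.99):
        """Decay old evidence, then credit the observed outcome in [0, 1]."""
        s = self.stats[arm_id]
        s[0] = s[0] * decay + reward
        s[1] = s[1] * decay + (1.0 - reward)

bandit = PlanBandit()
bandit.update("plan_a", 1.0)  # plan_a succeeded
bandit.update("plan_b", 0.0)  # plan_b failed
chosen = bandit.select(["plan_a", "plan_b"])
```

The decay term keeps the posterior responsive to recent outcomes, which matters later for drift handling.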

4. Execute with safety gates

Hard gates prevent dangerous exploration: exploratory plans are restricted to safe action subsets, new plans are canaried, and irreversible commits require verification first.
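A sketch of such gates, assuming a simple allow-list model — the tool names and categories below are illustrative, not from the actual design:

```python
# Read-only tools that are always safe to explore with (assumed examples).
SAFE_TOOLS = {"calendar.find_slots", "crm.read", "ticket.read"}
# Tools whose effects cannot be rolled back (assumed examples).
IRREVERSIBLE = {"calendar.book", "email.send", "ticket.close"}

def gate(plan_steps, arm_is_canary: bool, verified: bool):
    """Reject a plan before execution if exploration could do damage.
    Returns (allowed, reason)."""
    for step in plan_steps:
        tool = step["tool"]
        if arm_is_canary and tool not in SAFE_TOOLS:
            return False, f"canary plans restricted to safe tools: {tool}"
        if tool in IRREVERSIBLE and not verified:
            return False, f"{tool} requires verification before commit"
    return True, "ok"

allowed, reason = gate([{"tool": "email.send"}], arm_is_canary=False, verified=False)
# a blocked plan falls back to a known-good arm instead of executing
```

Gating runs before execution, so an unproven arm can never reach an irreversible action unverified.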

Outcomes are scored by verifier strength: a machine-checkable success counts for more than a weak signal such as the user accepting output without edits.
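One way to encode verifier strength, where the reward equals the strongest verifier that passed — the tiers and weights here are assumptions for illustration:

```python
VERIFIER_WEIGHTS = {
    "machine_check": 1.0,  # e.g. the meeting actually exists on the calendar
    "schema_check": 0.6,   # output parsed and validated against a schema
    "user_signal": 0.2,    # weak: user accepted the result without edits
}

def score(outcomes: dict) -> float:
    """Reward is the weight of the strongest passing verifier, so a weak
    user signal alone can never claim full success."""
    return max((VERIFIER_WEIGHTS[k] for k, v in outcomes.items() if v),
               default=0.0)

r = score({"schema_check": True, "user_signal": True})
# strongest passing verifier is schema_check -> reward 0.6
```

Capping weak signals this way limits how much noisy evidence (like user edits) can move a plan's posterior.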

5. Update policy store

Store the episode, update plan posterior for that state cluster. Decay old evidence, version by environment/tool versions so drift doesn't poison learning.
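A sketch of that update, keying posteriors by (state cluster, arm, environment version) so a tool upgrade starts fresh evidence instead of poisoning the old posterior — class and field names are illustrative:

```python
from collections import defaultdict

class PolicyStore:
    def __init__(self, decay=0.99):
        self.decay = decay
        # (state_cluster, arm_id, env_version) -> [alpha, beta]
        self.posteriors = defaultdict(lambda: [1.0, 1.0])
        self.episodes = []  # full traces kept for audit and replay

    def record(self, state_cluster, arm_id, env_version, reward, trace):
        key = (state_cluster, arm_id, env_version)
        a, b = self.posteriors[key]
        # exponential decay keeps recent outcomes dominant
        self.posteriors[key] = [a * self.decay + reward,
                                b * self.decay + (1.0 - reward)]
        self.episodes.append({"key": key, "reward": reward, "trace": trace})

store = PolicyStore()
store.record("cluster_7", "plan_a", "gcal_v2", 1.0, {"steps": 3})
store.record("cluster_7", "plan_a", "gcal_v3", 0.0, {"steps": 3})
# same plan, different environment version -> independent evidence
```

Because `env_version` is part of the key, evidence gathered against an old tool version simply stops being selected once the environment changes.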


Why This Wins


Hard Problems

State representation

“Build canonical state” does enormous work in one sentence. State abstraction is the hard part of the entire system. Too coarse and different situations collapse — the policy learns noise. Too fine and every situation is unique — the system never accumulates enough data to learn. Who defines the state schema? The developer? The LLM? Is it learned? Detecting aliasing via high reward variance is correct in theory but requires significant data volume per cluster.

Plan deduplication

“Two differently worded plans that do the same thing become the same arm” — how? Semantic equivalence of structured plan graphs is not a solved problem. Without reliable dedup, the bandit has a combinatorial explosion of arms, most semantically identical, and never accumulates enough signal on any single arm.

Verifier coverage

The system needs machine-checkable success signals. Works for scheduling meetings, filing tickets, structured CRUD. Breaks for creative tasks, ambiguous user intent, long-horizon outcomes. Weak signals like user edits may be too noisy to drive posterior updates meaningfully.

Credit assignment

Bandit-over-whole-plan is the wedge, but the real value is step-level learning — when to ask, when to verify, when to branch. Getting there requires strong intermediate reward signals that most deployments won't have initially.


Risks and Mitigations

| Risk | Detection | Mitigation |
| --- | --- | --- |
| State aliasing | High reward variance for same (state, plan) | Add discriminating slots, split clusters |
| Reward hacking | Optimizing the proxy, not reality | Strengthen verifiers, add negative checks, require confirmations |
| Exploration damage | Unsafe actions during the explore phase | Restrict to safe subsets, canary new plans, require verification before commits, maintain rollback |
| Plan explosion | Too many unique arms, no convergence | Better plan normalization, minimum execution threshold before creating a new arm |
| Drift poisoning | Old policies applied to a changed environment | Version policies by environment/tool version, decay old evidence |

Success Metrics


Data Flywheel

The policy store enables a public/private split.

Every execution generates training signal → better policies → better next execution → more usage. Competitors can copy the architecture but not the accumulated policy data.