Experiential Policy Memory
A self-improving procedural agent layer where the LLM invents candidate strategies and a bandit-backed policy memory learns, from real outcomes, which strategies work in which states — online, safely, and without weight updates.
The Problem
Agentic systems have three compounding cost problems:
They fail for structural reasons. Missing steps, wrong tool order, brittle UI/tool interaction, untracked constraints, lack of verification. These aren't model capability issues. They're procedural competence issues.
They figure things out from scratch every time. An agent that successfully navigated a complex workflow yesterday has zero memory of that success today. Every invocation pays the full exploration cost again. Without manual prompt engineering to encode “here's how to do X,” the agent reinvents the wheel on every run.
The exploration cost is real money. Every failed attempt, every redundant tool call, every retry burns tokens and API calls. An agent that takes 5 attempts to get something right costs 5x what an agent that learned from its first success costs. At enterprise scale across thousands of daily tasks, the waste is enormous.
Current approaches optimize the wrong surface area:
- Agent memory (RAG-style): Stores prose, retrieves by semantic similarity. Doesn't preserve the causal structure of competence (state → action → outcome). Useful for knowledge, useless for skill.
- Manual prompt engineering: Works but doesn't scale. Someone has to observe failures, diagnose them, and encode fixes as instructions. That's a human doing the learning the system should do itself.
- LLM routing products: Treat “paths” as model/tool/temperature configs. That's a bandit over configurations, not over strategies. Helps at the margin but doesn't create procedural competence.
- Fine-tuning: Too global, too slow, too risky. Can't safely adapt to per-deployment tool quirks, UI drift, or user-specific preferences without constant retraining.
The gap between “semantic RAG memory” and “full RL with weight updates” is real and underserved. Experiential Policy Memory (EPM) fills it — and the value proposition is both higher success rates and lower cost per successful outcome as the system learns.
The Insight
The minimal structure that creates real learning is:
- Canonical semantic state — a typed snapshot of what is known, what is missing, and what environment/tool context applies
- Constrained action/plan representation — a structured plan graph that can be executed, compared, and deduplicated independent of wording
- Verifiable outcomes — a reward signal that reflects real success, not vibes
- Online selection + update — explore/exploit that allocates traffic to strategies that work and retires strategies that don't
This is experiential learning without weight updates. Learning lives in a policy memory substrate, not in model parameters.
How It Works
1. Build canonical state
For each task attempt, construct a typed state snapshot. Example fields for a schedule_meeting task:
- Task type
- Slots filled/missing (attendees_known, timezone_known, duration_known)
- Environment context (calendar provider, auth state, locale)
- Recent failure codes (conflict_check_failed, ambiguous_attendee)
The state is normalized, hashed, and clustered for state abstraction.
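As a minimal sketch of what normalization and hashing might look like (the field names and `TaskState` type here are illustrative assumptions, not a defined schema): every collection is sorted before serialization, so two snapshots that describe the same situation in different order always produce the same key.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskState:
    task_type: str
    slots_missing: tuple     # e.g. ("timezone_known",)
    environment: tuple       # e.g. (("provider", "google"), ("locale", "en-US"))
    recent_failures: tuple   # e.g. ("conflict_check_failed",)

def state_key(state: TaskState) -> str:
    # Canonicalize: sort every collection so ordering never changes the hash.
    payload = {
        "task_type": state.task_type,
        "slots_missing": sorted(state.slots_missing),
        "environment": sorted(state.environment),
        "recent_failures": sorted(state.recent_failures),
    }
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]
```

Downstream, these keys (or embeddings of the same payload) would feed the clustering step that groups similar states.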
2. Generate candidate plans
The LLM produces structured plans (not prose):
- Step graph (tool calls, questions, checks)
- Preconditions and invariants
- Commit gates (requires confirmation)
- Verification steps
Plans are compiled into a normalized representation so semantically equivalent plans map to the same arm.
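One plausible compilation step, sketched under the assumption that plans arrive as lists of step dicts: keep only the structural fields (operation, tool, argument schema, gate flag) and drop free-text rationale, so differently worded but semantically identical plans hash to the same arm.

```python
import hashlib
import json

def plan_arm_key(steps):
    """steps: list of dicts like {"op": "tool_call", "tool": "calendar.create",
    "args_schema": [...], "gate": bool, "rationale": "..."}.
    Free-text fields are dropped so wording differences don't create new arms."""
    canonical = [
        {
            "op": s["op"],
            "tool": s.get("tool"),
            "args_schema": sorted(s.get("args_schema", [])),
            "gate": bool(s.get("gate", False)),
        }
        for s in steps
    ]
    blob = json.dumps(canonical, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]
```

This only catches surface-level equivalence; deeper semantic dedup is an open problem discussed under Hard Problems.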
3. Select plan via explore/exploit
For the current state cluster, maintain a posterior over success for each known plan. Use Thompson Sampling or UCB:
- Exploit: Pick plans with proven outcomes in this state cluster
- Explore: Occasionally try new or uncertain plans, inside safe action sets only
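The selection step above can be sketched as Thompson Sampling over Beta posteriors (the `posteriors` shape is an assumed representation): each arm gets a sampled success rate, uncertain arms sample widely and so win occasionally, and exploration is restricted to the safe action set by construction.

```python
import random

def select_plan(posteriors, safe_arms):
    """posteriors: {arm_id: (successes, failures)} for the current state cluster.
    Thompson Sampling with Beta(1+s, 1+f) priors: sample a success rate per
    arm and pick the argmax. Only arms in the safe set are considered."""
    best_arm, best_draw = None, -1.0
    for arm in safe_arms:
        s, f = posteriors.get(arm, (0, 0))
        draw = random.betavariate(1 + s, 1 + f)  # unknown arms sample from Beta(1,1)
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm
```

A proven arm dominates most draws (exploit), while a new arm with a flat Beta(1, 1) posterior still gets sampled often enough to gather evidence (explore).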
4. Execute with safety gates
Hard gates prevent dangerous exploration:
- Irreversible actions require explicit confirmation
- Required verifications must pass before commit
- Tool calls validate schemas and constraints
- Sandboxing for UI/system actions
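A minimal sketch of how these gates could be checked before each step executes (the step/context field names are hypothetical): the gates are hard predicates evaluated outside the bandit, so exploration can never route around them.

```python
def check_gates(step, ctx):
    """step: dict describing one plan step; ctx: dict of runtime facts.
    Returns the list of violated gates; the step may run only if it's empty."""
    violations = []
    if step.get("irreversible") and not ctx.get("user_confirmed"):
        violations.append("confirmation_required")
    if step.get("requires_verification") and not ctx.get("verifications_passed"):
        violations.append("verification_pending")
    if not ctx.get("schema_valid", True):
        violations.append("schema_invalid")
    if step.get("touches_ui") and not ctx.get("sandboxed"):
        violations.append("sandbox_required")
    return violations
```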
Outcomes are scored by verifier strength:
- Strong: “Calendar event exists with 3 attendees and no conflicts”
- Medium: “Email draft accepted with minimal edits”
- Weak: “User corrected facts” (negative) / “user asked for rewrite” (partial failure)
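One way to turn this tiered signal into a scalar reward, sketched with assumed weights: weight each check by verifier strength, so a noisy weak signal moves the posterior far less than a strong machine check.

```python
# Illustrative weights — tuning these per deployment is an open design choice.
VERIFIER_WEIGHTS = {"strong": 1.0, "medium": 0.6, "weak": 0.25}

def episode_reward(checks):
    """checks: list of (strength, passed) pairs for one episode.
    Returns a weighted pass rate in [0, 1], or None if no checks ran."""
    num = sum(VERIFIER_WEIGHTS[s] for s, ok in checks if ok)
    den = sum(VERIFIER_WEIGHTS[s] for s, _ in checks)
    return num / den if den else None
```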
5. Update policy store
Store the episode, update plan posterior for that state cluster. Decay old evidence, version by environment/tool versions so drift doesn't poison learning.
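A sketch of the update with evidence decay, assuming the Beta-posterior representation from the selection step and an illustrative decay constant: old counts shrink a little on every update, so stale evidence gradually loses influence without being deleted outright.

```python
DECAY = 0.99  # assumed per-update exponential decay; tune per deployment

def update_posterior(posteriors, arm, reward, threshold=0.5):
    """posteriors: {arm_id: (successes, failures)} for one state cluster.
    Decays existing counts, then records the new episode as a success or
    failure depending on whether reward cleared the threshold."""
    s, f = posteriors.get(arm, (0.0, 0.0))
    s, f = s * DECAY, f * DECAY  # old evidence fades
    if reward >= threshold:
        s += 1.0
    else:
        f += 1.0
    posteriors[arm] = (s, f)
```

Versioning by environment/tool version would live one level up: a changed tool version keys into a fresh posterior table rather than decaying the old one.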
Why This Wins
- Learns the right thing. Strategies, not configs. Can fix “missing step” failures by promoting plans that include the right checks.
- Eliminates redundant exploration. An agent that figured out how to do X yesterday doesn't pay the discovery cost again today. The policy store is institutional memory for procedural knowledge.
- Direct cost savings. Fewer failed attempts means fewer token/API costs per successful outcome. Explore/exploit converges toward the cheapest successful strategy, not just any successful strategy. At enterprise scale, the savings compound fast — this is a line item CFOs can see.
- Replaces manual prompt engineering. Instead of a human observing failures, diagnosing them, and encoding fixes as prompt instructions, the system does this automatically from outcome signals. The human's job shifts from “teach the agent how” to “define what success looks like.”
- Per-deployment adaptation. Policy store is conditioned on environment context. No retraining needed for new tool versions or user preferences.
- Debuggable. Every decision backed by retrieved evidence (“we chose plan P because 93% success rate in this state cluster over 30 days”).
- Safe. Updates are local, versioned, reversible. No weight modification risk.
Hard Problems
State representation
“Build canonical state” does enormous work in one sentence. State abstraction is the hard part of the entire system. Too coarse and different situations collapse — the policy learns noise. Too fine and every situation is unique — the system never accumulates enough data to learn. Who defines the state schema? The developer? The LLM? Is it learned? Detecting aliasing via high reward variance is correct in theory but requires significant data volume per cluster.
Plan deduplication
“Two differently worded plans that do the same thing become the same arm” — how? Semantic equivalence of structured plan graphs is not a solved problem. Without reliable dedup, the bandit has a combinatorial explosion of arms, most semantically identical, and never accumulates enough signal on any single arm.
Verifier coverage
The system needs machine-checkable success signals. Works for scheduling meetings, filing tickets, structured CRUD. Breaks for creative tasks, ambiguous user intent, long-horizon outcomes. Weak signals like user edits may be too noisy to drive posterior updates meaningfully.
Credit assignment
Bandit-over-whole-plan is the wedge, but the real value is step-level learning — when to ask, when to verify, when to branch. Getting there requires strong intermediate reward signals that most deployments won't have initially.
Risks and Mitigations
| Risk | Detection | Mitigation |
|---|---|---|
| State aliasing | High reward variance for same (state, plan) | Add discriminating slots, split clusters |
| Reward hacking | Optimize proxy, not reality | Strengthen verifiers, add negative checks, require confirmations |
| Exploration damage | Unsafe actions during explore phase | Restrict to safe subsets, canary new plans, require verification before commits, maintain rollback |
| Plan explosion | Too many unique arms, no convergence | Better plan normalization, minimum execution threshold before creating new arm |
| Drift poisoning | Old policies applied to changed environment | Version policies by environment/tool version, decay old evidence |
Success Metrics
- First-try success rate by task type and state cluster
- Median steps and time to success
- Repeat failure rate (same state cluster fails twice → should approach zero)
- Cost per successful outcome
- Drift resilience (performance under tool/UI changes with policy versioning)
- Cold-start time for new deployments using shared policies vs. from scratch
Data Flywheel
The policy store enables a public/private split:
- Private policies: Learned from a specific deployment, stay with that deployment. Capture tool quirks, user preferences, environment context.
- Shared policies: Aggregated across deployments, anonymized, offered as starter policies to new customers. Eliminates cold start.
Every execution generates training signal → better policies → better next execution → more usage. Competitors can copy the architecture but not the accumulated policy data.