Mark Lubin


Experiential Policy Memory

A self-improving procedural agent layer where the LLM invents candidate strategies and a bandit-backed policy memory learns, from real outcomes, which strategies work in which states — online, safely, and without weight updates.


The Problem

Agentic systems have three compounding cost problems:

They fail for structural reasons. Missing steps, wrong tool order, brittle UI/tool interaction, untracked constraints, lack of verification. These aren't model capability issues. They're procedural competence issues.

They figure things out from scratch every time. An agent that successfully navigated a complex workflow yesterday has zero memory of that success today. Every invocation pays the full exploration cost again. Without manual prompt engineering to encode “here's how to do X,” the agent reinvents the wheel on every run.

The exploration cost is real money. Every failed attempt, every redundant tool call, every retry burns tokens and API calls. An agent that takes five attempts to get something right costs five times as much as one that learned from its first success. At enterprise scale, across thousands of daily tasks, the waste is enormous.

Current approaches optimize the wrong surface area. The gap between “semantic RAG memory” and “full RL with weight updates” is real and underserved. EPM fills it — and the value proposition is both higher success rates and lower cost per successful outcome as the system learns.


The Insight

The minimal structure that creates real learning is:

  1. Canonical semantic state — a typed snapshot of what is known, what is missing, and what environment/tool context applies
  2. Constrained action/plan representation — a structured plan graph that can be executed, compared, and deduplicated independent of wording
  3. Verifiable outcomes — a reward signal that reflects real success, not vibes
  4. Online selection + update — explore/exploit that allocates traffic to strategies that work and retires strategies that don't

This is experiential learning without weight updates. Learning lives in a policy memory substrate, not in model parameters.


How It Works

1. Build canonical state

For each task attempt, construct a typed state snapshot. Example fields for a schedule_meeting task might include the known attendees, the unresolved constraints, and the calendar tool in scope.

The state is normalized, hashed, and clustered for state abstraction.
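A minimal sketch of the state-building step, assuming a hypothetical schema for schedule_meeting (the field names `known_attendees`, `missing_fields`, `calendar_tool`, and `tool_version` are illustrative, not part of the original design):

```python
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)
class MeetingState:
    known_attendees: tuple  # resolved participant identifiers
    missing_fields: tuple   # e.g. ("time_window", "room")
    calendar_tool: str      # environment/tool identifier
    tool_version: str       # lets policies be versioned against drift

def canonical_hash(state: MeetingState) -> str:
    """Normalize (sort, lowercase) then hash, so equivalent snapshots
    map to the same state key regardless of field ordering."""
    payload = {
        "known_attendees": sorted(a.lower() for a in state.known_attendees),
        "missing_fields": sorted(state.missing_fields),
        "calendar_tool": state.calendar_tool,
        "tool_version": state.tool_version,
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

s1 = MeetingState(("Bob", "alice"), ("time_window",), "gcal", "v2")
s2 = MeetingState(("Alice", "bob"), ("time_window",), "gcal", "v2")
assert canonical_hash(s1) == canonical_hash(s2)  # same canonical state
```

Hashing the normalized snapshot gives a cheap first-pass state key; clustering then groups nearby keys for abstraction.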

2. Generate candidate plans

The LLM produces structured plans, not prose.

Plans are compiled into a normalized representation so semantically equivalent plans map to the same arm.
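One way to sketch that compilation step, assuming plans are lists of (tool, args) steps — the step shape and tool names here are illustrative assumptions:

```python
import hashlib
import json

def normalize_plan(steps):
    """Canonicalize a plan so differently worded but equivalent plans
    compile to the same bandit arm: lowercase tool names and sort
    argument keys, discarding cosmetic variation."""
    canon = []
    for step in steps:
        canon.append({
            "tool": step["tool"].strip().lower(),
            "args": {k: step["args"][k] for k in sorted(step["args"])},
        })
    return canon

def arm_id(steps) -> str:
    blob = json.dumps(normalize_plan(steps), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

plan_a = [{"tool": "Calendar.find_slots", "args": {"days": 5, "attendees": ["a", "b"]}},
          {"tool": "calendar.book", "args": {"slot": "best"}}]
plan_b = [{"tool": "calendar.find_slots", "args": {"attendees": ["a", "b"], "days": 5}},
          {"tool": "Calendar.book", "args": {"slot": "best"}}]
assert arm_id(plan_a) == arm_id(plan_b)  # same arm despite surface differences
```

Surface-level normalization like this only merges trivially equivalent plans; the deeper semantic-equivalence problem is discussed under Hard Problems.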

3. Select plan via explore/exploit

For the current state cluster, maintain a posterior over success for each known plan, and select with Thompson Sampling or UCB.
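A minimal Thompson Sampling sketch over plan arms for one state cluster, with each arm keeping a Beta posterior over success (class and method names are illustrative):

```python
import random
from collections import defaultdict

class PlanBandit:
    def __init__(self):
        # arm_id -> [alpha, beta], initialized to a uniform Beta(1, 1)
        self.stats = defaultdict(lambda: [1.0, 1.0])

    def select(self, arm_ids):
        """Sample each arm's posterior; the highest draw wins, so traffic
        flows to plans that work while uncertain arms still get explored."""
        return max(arm_ids, key=lambda a: random.betavariate(*self.stats[a]))

    def update(self, arm_id, reward, decay=0.99):
        """Decay old evidence, then credit the observed outcome in [0, 1]."""
        s = self.stats[arm_id]
        s[0] = s[0] * decay + reward
        s[1] = s[1] * decay + (1.0 - reward)

bandit = PlanBandit()
bandit.update("plan_a", 1.0)  # plan_a succeeded
bandit.update("plan_b", 0.0)  # plan_b failed
chosen = bandit.select(["plan_a", "plan_b"])
```

The decay term keeps the posterior responsive to recent outcomes, which matters later for drift handling.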

4. Execute with safety gates

Hard gates prevent dangerous exploration: exploratory plans are restricted to safe action subsets, new plans are canaried, and irreversible commits require verification first.
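A sketch of such gates, assuming a simple allow-list model — the tool names and categories below are illustrative, not from the actual design:

```python
# Read-only tools that are always safe to explore with (assumed examples).
SAFE_TOOLS = {"calendar.find_slots", "crm.read", "ticket.read"}
# Tools whose effects cannot be rolled back (assumed examples).
IRREVERSIBLE = {"calendar.book", "email.send", "ticket.close"}

def gate(plan_steps, arm_is_canary: bool, verified: bool):
    """Reject a plan before execution if exploration could do damage.
    Returns (allowed, reason)."""
    for step in plan_steps:
        tool = step["tool"]
        if arm_is_canary and tool not in SAFE_TOOLS:
            return False, f"canary plans restricted to safe tools: {tool}"
        if tool in IRREVERSIBLE and not verified:
            return False, f"{tool} requires verification before commit"
    return True, "ok"

allowed, reason = gate([{"tool": "email.send"}], arm_is_canary=False, verified=False)
# a blocked plan falls back to a known-good arm instead of executing
```

Gating runs before execution, so an unproven arm can never reach an irreversible action unverified.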

Outcomes are scored by verifier strength: a machine-checkable success counts for more than a weak signal such as the user accepting output without edits.
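One way to encode verifier strength, where the reward equals the strongest verifier that passed — the tiers and weights here are assumptions for illustration:

```python
VERIFIER_WEIGHTS = {
    "machine_check": 1.0,  # e.g. the meeting actually exists on the calendar
    "schema_check": 0.6,   # output parsed and validated against a schema
    "user_signal": 0.2,    # weak: user accepted the result without edits
}

def score(outcomes: dict) -> float:
    """Reward is the weight of the strongest passing verifier, so a weak
    user signal alone can never claim full success."""
    return max((VERIFIER_WEIGHTS[k] for k, v in outcomes.items() if v),
               default=0.0)

r = score({"schema_check": True, "user_signal": True})
# strongest passing verifier is schema_check -> reward 0.6
```

Capping weak signals this way limits how much noisy evidence (like user edits) can move a plan's posterior.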

5. Update policy store

Store the episode, update plan posterior for that state cluster. Decay old evidence, version by environment/tool versions so drift doesn't poison learning.
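A sketch of that update, keying posteriors by (state cluster, arm, environment version) so a tool upgrade starts fresh evidence instead of poisoning the old posterior — class and field names are illustrative:

```python
from collections import defaultdict

class PolicyStore:
    def __init__(self, decay=0.99):
        self.decay = decay
        # (state_cluster, arm_id, env_version) -> [alpha, beta]
        self.posteriors = defaultdict(lambda: [1.0, 1.0])
        self.episodes = []  # full traces kept for audit and replay

    def record(self, state_cluster, arm_id, env_version, reward, trace):
        key = (state_cluster, arm_id, env_version)
        a, b = self.posteriors[key]
        # exponential decay keeps recent outcomes dominant
        self.posteriors[key] = [a * self.decay + reward,
                                b * self.decay + (1.0 - reward)]
        self.episodes.append({"key": key, "reward": reward, "trace": trace})

store = PolicyStore()
store.record("cluster_7", "plan_a", "gcal_v2", 1.0, {"steps": 3})
store.record("cluster_7", "plan_a", "gcal_v3", 0.0, {"steps": 3})
# same plan, different environment version -> independent evidence
```

Because `env_version` is part of the key, evidence gathered against an old tool version simply stops being selected once the environment changes.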


Why This Wins


Hard Problems

State representation

“Build canonical state” does enormous work in one sentence. State abstraction is the hard part of the entire system. Too coarse and different situations collapse — the policy learns noise. Too fine and every situation is unique — the system never accumulates enough data to learn. Who defines the state schema? The developer? The LLM? Is it learned? Detecting aliasing via high reward variance is correct in theory but requires significant data volume per cluster.

Plan deduplication

“Two differently worded plans that do the same thing become the same arm” — how? Semantic equivalence of structured plan graphs is not a solved problem. Without reliable dedup, the bandit has a combinatorial explosion of arms, most semantically identical, and never accumulates enough signal on any single arm.

Verifier coverage

The system needs machine-checkable success signals. Works for scheduling meetings, filing tickets, structured CRUD. Breaks for creative tasks, ambiguous user intent, long-horizon outcomes. Weak signals like user edits may be too noisy to drive posterior updates meaningfully.

Credit assignment

Bandit-over-whole-plan is the wedge, but the real value is step-level learning — when to ask, when to verify, when to branch. Getting there requires strong intermediate reward signals that most deployments won't have initially.


Risks and Mitigations

| Risk | Detection | Mitigation |
| --- | --- | --- |
| State aliasing | High reward variance for same (state, plan) | Add discriminating slots, split clusters |
| Reward hacking | Optimizing the proxy, not reality | Strengthen verifiers, add negative checks, require confirmations |
| Exploration damage | Unsafe actions during the explore phase | Restrict to safe subsets, canary new plans, require verification before commits, maintain rollback |
| Plan explosion | Too many unique arms, no convergence | Better plan normalization, minimum execution threshold before creating a new arm |
| Drift poisoning | Old policies applied to a changed environment | Version policies by environment/tool version, decay old evidence |

Success Metrics


Data Flywheel

The policy store enables a public/private split.

Every execution generates training signal → better policies → better next execution → more usage. Competitors can copy the architecture but not the accumulated policy data.