Replay Engine
Deterministic replay is Paprika's core differentiator. Re-execute any previous run using recorded outputs instead of making live API calls, with no external side effects. If behavior diverges, Paprika raises ReplayMismatchError with the exact step and a hash diff.
What Replay Does
Given an ExecutionRecord, replay re-executes the same agent code but returns recorded outputs instead of calling live LLMs or tools.
```python
# Original run (live API calls)
result = runtime.run(agent_name="researcher", input={})
# → Makes real LLM calls, searches the web, etc.
# → Saves ExecutionRecord with all inputs/outputs

# Replay (no live calls)
result = runtime.replay(run_id="abc123def456")
# → Re-executes the agent code
# → LLM calls return recorded responses
# → Tool calls return recorded results
# → No network, no side effects
```

Why It Matters
Safe Failure Reproduction
Production failure? Replay it safely.
Without replay: You must re-trigger the failure with live APIs, risking:
- Duplicate email notifications
- Duplicate refunds
- Repeated API charges
- External side effects
With replay: You step through the exact execution with zero external impact.
Regression Detection
Change your agent code? Detect regressions.
Without replay: You run tests against your new code. But are the test outcomes the same as before? Hard to know if subtle behavior changed.
With replay: Replay your production traces against new code. If inputs diverge at any step, Paprika raises ReplayMismatchError. You've caught the regression before shipping.
Behavioral Confidence
"Did I break something?"
→ Replay old traces against new code. If there are no mismatches, behavior is identical. If there are mismatches, you know exactly which step and why.
How It Works
Under the Hood
When you call runtime.replay(run_id="abc123def456"):
- Load the ExecutionRecord from disk
- Extract recorded outputs from each step
- Build stub maps:
  - _llm_stubs[step_index] → cached LLM outputs
  - _tool_stubs[step_index] → cached tool results
  - _llm_hashes[step_index] → original input hashes
  - _tool_hashes[step_index] → original input hashes
- Re-execute the agent code:
  - When ctx.llm.call(...) is invoked, return the cached output from _llm_stubs
  - When ctx.tools.call(...) is invoked, return the cached result from _tool_stubs
- Recompute the input hash at each step
- Compare hashes:
  - If the recomputed hash matches the original → ✓ the step continues
  - If the recomputed hash differs from the original → ✗ ReplayMismatchError
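The stub-and-hash bookkeeping above can be sketched as follows. This is a minimal illustration, not Paprika's internals: the step-dict layout (`kind`, `input_hash`, `output`) and the canonical-JSON hashing are assumptions.

```python
import hashlib
import json

def build_stub_maps(record_steps):
    """Index recorded outputs and original input hashes by step position."""
    llm_stubs, tool_stubs, hashes = {}, {}, {}
    for i, step in enumerate(record_steps):
        hashes[i] = step["input_hash"]
        if step["kind"] == "llm_call":
            llm_stubs[i] = step["output"]
        elif step["kind"] == "tool_call":
            tool_stubs[i] = step["output"]
    return llm_stubs, tool_stubs, hashes

def replay_llm_call(step_index, live_input, llm_stubs, hashes):
    """Serve the cached output if the recomputed hash matches the original."""
    recomputed = hashlib.sha256(
        json.dumps(live_input, sort_keys=True).encode()
    ).hexdigest()
    if recomputed != hashes[step_index]:
        raise RuntimeError(
            f"Replay mismatch at step {step_index}: "
            f"expected {hashes[step_index]}, got {recomputed}"
        )
    return llm_stubs[step_index]  # cached output, no live API call
```

The key property is that the cached output is only released when the recomputed input hash matches what was recorded, so a code change can never silently reuse stale outputs.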
Input Hash Recomputation
During replay, Paprika recomputes input hashes at each step.
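One plausible way to compute such a hash (an assumption for illustration; the document does not specify Paprika's exact canonicalization) is to serialize the call input to canonical JSON and hash that:

```python
import hashlib
import json

def input_hash(call_input: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) makes the hash
    # depend only on content, not on dict key ordering.
    canonical = json.dumps(call_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = input_hash({"messages": [{"role": "user", "content": "What is AI?"}]})
b = input_hash({"messages": [{"role": "user", "content": "What is Machine Learning?"}]})
# a != b: changing the prompt changes the hash, which is what replay detects
```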
Original run (step 0):
```python
# Agent code
response = ctx.llm.call(
    provider="openai",
    model="gpt-4o",
    input={"messages": [{"role": "user", "content": "What is AI?"}]}
)
# → Input hash: a1b2c3d4e5f6g7h8
# → Recorded in ExecutionRecord
```

Replay (step 0):
```python
# Same agent code
response = ctx.llm.call(
    provider="openai",
    model="gpt-4o",
    input={"messages": [{"role": "user", "content": "What is AI?"}]}
)
# → Paprika recomputes hash: a1b2c3d4e5f6g7h8
# → Matches original hash → ✓ continue
# → Return cached output (no live API call)
```

If the agent code changed:
```python
# Changed code
response = ctx.llm.call(
    provider="openai",
    model="gpt-4o",
    input={"messages": [{"role": "user", "content": "What is Machine Learning?"}]}  # ← Different!
)
# → Paprika recomputes hash: y9z0a1b2c3d4e5f6 (different!)
# → Does NOT match original: a1b2c3d4e5f6g7h8
# → Raise ReplayMismatchError
```

Side-Effect Safety
Replay disables live API calls and real tool execution. This prevents:
- ✗ Duplicate LLM API calls (no billing)
- ✗ Duplicate emails sent by tools
- ✗ Duplicate database mutations
- ✗ Duplicate external API calls
- ✗ Any side effects
All ctx.llm.call() and ctx.tools.call() return recorded, cached results.
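A minimal sketch of how a replay context can guarantee this (hypothetical class name; not Paprika's internals): every call path resolves to a recorded result, and anything unrecorded fails fast instead of reaching the network.

```python
class ReplayLLM:
    """Stands in for ctx.llm during replay: serves cached outputs only."""

    def __init__(self, stubs):
        self._stubs = stubs  # step_index -> recorded output
        self._step = 0

    def call(self, **kwargs):
        if self._step not in self._stubs:
            # Fail fast rather than fall through to a live API call.
            raise RuntimeError(f"no recorded output for step {self._step}")
        output = self._stubs[self._step]
        self._step += 1
        return output
```

Because the stub object never holds an API client or network handle, there is nothing for replayed code to accidentally invoke.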
Mismatch Detection
If the replayed agent produces a different input hash at any step, Paprika raises ReplayMismatchError.
Error format:
```python
from paprika import ReplayMismatchError

try:
    result = runtime.replay(run_id="abc123def456")
except ReplayMismatchError as e:
    print(e.step_index)  # 0 (which step)
    print(e.expected)    # "a1b2c3d4e5f6g7h8" (original hash)
    print(e.actual)      # "y9z0a1b2c3d4e5f6" (new hash)
    print(str(e))        # "Replay mismatch at step 0: expected a1b2c3d4e5f6g7h8, got y9z0a1b2c3d4e5f6"
```

Why Mismatches Happen
- Code change — you changed the LLM prompt or tool arguments
- Logic change — you changed which tools are called or in what order
- Conditional change — you changed an if/else that affects execution flow
- Anything that changes inputs — code change → input change → hash change
What Mismatches Mean
A mismatch means: the agent's behavior changed at this step.
This is exactly what you want to catch before shipping. It tells you:
- "You changed the code"
- "The change affects agent behavior"
- "Here's the exact step where it diverges"
Full Workflow Example
1. Original Run
```python
@runtime.agent(name="researcher")
def researcher(ctx):
    response = ctx.llm.call(
        provider="openai",
        model="gpt-4o",
        input={"messages": [{"role": "user", "content": "What is AI?"}]}
    )
    return response.get("choices", [{}])[0].get("message", {}).get("content", "")

result = runtime.run("researcher", {})
# → Saves ExecutionRecord with record_id "abc123def456"
# → LLM returns: "AI is artificial intelligence"
# → Input hash at step 0: a1b2c3d4e5f6g7h8
```

2. Code Change
```python
@runtime.agent(name="researcher")
def researcher(ctx):
    response = ctx.llm.call(
        provider="openai",
        model="gpt-4o",
        input={"messages": [{"role": "user", "content": "What is Machine Learning?"}]}  # ← CHANGED
    )
    return response.get("choices", [{}])[0].get("message", {}).get("content", "")
```

3. Replay Against Old Trace
```python
try:
    result = runtime.replay(run_id="abc123def456")
except ReplayMismatchError as e:
    print(f"Mismatch at step {e.step_index}")
    print(f"Expected: {e.expected}")
    print(f"Actual: {e.actual}")

# Output:
# Mismatch at step 0
# Expected: a1b2c3d4e5f6g7h8
# Actual: y9z0a1b2c3d4e5f6
```

4. Interpret
Paprika caught the change. The prompt changed → the input hash changed → the behavioral change was detected.
You know before shipping that this code change affects agent behavior.
Diffing Two Runs
Compare original and replayed runs:
```shell
paprika runs diff abc123def456 xyz789abc123
```

Output:
```
Step 0 (llm_call): MISMATCH
  Expected hash: a1b2c3d4e5f6g7h8
  Actual hash:   y9z0a1b2c3d4e5f6
  Provider: openai
  Model: gpt-4o
Step 1 (tool_call): MATCH
  Hash: i9j0k1l2m3n4o5p6
  Tool: search
Step 2 (llm_call): MATCH
  Hash: q7r8s9t0u1v2w3x4
```

The diff shows which steps changed.
Connecting Runs
Replayed runs are marked with a replay_of field in the ExecutionRecord:
```json
{
  "record_id": "xyz789abc123",
  "replay_of": "abc123def456",  // ← Original run
  ...
}
```

This links the replay back to its original. Useful for tracking:
- "Which run was this a replay of?"
- "Did we find a mismatch when we replayed?"
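Given records loaded as dicts with that shape, following the replay_of links back to the original run is a few lines (a hypothetical helper; `records` stands in for however you load ExecutionRecords from the trace store):

```python
def replay_chain(records: dict, record_id: str) -> list:
    """Follow replay_of links from a replayed run back to the original.

    records maps record_id -> ExecutionRecord-like dict containing an
    optional "replay_of" field, as shown above.
    """
    chain = [record_id]
    while records[chain[-1]].get("replay_of"):
        chain.append(records[chain[-1]]["replay_of"])
    return chain  # newest first, original run last
```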
Workflow: Regression Testing
- Run your agent in production. ExecutionRecord saved: abc123def456
- Change your code (bug fix, prompt change, logic update)
- Replay all production traces:

```python
for run_summary in runtime.trace_store.list_runs(limit=100):
    try:
        runtime.replay(run_id=run_summary.run_id)
        print(f"✓ {run_summary.run_id}: no mismatch")
    except ReplayMismatchError as e:
        print(f"✗ {run_summary.run_id}: mismatch at step {e.step_index}")
```

- If there are mismatches, investigate:
- Is the behavioral change intentional?
- If not, revert the code change
- If yes, document the intentional change and move forward
This is behavioral regression testing.
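The per-run loop above generalizes to a helper that partitions runs into passing and diverging, which is convenient in CI. A sketch only: ReplayMismatchError is stubbed here so the snippet is self-contained; in real use it comes from paprika, and replay_fn would be runtime.replay.

```python
class ReplayMismatchError(Exception):
    """Stand-in for paprika.ReplayMismatchError (illustration only)."""

    def __init__(self, step_index, expected, actual):
        super().__init__(f"Replay mismatch at step {step_index}: "
                         f"expected {expected}, got {actual}")
        self.step_index = step_index
        self.expected = expected
        self.actual = actual

def regression_check(replay_fn, run_ids):
    """Replay each run; catch mismatches per run so one regression
    doesn't hide the others."""
    passed, failed = [], []
    for run_id in run_ids:
        try:
            replay_fn(run_id)
            passed.append(run_id)
        except ReplayMismatchError as e:
            failed.append((run_id, e.step_index))
    return passed, failed
```

An empty failed list means the new code is behaviorally identical to the recorded runs; a non-empty list tells you exactly which runs and steps to inspect.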
Current Limitations
- No batch replay yet — replay one run at a time (manual loop for batch)
- No counterfactual patching yet — can't modify individual steps before replaying
- Replay is deterministic but not live — no A/B testing against live models
- Only deterministic replay — future: interactive debugging with stepper
Next Steps
- Set runtime policies: Policies
- Inspect ExecutionRecords: Execution Records
- Integrate with your framework: Integrations
- Use CLI to manage runs: CLI