Replay Engine

Deterministic replay is Paprika's core differentiator: re-execute any previous run using recorded outputs instead of live API calls, with no external side effects. If behavior diverges, Paprika raises ReplayMismatchError with the exact step and a hash diff.

What Replay Does

Given an ExecutionRecord, replay re-executes the same agent code but returns recorded outputs instead of calling live LLMs or tools.

```python
# Original run (live API calls)
result = runtime.run(agent_name="researcher", input={})
# → Makes real LLM calls, searches the web, etc.
# → Saves ExecutionRecord with all inputs/outputs

# Replay (no live calls)
result = runtime.replay(run_id="abc123def456")
# → Re-executes the agent code
# → LLM calls return recorded responses
# → Tool calls return recorded results
# → No network, no side effects
```

Why It Matters

Safe Failure Reproduction

Production failure? Replay it safely.

Without replay: You must re-trigger the failure with live APIs, risking:

  • Duplicate email notifications
  • Duplicate refunds
  • Repeated API charges
  • External side effects

With replay: You step through the exact execution with zero external impact.

Regression Detection

Change your agent code? Detect regressions.

Without replay: You run tests against your new code, but it's hard to tell whether subtle behavior changed relative to earlier runs.

With replay: Replay your production traces against new code. If inputs diverge at any step, Paprika raises ReplayMismatchError. You've caught the regression before shipping.

Behavioral Confidence

"Did I break something?"

→ Replay old traces against new code. If there are no mismatches, behavior is identical. If there are mismatches, you know exactly which step and why.

How It Works

Under the Hood

When you call runtime.replay(run_id="abc123def456"):

  1. Load the ExecutionRecord from disk
  2. Extract recorded outputs from each step
  3. Build stub maps:
     • _llm_stubs[step_index] → cached LLM outputs
     • _tool_stubs[step_index] → cached tool results
     • _llm_hashes[step_index] → original input hashes
     • _tool_hashes[step_index] → original input hashes
  4. Re-execute the agent code:
     • When ctx.llm.call(...) is invoked, return the cached output from _llm_stubs
     • When ctx.tools.call(...) is invoked, return the cached result from _tool_stubs
     • Recompute the input hash at each step
  5. Compare hashes:
     • If the recomputed hash matches the original → ✓ the step continues
     • If the recomputed hash differs from the original → ✗ ReplayMismatchError
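The stub-map and hash-check mechanics above can be sketched in plain Python. This is an illustration of the mechanism, not Paprika's actual source; in particular, the `input_hash` scheme (SHA-256 over canonical JSON, truncated to 16 hex characters) is an assumption:

```python
import hashlib
import json

class ReplayMismatchError(Exception):
    """Raised when a replayed step's input hash differs from the recorded one."""
    def __init__(self, step_index, expected, actual):
        super().__init__(
            f"Replay mismatch at step {step_index}: expected {expected}, got {actual}")
        self.step_index, self.expected, self.actual = step_index, expected, actual

def input_hash(payload):
    # Canonical JSON (sorted keys) so identical inputs always hash identically.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]

class ReplayLLM:
    """Stands in for ctx.llm during replay: returns recorded outputs, no network."""
    def __init__(self, llm_stubs, llm_hashes):
        self._llm_stubs = llm_stubs      # step_index → cached LLM output
        self._llm_hashes = llm_hashes    # step_index → original input hash
        self._step = 0

    def call(self, **kwargs):
        step, self._step = self._step, self._step + 1
        actual = input_hash(kwargs)              # recompute the hash at this step
        expected = self._llm_hashes[step]
        if actual != expected:                   # inputs diverged → hard failure
            raise ReplayMismatchError(step, expected, actual)
        return self._llm_stubs[step]             # cached output instead of a live call
```

A tools wrapper for `ctx.tools` would follow the same pattern with `_tool_stubs` and `_tool_hashes`.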

Input Hash Recomputation

During replay, Paprika recomputes input hashes at each step.

Original run (step 0):

```python
# Agent code
response = ctx.llm.call(
    provider="openai",
    model="gpt-4o",
    input={"messages": [{"role": "user", "content": "What is AI?"}]}
)
# → Input hash: a1b2c3d4e5f6g7h8
# → Recorded in ExecutionRecord
```

Replay (step 0):

```python
# Same agent code
response = ctx.llm.call(
    provider="openai",
    model="gpt-4o",
    input={"messages": [{"role": "user", "content": "What is AI?"}]}
)
# → Paprika recomputes hash: a1b2c3d4e5f6g7h8
# → Matches original hash → ✓ continue
# → Return cached output (no live API call)
```

If the agent code changed:

```python
# Changed code
response = ctx.llm.call(
    provider="openai",
    model="gpt-4o",
    input={"messages": [{"role": "user", "content": "What is Machine Learning?"}]}  # ← Different!
)
# → Paprika recomputes hash: y9z0a1b2c3d4e5f6  (different!)
# → Does NOT match original: a1b2c3d4e5f6g7h8
# → Raise ReplayMismatchError
```
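The property at work — any change to the input, however small, changes the hash — can be demonstrated in isolation. SHA-256 over canonically serialized input is an assumed scheme for illustration; Paprika's real hash function may differ, but the sensitivity shown is the point:

```python
import hashlib
import json

def input_hash(payload):
    # Sort keys so dict ordering can never affect the digest.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]

original = {"provider": "openai", "model": "gpt-4o",
            "input": {"messages": [{"role": "user", "content": "What is AI?"}]}}
changed = {"provider": "openai", "model": "gpt-4o",
           "input": {"messages": [{"role": "user", "content": "What is Machine Learning?"}]}}

print(input_hash(original) == input_hash(original))  # True  — same input, same hash
print(input_hash(original) == input_hash(changed))   # False — one phrase changed → hash diverges
```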

Side-Effect Safety

Replay disables live API calls and real tool execution. This prevents:

  • ✗ Duplicate LLM API calls (no billing)
  • ✗ Duplicate emails sent by tools
  • ✗ Duplicate database mutations
  • ✗ Duplicate external API calls
  • ✗ Any side effects

All ctx.llm.call() and ctx.tools.call() return recorded, cached results.

Mismatch Detection

If the replayed agent produces a different input hash at any step, Paprika raises ReplayMismatchError.

Error format:

```python
from paprika import ReplayMismatchError

try:
    result = runtime.replay(run_id="abc123def456")
except ReplayMismatchError as e:
    print(e.step_index)      # 0 (which step)
    print(e.expected)        # "a1b2c3d4e5f6g7h8" (original hash)
    print(e.actual)          # "y9z0a1b2c3d4e5f6" (new hash)
    print(str(e))            # "Replay mismatch at step 0: expected a1b2c3d4e5f6g7h8, got y9z0a1b2c3d4e5f6"
```

Why Mismatches Happen

  • Code change — you changed the LLM prompt or tool arguments
  • Logic change — you changed which tools are called or in what order
  • Conditional change — you changed an if/else that affects execution flow
  • Anything that changes inputs — code change → input change → hash change

What Mismatches Mean

A mismatch means: the agent's behavior changed at this step.

This is exactly what you want to catch before shipping. It tells you:

  • "You changed the code"
  • "The change affects agent behavior"
  • "Here's the exact step where it diverges"

Full Workflow Example

1. Original Run

```python
@runtime.agent(name="researcher")
def researcher(ctx):
    response = ctx.llm.call(
        provider="openai",
        model="gpt-4o",
        input={"messages": [{"role": "user", "content": "What is AI?"}]}
    )
    return response.get("choices", [{}])[0].get("message", {}).get("content", "")

result = runtime.run("researcher", {})
# → Saves ExecutionRecord with record_id "abc123def456"
# → LLM returns: "AI is artificial intelligence"
# → Input hash at step 0: a1b2c3d4e5f6g7h8
```

2. Code Change

```python
@runtime.agent(name="researcher")
def researcher(ctx):
    response = ctx.llm.call(
        provider="openai",
        model="gpt-4o",
        input={"messages": [{"role": "user", "content": "What is Machine Learning?"}]}  # ← CHANGED
    )
    return response.get("choices", [{}])[0].get("message", {}).get("content", "")
```

3. Replay Against Old Trace

```python
try:
    result = runtime.replay(run_id="abc123def456")
except ReplayMismatchError as e:
    print(f"Mismatch at step {e.step_index}")
    print(f"Expected: {e.expected}")
    print(f"Actual: {e.actual}")
    # Output:
    # Mismatch at step 0
    # Expected: a1b2c3d4e5f6g7h8
    # Actual: y9z0a1b2c3d4e5f6
```

4. Interpret

Paprika caught the change. The prompt changed → the input hash changed → the behavioral change was detected.

You know before shipping that this code change affects agent behavior.

Diffing Two Runs

Compare original and replayed runs:

```bash
paprika runs diff abc123def456 xyz789abc123
```

Output:

```plaintext
Step 0 (llm_call): MISMATCH
  Expected hash: a1b2c3d4e5f6g7h8
  Actual hash: y9z0a1b2c3d4e5f6
  Provider: openai
  Model: gpt-4o

Step 1 (tool_call): MATCH
  Hash: i9j0k1l2m3n4o5p6
  Tool: search

Step 2 (llm_call): MATCH
  Hash: q7r8s9t0u1v2w3x4
```

The diff shows which steps changed.
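A minimal version of that diff can be written over two lists of step records. The `type`/`hash` field names mirror the output above but are assumptions about the record schema, and `diff_steps` is a hypothetical helper, not a Paprika API:

```python
def diff_steps(original_steps, replayed_steps):
    """Compare two runs step-by-step by input hash; return report lines."""
    report = []
    for i, (a, b) in enumerate(zip(original_steps, replayed_steps)):
        if a["hash"] == b["hash"]:
            report.append(f"Step {i} ({a['type']}): MATCH")
        else:
            report.append(f"Step {i} ({a['type']}): MISMATCH "
                          f"(expected {a['hash']}, actual {b['hash']})")
    return report

run_a = [{"type": "llm_call", "hash": "a1b2c3d4e5f6g7h8"},
         {"type": "tool_call", "hash": "i9j0k1l2m3n4o5p6"}]
run_b = [{"type": "llm_call", "hash": "y9z0a1b2c3d4e5f6"},
         {"type": "tool_call", "hash": "i9j0k1l2m3n4o5p6"}]
for line in diff_steps(run_a, run_b):
    print(line)
```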

Connecting Runs

Replayed runs are marked with a replay_of field in the ExecutionRecord:

```json
{
  "record_id": "xyz789abc123",
  "replay_of": "abc123def456",  // ← Original run
  ...
}
```

This links the replay back to its original. Useful for tracking:

  • "Which run was this a replay of?"
  • "Did we find a mismatch when we replayed?"
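Treating records as plain dicts, the link is easy to traverse; `find_replays` below is a hypothetical helper for illustration, not a Paprika API:

```python
def find_replays(records, original_id):
    """Return the ids of all records that were replays of a given run."""
    return [r["record_id"] for r in records if r.get("replay_of") == original_id]

records = [
    {"record_id": "abc123def456"},                               # original run
    {"record_id": "xyz789abc123", "replay_of": "abc123def456"},  # its replay
]
print(find_replays(records, "abc123def456"))  # ['xyz789abc123']
```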

Workflow: Regression Testing

  1. Run your agent in production

     ```plaintext
     ExecutionRecord saved: abc123def456
     ```

  2. Change your code (bug fix, prompt change, logic update)

  3. Replay all production traces

     ```python
     for run_summary in runtime.trace_store.list_runs(limit=100):
         try:
             runtime.replay(run_id=run_summary.run_id)
             print(f"✓ {run_summary.run_id}: no mismatch")
         except ReplayMismatchError as e:
             print(f"✗ {run_summary.run_id}: mismatch at step {e.step_index}")
     ```

  4. If there are mismatches, investigate:
     • Is the behavioral change intentional?
     • If not, revert the code change
     • If yes, document the intentional change and move forward

This is behavioral regression testing.
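The loop in step 3 generalizes to a small harness that partitions runs into passing and failing. `fake_replay` below stands in for `runtime.replay` so the sketch is self-contained; a simplified `ReplayMismatchError` is redefined for the same reason:

```python
class ReplayMismatchError(Exception):
    def __init__(self, step_index):
        super().__init__(f"mismatch at step {step_index}")
        self.step_index = step_index

def replay_all(replay_fn, run_ids):
    """Replay every stored run; partition ids into passed / failed-with-step."""
    passed, failed = [], []
    for run_id in run_ids:
        try:
            replay_fn(run_id)
            passed.append(run_id)
        except ReplayMismatchError as e:
            failed.append((run_id, e.step_index))
    return passed, failed

# Stand-in for runtime.replay: pretend run "regressed01" diverges at step 0.
def fake_replay(run_id):
    if run_id == "regressed01":
        raise ReplayMismatchError(0)

print(replay_all(fake_replay, ["abc123def456", "regressed01"]))
# → (['abc123def456'], [('regressed01', 0)])
```

In CI you could fail the build whenever `failed` is non-empty, which is exactly the behavioral regression gate described above.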

Current Limitations

  • No batch replay yet — replay one run at a time (manual loop for batch)
  • No counterfactual patching yet — can't modify individual steps before replaying
  • Replay is deterministic but not live — no A/B testing against live models
  • Only deterministic replay — future: interactive debugging with stepper

Next Steps