Quickstart
Get from zero to "I understand replay mismatch detection" in 15 minutes.
1. Install
```shell
pip install paprika
```
Requires Python 3.11+.
2. Create a Multi-Step Agent
This agent researches a topic by making an LLM call and a tool call.
Create agent.py:
```python
from paprika import PaprikaRuntime, PolicyConfig, PolicyViolationError, ReplayMismatchError

# Create runtime with policies
runtime = PaprikaRuntime(
    policy=PolicyConfig(
        max_steps=10,
        max_tokens=10000,
        max_repeat_hashes=3
    )
)

# Register a tool
def search(query: str) -> str:
    """Mock search tool—returns a canned result."""
    return f"Search results for '{query}': AI is advancing rapidly."

runtime.register_tool("search", search)

# Define an agent
@runtime.agent(name="researcher")
def researcher(ctx):
    """Research agent: asks LLM a question, then searches for context."""
    # Step 1: Ask LLM for a research question
    llm_response = ctx.llm.call(
        provider="mock",
        model="gpt-4o",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "Generate a research question about AI trends"
                }
            ]
        }
    )
    question = llm_response.get("choices", [{}])[0].get("message", {}).get("content", "What is AI?")
    print(f"Generated question: {question}")

    # Step 2: Use a tool to search
    search_result = ctx.tools.call(
        name="search",
        args={"query": question}
    )
    print(f"Search result: {search_result}")

    # Step 3: Summarize with LLM
    summary = ctx.llm.call(
        provider="mock",
        model="gpt-4o",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": f"Summarize: {search_result}"
                }
            ]
        }
    )
    summary_text = summary.get("choices", [{}])[0].get("message", {}).get("content", "No summary")

    return {
        "question": question,
        "search_result": search_result,
        "summary": summary_text
    }

# Mock provider setup (for deterministic examples without API keys)
runtime.trace_store.base_dir.mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    # Run the agent
    try:
        result = runtime.run(
            agent_name="researcher",
            input={}
        )
        print("\n✓ Agent run completed successfully")
        print(f"Result: {result}")
    except PolicyViolationError as e:
        print(f"\n✗ Policy violation: {e.policy_name}")
        print(f"  {e.details}")
    except Exception as e:
        print(f"\n✗ Error: {e}")
```
For this quickstart, we're using `provider="mock"` so the agent returns canned responses and doesn't need live API keys. The example is fully deterministic.
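The `max_repeat_hashes` policy above guards against loops where an agent keeps issuing the identical call. As a rough illustration of the idea (a hypothetical stand-alone sketch, not Paprika's internals), a runtime can count occurrences of each step's input hash and halt once any hash repeats too often:

```python
from collections import Counter

class PolicyViolation(Exception):
    """Raised when a run exceeds a configured policy limit."""

def check_repeat_hashes(step_hashes, max_repeat_hashes=3):
    # Count each input hash as steps arrive; a hash seen more than
    # max_repeat_hashes times suggests the agent is stuck in a loop.
    counts = Counter()
    for i, h in enumerate(step_hashes):
        counts[h] += 1
        if counts[h] > max_repeat_hashes:
            raise PolicyViolation(
                f"input hash {h!r} repeated more than {max_repeat_hashes} times (step {i})"
            )
```

With `max_repeat_hashes=3`, a run whose steps produce the same hash four times would be stopped, while three repeats would still be allowed.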
Run it:
```shell
python agent.py
```
Output:
```
Generated question: What is the latest trend in AI research?
Search result: Search results for 'What is the latest trend in AI research?': AI is advancing rapidly.

✓ Agent run completed successfully
Result: {'question': 'What is the latest trend in AI research?', 'search_result': "Search results for 'What is the latest trend in AI research?': AI is advancing rapidly.", 'summary': 'No summary'}
```
3. Inspect the Trace
List recent runs:
```shell
paprika runs list
```
Output:
```
Run ID          Agent         Started               Status    Steps
───────────────────────────────────────────────────────────────────
abc123def456    researcher    2024-01-15 14:32:10   success   3
```
Inspect the full trace:
```shell
paprika runs inspect abc123def456
```
Output (condensed):
```
Record ID:    abc123def456
Agent:        researcher
Started:      2024-01-15 14:32:10 UTC
Ended:        2024-01-15 14:32:10 UTC
Duration:     125ms
Status:       success
Total tokens: 0
Steps:        3

Step 0: llm_call (gpt-4o)
  Provider:   mock
  Model:      gpt-4o
  Input hash: a1b2c3d4e5f6g7h8
  Tokens:     0
  Duration:   10ms

Step 1: tool_call (search)
  Tool:       search
  Input hash: i9j0k1l2m3n4o5p6
  Duration:   5ms

Step 2: llm_call (gpt-4o)
  Provider:   mock
  Model:      gpt-4o
  Input hash: q7r8s9t0u1v2w3x4
  Tokens:     0
  Duration:   10ms
```
The trace includes every step, input hash, duration, and tokens consumed.
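The input hashes in the trace act as content-addressed fingerprints of each step's input. One common way to build such a fingerprint (an assumed scheme for illustration, not necessarily what Paprika uses internally) is to serialize the input to canonical JSON and hash it:

```python
import hashlib
import json

def input_hash(payload: dict) -> str:
    # Canonicalize: sorted keys and fixed separators so logically equal
    # payloads always serialize to the same bytes, then SHA-256.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

h1 = input_hash({"messages": [{"role": "user", "content": "Generate a research question about AI trends"}]})
h2 = input_hash({"messages": [{"role": "user", "content": "Generate a research question about LLMs"}]})
assert h1 == input_hash({"messages": [{"role": "user", "content": "Generate a research question about AI trends"}]})
assert h1 != h2  # any change to the input changes the hash
```

The key property is that identical inputs always produce identical hashes, so a hash comparison is enough to tell whether a step's input changed between runs.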
4. Replay the Run
Replay uses recorded outputs. No live APIs are called.
In Python:
```python
from paprika import PaprikaRuntime

runtime = PaprikaRuntime()

# Define the same agent (unchanged code)
@runtime.agent(name="researcher")
def researcher(ctx):
    # ... same code as before ...
    pass

runtime.register_tool("search", search)

# Replay the original run
original_run_id = "abc123def456"
result = runtime.replay(run_id=original_run_id)
print(f"Replayed result: {result}")
```
During replay, `ctx.llm.call()` and `ctx.tools.call()` return cached outputs from the original run. No network calls are made, and no side effects occur.
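Conceptually, a replay engine walks the recorded trace in step order: it re-hashes each new input, checks it against the recorded hash, and serves the recorded output instead of making a live call. A minimal sketch of that mechanism (hypothetical names and trace shape, assumed for illustration only):

```python
import hashlib
import json

class ReplayMismatchError(Exception):
    def __init__(self, step_index, expected, actual):
        super().__init__(f"step {step_index}: expected {expected}, got {actual}")
        self.step_index = step_index
        self.expected = expected
        self.actual = actual

class ReplaySession:
    """Serve recorded outputs in order, verifying each input hash against the trace."""

    def __init__(self, trace):
        # trace: list of {"input_hash": str, "output": ...} recorded by the original run
        self.trace = trace
        self.step = 0

    def call(self, payload: dict):
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        actual = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        recorded = self.trace[self.step]
        if recorded["input_hash"] != actual:
            raise ReplayMismatchError(self.step, recorded["input_hash"], actual)
        self.step += 1
        return recorded["output"]  # cached output — no live call, no side effects
```

If the replayed code produces the same input at each step, every call returns the cached output; the moment an input diverges, the hash check fails and the session raises instead of silently continuing.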
5. See a Mismatch
Now change the agent code slightly:
```python
@runtime.agent(name="researcher")
def researcher(ctx):
    """CHANGED: different search query"""
    llm_response = ctx.llm.call(
        provider="mock",
        model="gpt-4o",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "Generate a research question about LLMs"  # ← CHANGED
                }
            ]
        }
    )
    # ... rest unchanged ...
```
The first LLM call now has a different input. When you replay against the old trace:
```python
try:
    result = runtime.replay(run_id="abc123def456")
except ReplayMismatchError as e:
    print(f"Mismatch at step {e.step_index}")
    print(f"Expected hash: {e.expected}")
    print(f"Actual hash: {e.actual}")
```
Output:
```
Mismatch at step 0
Expected hash: a1b2c3d4e5f6g7h8
Actual hash: y9z0a1b2c3d4e5f6
```
This is the core differentiator: Paprika detects behavioral changes. You changed the prompt → the input hash changed → Paprika caught it.
This is how you validate that code changes don't break agent behavior, before shipping.
6. Diff Two Runs
You have:
- Original run: abc123def456 (with original prompt)
- Replayed run: xyz789abc123 (with changed code)
Compare them:
```shell
paprika runs diff abc123def456 xyz789abc123
```
Output:
```
Step 0 (llm_call): MISMATCH
  Expected hash: a1b2c3d4e5f6g7h8
  Actual hash:   y9z0a1b2c3d4e5f6
Step 1 (tool_call): MATCH
  Hash: i9j0k1l2m3n4o5p6
Step 2 (llm_call): MATCH
  Hash: q7r8s9t0u1v2w3x4
```
The diff shows exactly where the two runs diverged.
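At its core, a trace diff like the one above is a step-by-step comparison of input hashes. A toy sketch of that comparison (hypothetical trace shape and hash values taken from the sample output, not Paprika's actual data model):

```python
def diff_traces(trace_a, trace_b):
    """Compare two traces step by step by input hash.

    Returns a list of (step_index, step_type, status) tuples.
    Steps beyond the shorter trace are not compared in this sketch.
    """
    report = []
    for i, (a, b) in enumerate(zip(trace_a, trace_b)):
        status = "MATCH" if a["input_hash"] == b["input_hash"] else "MISMATCH"
        report.append((i, a["step_type"], status))
    return report

run_a = [
    {"step_type": "llm_call", "input_hash": "a1b2c3d4e5f6g7h8"},
    {"step_type": "tool_call", "input_hash": "i9j0k1l2m3n4o5p6"},
]
run_b = [
    {"step_type": "llm_call", "input_hash": "y9z0a1b2c3d4e5f6"},
    {"step_type": "tool_call", "input_hash": "i9j0k1l2m3n4o5p6"},
]
for i, kind, status in diff_traces(run_a, run_b):
    print(f"Step {i} ({kind}): {status}")
```

Because hashes are cheap to compare, the diff pinpoints the first divergent step without re-executing anything.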
7. Next Steps
You now understand:
- ✓ How to write and run an agent with Paprika
- ✓ How to inspect a trace
- ✓ How to replay safely
- ✓ How mismatch detection catches behavior changes
Next topics:
- Set runtime policies to prevent runaway execution: Policies
- Understand execution records in detail: Execution Records
- Master replay and mismatch detection: Replay Engine
- Integrate with your framework: Integrations
- Use the CLI and browser UI: CLI, UI