Flow of this cookbook:
  1. Prompt Engineering: Iterate on prompts and model choices, compare multiple versions, and identify the best combination for your use case.
  2. Offline Evals: Test your agent’s performance across hundreds of user queries and scenarios, and define the quality metrics that matter most, using Maxim’s built-in or your own custom evals.
  3. Observability: Log and trace complex agentic trajectories and get real-time insight into quality, performance, and failure modes.
  4. Online Evals: Run continuous quality checks on production logs, receive proactive alerts on performance/quality degradation, and improve agents over time.
Here, we’ll walk through an example of a healthcare scribing agent that takes a doctor–patient conversation and a patient ID as input, checks the patient’s history, and generates structured clinical documentation.
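To make "structured clinical documentation" concrete, here is a minimal sketch of the kind of input and output schema such an agent might target. The field names (a `patient_id`, a raw `transcript`, SOAP-style note sections) are illustrative assumptions for this cookbook, not a schema prescribed by Maxim.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScribeRequest:
    """Input to the scribing agent: the raw conversation plus a patient identifier."""
    patient_id: str
    transcript: str  # full doctor-patient conversation text

@dataclass
class ClinicalNote:
    """Structured documentation the agent is expected to produce (SOAP-style, assumed)."""
    subjective: str   # patient-reported symptoms and history of present illness
    objective: str    # observations, vitals, and exam findings mentioned in the visit
    assessment: str   # clinician's working diagnosis or impression
    plan: str         # follow-ups, prescriptions, referrals
    medications: List[str] = field(default_factory=list)
    follow_up_required: bool = False
```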

Prompt Engineering

Building any AI product starts with experimenting with prompts, models, and parameters to identify which combination delivers the best responses for your use case. This is an iterative process that requires managing and comparing multiple versions. In this section, we will set up and run prompt experiments using the Maxim UI:
  1. Define system and user messages, and select from 1,000+ models supported across providers on the platform.
  2. Run manual iterations to refine prompts and adjust model parameters such as temperature or response format.
  3. Attach tools to your prompt as needed. In this example, we’ll attach an API-based prompt tool that retrieves a patient’s history from the database using their patient ID and uses it for clinical note generation (a sketch of such a tool follows this list).
  4. Manage multiple versions of a prompt and run side-by-side comparisons to evaluate differences in generated output.
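As a rough sketch of the API-based prompt tool from step 3, the function below fetches a patient’s history by ID so it can be injected into the prompt as tool output. The endpoint, auth header, and environment variables are hypothetical placeholders; the actual tool is configured on the Maxim platform and would point at your own patient-records service.

```python
import os
import requests

# Hypothetical patient-records service; replace with your own API.
PATIENT_API_BASE = os.getenv("PATIENT_API_BASE", "https://records.example-hospital.internal")

def get_patient_history(patient_id: str) -> dict:
    """Fetch a patient's history so it can be passed to the prompt as tool output.

    Returns a plain dict (prior visits, conditions, medications) that the
    scribing prompt can reference when generating the clinical note.
    """
    response = requests.get(
        f"{PATIENT_API_BASE}/patients/{patient_id}/history",
        headers={"Authorization": f"Bearer {os.getenv('PATIENT_API_TOKEN', '')}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```

The returned history is what the prompt uses alongside the transcript when generating the note, so prior conditions and medications can be reflected in the documentation.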
However, manual validation is difficult to scale beyond a few test cases.

Offline Evals

To build confidence in your workflows and evaluate quality at scale, we’ll set up and run automated evaluations on different versions of the prompt and evaluate their performance across multiple queries and scenarios.
  • First, we’ll create a test dataset, which in this case is a collection of doctor–patient interaction transcripts and corresponding patient IDs.
    • During test runs, each transcript is passed to the agent along with the fetched patient history. You can analyze the quality of the generated notes using Maxim’s built-in and custom evals.
  • Next, to measure key metrics of your agent’s quality, we’ll attach evaluators to our test runs. You can use Maxim’s built-in evals or create custom, domain-specific evaluators to ensure key SOPs are being followed (see the evaluator sketch after this list).
    • Use human evals to ask SMEs to score outputs directly in the report, or use Maxim’s human-annotator dashboard to request external SMEs to rate outputs and add comments or corrections without requiring access to the main platform.
  • Trigger evaluation runs to compare which agent version performs best, and make informed decisions by combining evaluation metrics with model parameters such as cost, latency, etc.
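As an illustration of a custom, domain-specific evaluator, the function below checks whether a generated note contains the SOAP sections an SOP might require. This is a hedged sketch: Maxim’s custom code evaluators have a platform-defined entry point, so treat the function signature and result format here as assumptions and adapt them to what the evaluator editor expects.

```python
REQUIRED_SECTIONS = ["subjective", "objective", "assessment", "plan"]

def soap_completeness_evaluator(output: str) -> dict:
    """Score a generated clinical note by how many required SOAP sections it contains.

    Returns a score between 0 and 1 plus a human-readable reasoning string,
    which is the kind of result a programmatic evaluator typically reports.
    """
    note = output.lower()
    missing = [s for s in REQUIRED_SECTIONS if s not in note]
    score = (len(REQUIRED_SECTIONS) - len(missing)) / len(REQUIRED_SECTIONS)
    return {
        "score": score,
        "reasoning": (
            "All required SOAP sections found."
            if not missing
            else f"Missing sections: {', '.join(missing)}"
        ),
    }
```

Deterministic checks like this pair well with built-in or LLM-as-a-judge evals and human review for the more subjective dimensions of note quality.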
The evaluation run report provides a comprehensive view of agent performance and output quality, enabling you to make metric-driven decisions by clearly highlighting the trade-offs between different versions.
  • Get a high-level summary of evaluation scores and agent performance, along with a unified comparison view that shows the input parameters, tools called, generated output, and eval scores for each test case across different versions of the agent.
  • Filter by quality metrics and dig deeper into specific test cases to analyze the generated output and evaluator reasoning — understand why a particular score was given, identify failure modes, and use those insights to improve your agents.
  • Share reports with relevant stakeholders, track quality improvement trends across test runs, and deploy the better-performing version directly from the Maxim platform.

Observability

To ensure the reliability of AI workflows, it is essential to maintain complete visibility into agent interactions and proactively detect, debug, and resolve any failure modes. Using Maxim, we can set up effective observability practices, both pre- and post-release, to:
  • Log your agent interactions at a granular level, analyze each node — retrieval, tool calls, summarization, and more — and trace the journey that leads to the final output (see the logging sketch after this list).
  • Track key performance metrics such as cost, tokens, latency, and user feedback, and monitor their trends over time through interactive charts.
  • Use the omnibar to search and filter through logs, navigate to failure cases, and dig deeper into these scenarios to understand what went wrong. Save these views to revisit them over time.
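To make the first bullet concrete, here is a rough sketch of how the scribing agent’s steps could be wrapped in trace and span records. The `Trace` and `Span` helpers below are defined locally for illustration; the Maxim SDK exposes its own logger, trace, span, and generation objects, so treat these names and metric fields as stand-ins rather than the SDK’s actual API.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Span:
    """One node in the agent's trajectory: a retrieval, tool call, or generation."""
    name: str
    started_at: float
    ended_at: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def latency_ms(self) -> float:
        return (self.ended_at - self.started_at) * 1000.0

@dataclass
class Trace:
    """One end-to-end agent run, from transcript in to clinical note out."""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    spans: List[Span] = field(default_factory=list)

    def span(self, name: str, **metadata: Any) -> Span:
        s = Span(name=name, started_at=time.time(), metadata=metadata)
        self.spans.append(s)
        return s

# Illustrative usage: wrap each agent step so latency, tokens, and tool I/O
# are attributable to a specific node in the trace.
trace = Trace()

tool_span = trace.span("fetch_patient_history", patient_id="P-1234")
# ... call the patient-history tool here ...
tool_span.ended_at = time.time()

gen_span = trace.span("generate_clinical_note", model="<your-model>")
# ... call the model here ...
gen_span.ended_at = time.time()
```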
These logs and traces serve as a structured source of insight, enabling teams to identify recurring failure modes through error analysis, uncover common themes and agent behavior patterns, and surface edge cases. These insights help establish a continuous feedback loop between production and development.

Online Evals

As production traffic scales, manually reviewing every log, trace, or span becomes impractical. Online evaluations enable continuous, automated quality checks on agent performance — scoring granular interactions, including sessions, traces, spans, generations, retrieval steps, and tool calls, against defined metrics and surfacing those that fall below quality thresholds.
  • Set up online evaluations on production logs using the same type of evals applied pre-release, giving you a holistic view of agent quality across the entire development lifecycle.
  • Configure automated alerts to receive real-time notifications whenever performance or quality metrics degrade, or when a custom error condition is triggered, ensuring the user experience is not impacted.
  • Narrow down low-scoring logs and traces and analyze them to identify root causes and recurring failure modes (see the sketch after this list).
  • Annotate these interactions, rewrite corrected outputs, and use them to curate and refine your test datasets — creating new ground truth for future agent iterations and evaluation runs.
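As a sketch of the triage loop in the last few bullets: given production traces that already carry online eval scores, you might flag those below a quality threshold, alert on them, and queue them for annotation and dataset curation. The score field names and threshold here are assumptions for illustration, not Maxim’s log schema.

```python
from typing import Dict, Iterable, List

QUALITY_THRESHOLD = 0.7  # illustrative cutoff for "needs review"

def triage_low_scoring_traces(
    scored_traces: Iterable[Dict],  # e.g. {"trace_id": "...", "scores": {"faithfulness": 0.55}}
) -> List[Dict]:
    """Collect traces whose minimum eval score falls below the threshold.

    These are the candidates to alert on, annotate, and fold back into the
    offline test dataset as new ground truth for future iterations.
    """
    flagged = []
    for trace in scored_traces:
        scores = trace.get("scores", {})
        if scores and min(scores.values()) < QUALITY_THRESHOLD:
            flagged.append(trace)
    return flagged
```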

We hope this cookbook helps you set up a repeatable process to ensure the reliability of your agent at every stage of development, enabling you to move faster on your product roadmap while keeping user trust at the core of your AI products.
Connect with the Maxim team for hands-on support in setting up evals for your production use cases.