
Braintrust


Everruns integrates with Braintrust to provide LLM observability, evaluation, and trace visualization for your agentic workflows.

  • Turn Traces Grouped by Session: Keep one trace per turn while grouping the conversation by metadata.session_id
  • Token Usage Tracking: Monitor input/output tokens and prompt cache efficiency
  • Performance Metrics: Time-to-first-token, LLM call duration, tool execution times
  • Durable-ish Delivery: Buffered batch delivery with retries for rate limits, 5xx, and timeout/connect failures
  • Privacy Controls: Raw content, thinking, tool args, and tool results are independently configurable
To get an API key:

  1. Sign up at braintrust.dev
  2. Go to Settings → API Keys
  3. Create a new API key

Set environment variables:

```sh
# Optional explicit switch
export BRAINTRUST_ENABLED=true

# Required
export BRAINTRUST_API_KEY=sk-bt-your-api-key

# Recommended: specify your project name
export BRAINTRUST_PROJECT_NAME="My Project"

# Conservative defaults
export BRAINTRUST_RECORD_CONTENT=false
export BRAINTRUST_RECORD_THINKING=none
export BRAINTRUST_TOOL_ARGS_MODE=redacted
export BRAINTRUST_TOOL_RESULTS_MODE=summary
```
| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| BRAINTRUST_ENABLED | No | enabled when API key is present | Explicit Braintrust on/off switch |
| BRAINTRUST_API_KEY | Yes | - | API key from Braintrust settings |
| BRAINTRUST_PROJECT_NAME | No | My Project | Project name for organizing traces |
| BRAINTRUST_PROJECT_ID | No | - | Direct project UUID (skips name lookup) |
| BRAINTRUST_API_URL | No | https://api.braintrust.dev | API base URL |
| BRAINTRUST_QUEUE_CAPACITY | No | 1024 | Buffered event capacity before new exports are dropped |
| BRAINTRUST_MAX_BATCH_SIZE | No | 50 | Max events per Braintrust insert call |
| BRAINTRUST_FLUSH_INTERVAL_MS | No | 500 | Max delay before a partial batch flushes |
| BRAINTRUST_REQUEST_TIMEOUT_MS | No | 10000 | Per-request timeout |
| BRAINTRUST_MAX_RETRIES | No | 3 | Retries for 429, 5xx, and timeout/connect failures |
| BRAINTRUST_RETRY_BASE_DELAY_MS | No | 250 | Initial retry backoff |
| BRAINTRUST_RETRY_MAX_DELAY_MS | No | 5000 | Retry backoff cap |
| BRAINTRUST_RECORD_CONTENT | No | false | Export raw turn and LLM text content |
| BRAINTRUST_RECORD_THINKING | No | none | Export thinking as none, summary, or full |
| BRAINTRUST_TOOL_ARGS_MODE | No | redacted | Export tool args as full, redacted, or none |
| BRAINTRUST_TOOL_RESULTS_MODE | No | summary | Export tool results as full, summary, redacted, or none |
| BRAINTRUST_DEBUG_PAYLOADS | No | false | Print full outbound Braintrust payload JSON to local debug logs |
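As a rough sketch, the defaults above can be resolved from the environment like this. This is illustrative only (a subset of the variables; load_braintrust_config and _env_bool are hypothetical names, not Everruns APIs):

```python
import os

def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean environment variable ("true"/"1"/"yes", case-insensitive)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

def load_braintrust_config() -> dict:
    """Collect exporter settings, applying the documented defaults."""
    api_key = os.environ.get("BRAINTRUST_API_KEY")
    return {
        # Enabled by default whenever an API key is present.
        "enabled": _env_bool("BRAINTRUST_ENABLED", default=api_key is not None),
        "api_key": api_key,
        "project_name": os.environ.get("BRAINTRUST_PROJECT_NAME", "My Project"),
        "api_url": os.environ.get("BRAINTRUST_API_URL", "https://api.braintrust.dev"),
        "queue_capacity": int(os.environ.get("BRAINTRUST_QUEUE_CAPACITY", "1024")),
        "max_batch_size": int(os.environ.get("BRAINTRUST_MAX_BATCH_SIZE", "50")),
        # Conservative privacy defaults.
        "record_content": _env_bool("BRAINTRUST_RECORD_CONTENT", default=False),
        "tool_args_mode": os.environ.get("BRAINTRUST_TOOL_ARGS_MODE", "redacted"),
    }
```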
To view your traces:

  1. Open the Braintrust dashboard
  2. Navigate to your project
  3. Go to Logs
  4. Group or filter by metadata.session_id to reconstruct the full session timeline across turn traces
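If you pull log rows programmatically and want the same session view in code, a minimal grouping helper might look like this (the row shape, with metadata and created fields, is an assumption based on the export format described here):

```python
from collections import defaultdict

def group_by_session(rows: list[dict]) -> dict[str, list[dict]]:
    """Group exported turn traces by metadata.session_id, oldest turn first."""
    sessions: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        sid = row.get("metadata", {}).get("session_id")
        if sid is not None:
            sessions[sid].append(row)
    for turns in sessions.values():
        # ISO-8601 timestamps sort correctly as strings.
        turns.sort(key=lambda r: r.get("created", ""))
    return dict(sessions)
```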

Each Everruns turn creates its own trace with the following structure:

```
agent turn (root span)
├── reason (iteration 1)
│   └── llm.generation (gpt-4o)
├── act (iteration 1)
│   ├── tool.call (search)
│   └── tool.call (fetch)
├── reason (iteration 2)
│   └── llm.generation (gpt-4o)
└── (no more tool calls - turn complete)
```
| Span | Type | Description |
| --- | --- | --- |
| Agent Turn | task | Root span for the entire user request |
| Reason | task | LLM reasoning phase (may iterate) |
| Act | task | Tool execution phase |
| LLM Generation | llm | Individual LLM API call |
| Tool Call | tool | Individual tool execution |

Everruns does not export one giant trace for the whole conversation.

  • Each turn remains its own Braintrust trace.
  • Every root turn span carries metadata.session_id.
  • Session lifecycle events (session.started, session.activated, session.idled) are exported as lightweight logs with the same session_id.
  • Root turn metadata also carries stable filtering fields when available, such as input_message_id, monotonic event ordering, deployment grade, session status, model/provider summary, retry info, and compaction info.

Use Braintrust grouping, timeline, or thread views on metadata.session_id to analyze the session as a whole while keeping per-turn debugging sharp.
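Reconstructing the whole session from per-turn traces and lifecycle logs can be sketched as follows. Illustrative only: metadata.event_seq is a hypothetical name for the monotonic event ordering field mentioned above.

```python
def session_timeline(turn_traces: list[dict], lifecycle_logs: list[dict]) -> list[dict]:
    """Interleave per-turn root spans and session lifecycle logs into one
    ordered view, using the monotonic event order key (hypothetical name:
    metadata.event_seq)."""
    events = turn_traces + lifecycle_logs
    return sorted(events, key=lambda e: e.get("metadata", {}).get("event_seq", 0))
```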

LLM generation spans record:

  • prompt_tokens - Input token count
  • completion_tokens - Output token count
  • cache_read_tokens - Tokens read from prompt cache (Claude)
  • cache_creation_tokens - Tokens written to prompt cache (Claude)
  • time_to_first_token - Time until first token received
  • duration_ms - Total LLM call duration

Tool call spans record:

  • status - Success/failure
  • duration_ms - Execution time
  • error - Error message (on failure)
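From the token counters above you can estimate prompt cache efficiency per call. This assumes the cache counters are reported separately from prompt_tokens, as with Anthropic's usage fields; verify against your own exported data before relying on it:

```python
def cache_efficiency(prompt_tokens: int, cache_read: int, cache_creation: int) -> float:
    """Share of total input tokens served from the prompt cache rather than
    reprocessed. Returns 0.0 when there is no input at all."""
    total = prompt_tokens + cache_read + cache_creation
    return cache_read / total if total else 0.0
```

For example, 100 uncached input tokens alongside 300 cache-read tokens gives a 75% cache hit share.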
Delivery works as follows:

  • Exports enqueue into a bounded in-memory buffer.
  • The exporter flushes batches to POST /v1/project_logs/{project_id}/insert.
  • 429, 5xx, timeout, and connect failures are retried with jittered backoff.
  • If the queue fills, new events are dropped and the exporter logs the drop counter.

This is best-effort durability, not a disk-backed queue.
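The buffering and backoff behavior can be sketched like this. This is an illustrative model of the semantics described above, not the actual exporter code:

```python
import random
from collections import deque

class BoundedBuffer:
    """Drop-newest buffer matching the best-effort semantics described above."""
    def __init__(self, capacity: int = 1024):
        self.buf: deque = deque()
        self.capacity = capacity
        self.dropped = 0  # the real exporter logs this counter

    def push(self, event: dict) -> bool:
        if len(self.buf) >= self.capacity:
            self.dropped += 1  # new events are dropped when the queue is full
            return False
        self.buf.append(event)
        return True

def backoff_ms(attempt: int, base: int = 250, cap: int = 5000) -> int:
    """Exponential backoff with full jitter, bounded by the retry delay cap
    (defaults match BRAINTRUST_RETRY_BASE_DELAY_MS / BRAINTRUST_RETRY_MAX_DELAY_MS)."""
    return random.randint(0, min(cap, base * (2 ** attempt)))
```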

The Braintrust exporter defaults to conservative content handling:

  • raw turn and LLM text are off unless BRAINTRUST_RECORD_CONTENT=true
  • when raw content is off, the exporter emits structural metadata only; it does not emit truncated prompt/completion previews
  • extended thinking is off unless BRAINTRUST_RECORD_THINKING says otherwise
  • tool arguments default to redacted
  • tool results default to summary
  • tool arg/result modes still apply inside recorded LLM input/output payloads
  • full outbound payload logging is off unless BRAINTRUST_DEBUG_PAYLOADS=true
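The tool-args modes could be applied roughly like this (illustrative sketch; Everruns' actual field names and redaction details may differ):

```python
def export_tool_args(args: dict, mode: str):
    """Apply BRAINTRUST_TOOL_ARGS_MODE: full passes values through,
    redacted keeps keys but masks values, none drops the field entirely."""
    if mode == "full":
        return args
    if mode == "redacted":
        return {k: "[redacted]" for k in args}
    return None  # mode == "none"
```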
If traces are not appearing:

  1. Check API key: Verify BRAINTRUST_API_KEY is set correctly
  2. Check project resolution: If BRAINTRUST_PROJECT_NAME does not match an existing project, startup logs will show a project resolution failure
  3. Check exporter logs: Look for rate-limit retries, timeout retries, queue drops, or permanent insert failures

If session views look incomplete:

  1. Confirm root turn spans include metadata.session_id
  2. Group Braintrust logs by metadata.session_id
  3. Check whether privacy controls removed content you expected; the default is conservative
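A quick local preflight for these checks might look like the following (braintrust_preflight is an illustrative helper, not an Everruns command):

```python
import os

def braintrust_preflight() -> list[str]:
    """Return human-readable problems with the local Braintrust configuration."""
    problems = []
    if not os.environ.get("BRAINTRUST_API_KEY"):
        problems.append("BRAINTRUST_API_KEY is not set")
    if os.environ.get("BRAINTRUST_ENABLED", "").strip().lower() == "false":
        problems.append("BRAINTRUST_ENABLED=false disables export even with an API key")
    if os.environ.get("BRAINTRUST_RECORD_CONTENT", "false").strip().lower() != "true":
        problems.append("note: raw content export is off (BRAINTRUST_RECORD_CONTENT)")
    return problems
```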