
Get system health

GET
/v1/durable/health
curl --request GET \
--url https://app.everruns.com/api/v1/durable/health
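The same request from Python, as a minimal sketch that uses only the standard library. It assumes the endpoint needs no authentication header, since the curl example sends none. Add whatever credentials your deployment requires. The helper names build_health_request and fetch_health are hypothetical.

```python
import json
import urllib.request

HEALTH_URL = "https://app.everruns.com/api/v1/durable/health"


def build_health_request(url: str = HEALTH_URL) -> urllib.request.Request:
    """Build the GET request for the health endpoint (no auth header: an assumption)."""
    return urllib.request.Request(url, method="GET", headers={"Accept": "application/json"})


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body.

    urlopen raises urllib.error.HTTPError for non-2xx responses, such as the
    Internal server error response documented on this page.
    """
    with urllib.request.urlopen(build_health_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_health()["status"] returns the aggregate status string.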

System health

Media type application/json

System health response

object
active_workers
required

Number of workers in the running state, ready to claim tasks.

integer
claimed_tasks
required

Tasks currently claimed by a worker (gauge).

integer
completed_tasks
required

Cumulative count of tasks that completed successfully (monotonic counter).

integer
completed_workflows
required

Cumulative count of workflows that completed successfully (monotonic counter).

integer
current_load
required

Total tasks currently in flight across all workers.

integer
dlq_size
required

Size of the dead-letter queue (gauge). High values indicate stuck activities.

integer
event_delivery

Event-delivery backend in use: nats for distributed deployments, in_memory for single-instance deployments. Null or absent when an older server version does not report this field.

string | null
failed_tasks
required

Cumulative count of tasks that failed terminally or were sent to the DLQ (monotonic counter).

integer
failed_workflows
required

Cumulative count of workflows that ended in failure (monotonic counter).

integer
load_percentage
required

Computed as current_load / total_capacity * 100. The value is 0.0 when no workers are registered.
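The formula with its zero-capacity guard, as a client-side sketch for recomputing the value from the other fields. It assumes that "no workers registered" shows up as total_capacity == 0, which follows from total_capacity being a sum over workers.

```python
def load_percentage(current_load: int, total_capacity: int) -> float:
    """current_load / total_capacity * 100, or 0.0 when there is no capacity."""
    if total_capacity <= 0:
        # No registered workers means total_capacity is 0: report 0.0 instead of dividing by zero.
        return 0.0
    return current_load / total_capacity * 100
```

With the values from the response example on this page, load_percentage(7, 32) gives 21.875, which matches the reported load_percentage.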

number format: double
pending_tasks
required

Tasks waiting to be claimed (gauge).

integer
pending_workflows
required

Workflows waiting to be claimed (gauge).

integer
running_workflows
required

Workflows currently executing (gauge).

integer
started_tasks
required

Cumulative count of tasks claimed at least once (monotonic counter).

integer
started_workflows
required

Cumulative count of workflows that started (monotonic counter).

integer
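Fields marked as monotonic counters only ever grow, so they are most useful as rates. Take two snapshots and divide the difference by the elapsed time. Gauges such as pending_tasks are read as-is. This is a minimal sketch. The helper name counter_rate is hypothetical, and the reset handling is an assumption rather than documented behavior: it treats a decrease as a server restart.

```python
def counter_rate(previous: int, current: int, elapsed_seconds: float) -> float:
    """Per-second rate of a monotonic counter between two health snapshots."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if current < previous:
        # A monotonic counter should never decrease. Assume the server restarted
        # and the counter began again from zero (an assumption, not documented).
        return current / elapsed_seconds
    return (current - previous) / elapsed_seconds


# Hypothetical snapshots of completed_tasks taken 60 seconds apart.
print(counter_rate(12041, 12101, 60.0))  # prints 1.0
```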
status
required

Aggregate system status: healthy, degraded, or unhealthy. Derived from worker availability, load, and queue depths.

string
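The exact derivation of status is not specified beyond the inputs listed above, so a client is safest treating the field as a three-valued enum. A hypothetical readiness probe might map it to process exit codes as follows. The mapping is an illustrative choice: degraded passes with a warning, and unknown values fail closed. So is the exit-code convention.

```python
import sys

# Hypothetical policy: which aggregate statuses count as "ready".
EXIT_CODES = {
    "healthy": 0,
    "degraded": 0,   # still serving, but worth a warning
    "unhealthy": 1,
}


def probe_exit_code(status: str) -> int:
    """Map the health status to an exit code; unknown values fail closed."""
    if status == "degraded":
        print("warning: system is degraded", file=sys.stderr)
    return EXIT_CODES.get(status, 1)
```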
total_capacity
required

Sum of max_concurrency across all workers (the upper bound on concurrent task execution).

integer
total_workers
required

Total number of registered workers, that is, workers that sent a heartbeat within the last heartbeat window.

integer
workers_accepting
required

Number of workers currently accepting new task assignments. This is a subset of active_workers that excludes workers that are draining or applying backpressure.

integer
Example
{
  "active_workers": 4,
  "claimed_tasks": 7,
  "completed_tasks": 12041,
  "completed_workflows": 4128,
  "current_load": 7,
  "dlq_size": 0,
  "event_delivery": "nats",
  "failed_tasks": 34,
  "failed_workflows": 12,
  "load_percentage": 21.875,
  "pending_tasks": 2,
  "pending_workflows": 1,
  "running_workflows": 3,
  "started_tasks": 12082,
  "started_workflows": 4144,
  "status": "healthy",
  "total_capacity": 32,
  "total_workers": 4,
  "workers_accepting": 4
}
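A typed client-side model of this response, sketched as a Python dataclass. The class name SystemHealth is hypothetical. Required fields are plain attributes, so a missing one raises TypeError. event_delivery defaults to None for older servers. Unknown keys are dropped so that new server fields do not break the client, which is a design choice rather than something the API mandates.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class SystemHealth:
    active_workers: int
    claimed_tasks: int
    completed_tasks: int
    completed_workflows: int
    current_load: int
    dlq_size: int
    failed_tasks: int
    failed_workflows: int
    load_percentage: float
    pending_tasks: int
    pending_workflows: int
    running_workflows: int
    started_tasks: int
    started_workflows: int
    status: str
    total_capacity: int
    total_workers: int
    workers_accepting: int
    # Not required: older servers omit it, so default to None.
    event_delivery: Optional[str] = None

    @classmethod
    def from_json(cls, text: str) -> "SystemHealth":
        data = json.loads(text)
        known = {f.name for f in fields(cls)}
        # Ignore keys this client does not know; a missing required key raises TypeError.
        return cls(**{k: v for k, v in data.items() if k in known})

    def check_invariants(self) -> None:
        # workers_accepting is documented as a subset of active_workers.
        if self.workers_accepting > self.active_workers:
            raise ValueError("workers_accepting exceeds active_workers")
```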

Internal server error

Media type application/json

Standard error response.

The wire shape is RFC 9457 Problem Details: every error response includes title and status, and may also include detail, code, allowed_actions, retry_after_seconds, instance, and type. Although the media type above is listed as application/json, the server rewrites the Content-Type header to application/problem+json via problem_json_content_type.
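The documented media type is application/json, but the server sends application/problem+json, so a client should accept either when decoding an error body. This is a minimal sketch. The helper name parse_problem is hypothetical, and it checks only title and status, the two members that every error response includes.

```python
import json

PROBLEM_MEDIA_TYPES = {"application/problem+json", "application/json"}


def parse_problem(content_type: str, body: bytes) -> dict:
    """Decode an RFC 9457 error body, accepting either JSON media type."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in PROBLEM_MEDIA_TYPES:
        raise ValueError(f"unexpected error media type: {media_type}")
    problem = json.loads(body)
    # title and status are always present in an error response.
    for key in ("title", "status"):
        if key not in problem:
            raise ValueError(f"problem details missing required member {key!r}")
    return problem
```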

object
allowed_actions

Recovery actions the caller can take next.

Array<object>

Agent-actionable link describing a follow-up the caller can take. Used in two contexts:

  • Error recovery: ErrorResponse.allowed_actions carries rels such as retry, retry-later, unarchive, and get-existing, so the agent knows the right next call after a 4xx response (including 429).
  • Entity hypermedia: WithUrls<T>.allowed_actions carries state-aware rels such as cancel, events, self, and update on the entity itself, so the agent can follow links instead of reconstructing routes from prose.

The shape is intentionally identical across both contexts; the closed rel vocabulary documented in specs/api-conventions.md distinguishes them.

object
hint

Short, agent-readable hint (e.g. “Shorten ‘name’ to <= 200 chars.”, “Cancel the active turn for this session.”).

string | null
href

Absolute (preferred) or relative URL the caller may invoke directly. Always present on entity hypermedia actions (WithUrls<T>.allowed_actions); optional on error-recovery actions (ErrorResponse.allowed_actions) where the matching operation_id is enough and the URI is implicit from the failed call.

string | null
method

HTTP method to use against href. Required for entity hypermedia actions; usually omitted on error-recovery actions where the same operation is retried with its original method.

string | null
operation_id

OpenAPI operationId the caller should invoke. Lets an MCP client resolve the call without parsing href.

string | null
rel
required

Link relation describing the action. Closed vocabulary documented in specs/api-conventions.md — examples: self, cancel, pause, resume, events, retry, retry-later, unarchive, get-existing, delete, update.

string
schema_ref

OpenAPI $ref to the request-body schema, when the action takes one (e.g. #/components/schemas/UpdateSessionRequest). Lets a tool-calling agent fetch the input shape without scanning the whole spec.

string | null
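A sketch of how a client might follow these actions. It looks an action up by rel, then fills in the href and method that error-recovery actions usually omit. The helper names find_action and resolve_action are hypothetical. Resolving a relative href against the URL of the request that produced it is an assumption, because the text above does not say what a relative href is relative to.

```python
from typing import Optional, Tuple
from urllib.parse import urljoin


def find_action(allowed_actions: Optional[list], rel: str) -> Optional[dict]:
    """Return the first allowed action with the given rel, or None."""
    for action in allowed_actions or []:
        if action.get("rel") == rel:
            return action
    return None


def resolve_action(action: dict, request_url: str, request_method: str) -> Tuple[str, str]:
    """Return (method, url) to invoke for an action.

    Falls back to the original request's URL and method when href or method
    is omitted, as error-recovery actions typically do.
    """
    href = action.get("href")
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    url = urljoin(request_url, href) if href else request_url
    method = action.get("method") or request_method
    return method, url
```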
code

Stable, machine-readable error code (snake_case).

string | null
detail

Human-readable explanation specific to this occurrence.

string | null
instance

Request URI for this occurrence.

string | null
retry_after_seconds

Seconds the caller should wait before retrying (429 / transient 503).

integer | null format: int32
status
required

HTTP status code; mirrors the response status line.

integer format: int32
title
required

Short, human-readable summary of the problem (e.g. “Not Found”).

string
type

RFC 9457 problem type URI. Optional; identifies the problem class.

string | null
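retry_after_seconds is the only timing signal in the error body. This minimal sketch prefers it when present. Otherwise it falls back to capped exponential backoff with full jitter. Several choices here are assumptions rather than documented server behavior: the backoff policy, its base and cap, and treating 429/503 or a retry / retry-later action as retryable. The helper name retry_delay is hypothetical.

```python
import random
from typing import Optional

RETRY_RELS = {"retry", "retry-later"}


def retry_delay(problem: dict, attempt: int, base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based), or None if not retryable."""
    hinted = problem.get("retry_after_seconds")
    if hinted is not None:
        # The server's hint wins when present.
        return float(hinted)
    rels = {action.get("rel") for action in problem.get("allowed_actions") or []}
    if problem.get("status") in (429, 503) or rels & RETRY_RELS:
        # No hint: full jitter over a capped exponential window.
        return random.uniform(0, min(cap, base * 2 ** attempt))
    return None
```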
Example (generic error response; the values are not specific to this endpoint)
{
  "allowed_actions": [
    {
      "method": "POST"
    }
  ],
  "code": "session_not_found",
  "detail": "Session session_01933b5a000070008000000000000001 not found in org org_01933b5a000070008000000000000001.",
  "instance": "/v1/sessions/session_01933b5a000070008000000000000001",
  "retry_after_seconds": 30,
  "status": 404,
  "title": "Session not found",
  "type": "https://docs.everruns.com/errors/session_not_found"
}