Error handling in Bosun tasks

Bosun's runtime surfaces step failures immediately, but you control whether those failures should stop the task. Every step honours the continue_on_error flag so you can keep the workflow moving while still recording detailed diagnostics.

Tip: When using tasks as graphs, you can also configure error handling at the edge level, which enables branching on success or failure states. See Task Graphs and Branching for details.

Default behaviour

If a step errors and continue_on_error is omitted (or set to false), execution stops at that point. Typical failure modes include:

  • agent: the agent calls the task_failed tool, instruction rendering fails, or the model session errors.
  • run: the shell command exits with a non-zero status or the executor encounters an error.
  • prompt / structured_prompt: templating fails, the model call errors, or (for structured prompts) the response cannot be validated against the schema.
  • for_each: any iteration fails while continue_on_error is disabled; in-flight work is cancelled and the error bubbles out.
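As a minimal sketch of the default, the two run steps below use only the id and run keys that appear elsewhere on this page; the step ids and npm scripts are placeholders. Because unit_tests does not set continue_on_error, a non-zero exit from npm test would stop the task before publish runs.

```yaml
steps:
  # continue_on_error is omitted, so a non-zero exit stops the task here.
  - id: unit_tests
    run: npm test

  # Only reached when unit_tests succeeds.
  - id: publish
    run: npm publish
```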

Continuing after failures

Set continue_on_error: true on any step to log the failure and continue with the rest of the task.

steps:
  - id: format_sources
    run: npm run fmt
    continue_on_error: true

  - id: notify
    prompt: |
      Formatting left {{ errors.format_sources | length }} issues.
      Details:
      {{ errors.format_sources | json_encode(pretty=true) }}

What gets recorded

When a step continues after an error:

  • The step's output is whatever the failure produced. If an agent fails, the payload from the task_failed tool (when present) becomes the step output. If a run step fails, its captured stdout and stderr fields are preserved.
  • The same object is appended to the errors collection in the template context. Access it either by step index (errors.0) or by id (errors.format_sources). Each entry is an array because a for_each step can emit multiple failures.
  • for_each adds extra metadata for each failing iteration, including the index and the rendered input.

You can use this structured data to branch on specific failure types, produce concise summaries, or feed the details into a follow-up agent.
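While designing such a follow-up, it can help to look at what was actually recorded. The sketch below reuses the format_sources step from the earlier example and prints both the error entry and the step output with json_encode. The inspect_failure id is a placeholder, the sketch assumes format_sources did fail, and the exact keys in the dump depend on the step type.

```yaml
steps:
  - id: format_sources
    run: npm run fmt
    continue_on_error: true

  # Hypothetical follow-up step: shows the recorded data to the model verbatim.
  - id: inspect_failure
    prompt: |
      Summarize what went wrong in one paragraph.
      Recorded errors:
      {{ errors.format_sources | json_encode(pretty=true) }}
      Step output:
      {{ outputs.format_sources | json_encode(pretty=true) }}
```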

Designing resilient workflows

  • Enable continue_on_error on steps where a failure should not block the workflow, then add follow-up steps that inspect errors.<step> to decide on remediation.
  • Combine templating helpers like json_encode, length, or first to present concise summaries to humans or agents.
  • For for_each, collect the failures and spin up a targeted task (or rerun the loop) with the inputs that still need work.
  • Outputs from failed steps remain available under outputs.<step>, so you can mix and match success and failure data as needed.
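The first two points above could look roughly like the following. This is a sketch rather than a recommended layout: the lint and triage ids and the npm script are placeholders, and it assumes the lint step failed so that errors.lint is populated. It uses only the length, first, and json_encode helpers named above.

```yaml
steps:
  - id: lint
    run: npm run lint
    continue_on_error: true   # a lint failure should not block the task

  # Hypothetical remediation step that receives a short failure summary.
  - id: triage
    agent:
      extends: Coding
      instructions: |
        Linting recorded {{ errors.lint | length }} failure(s).
        First failure:
        {{ errors.lint | first | json_encode(pretty=true) }}
        Fix the reported problems, then rerun the linter.
```

Passing only the first entry keeps the instructions short; swap in the full errors.lint array when the agent needs every failure.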

Structured stop/fail payloads

Agents expose two schema-aware tools:

  • stop returns whatever schema you attach with stop_schema. Without overrides the tool expects a single output string, but you can request richer data (arrays, booleans, nested objects) when needed.
  • task_failed follows the schema in fail_schema. The runtime records that JSON in the step output and in errors.<step>[i].reason, so you can branch on specific keys without parsing free-form prose.

Example: capture retry signals whenever the agent cannot complete its work.

steps:
  - id: stabilize_tests
    agent:
      extends: Coding
      instructions: "Fix the flaky tests listed in the issue."
      fail_schema:
        type: object
        required: [summary, retryable]
        properties:
          summary:
            type: string
          retryable:
            type: boolean

If the agent emits:

{
  "summary": "CI cannot reach the staging database",
  "retryable": false
}

then outputs.stabilize_tests.summary holds "CI cannot reach the staging database", and errors.stabilize_tests[0].reason.retryable is false. A follow-up prompt step can check that field and notify an operator only when manual help is required.
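One way such a follow-up step might be wired up is sketched below. It assumes that fail_schema is nested under agent, that continue_on_error is set so the task reaches the second step, and that stabilize_tests actually failed. The escalate id is a placeholder. The sketch passes the two fields to the model instead of relying on template conditionals, which this page does not describe.

```yaml
steps:
  - id: stabilize_tests
    continue_on_error: true   # keep going so the follow-up step can run
    agent:
      extends: Coding
      instructions: "Fix the flaky tests listed in the issue."
      fail_schema:
        type: object
        required: [summary, retryable]
        properties:
          summary:
            type: string
          retryable:
            type: boolean

  # Hypothetical follow-up: reads the structured failure payload by key.
  - id: escalate
    prompt: |
      The test-stabilization agent reported a failure.
      Summary: {{ errors.stabilize_tests[0].reason.summary }}
      Retryable: {{ errors.stabilize_tests[0].reason.retryable }}
      If retryable is false, draft a short message asking an operator for help.
      Otherwise, reply with the single word "retry".
```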

Bosun also preserves any successful output it had already stored before the failure, so enabling continue_on_error no longer deletes useful data.