Resilience

Resilience

Resilience

Reeboot is designed to recover gracefully from crashes, provider outages, and unexpected restarts. All resilience behaviour is configurable.


Crash Recovery

Every agent turn is recorded in the turn_journal table before it starts. If reeboot crashes mid-turn, the open journal entry survives. On restart, reeboot detects these open entries and can automatically replay the affected turns.

Recovery Modes

Configure how crashed turns are handled:

{
  "resilience": {
    "recovery": {
      "mode": "safe_only",
      "side_effect_tools": ["send_email", "send_sms"]
    }
  }
}
ModeBehaviour
"safe_only"Automatically replay turns that used no side-effect tools. Notify the user for turns that did use them.
"always"Replay all crashed turns automatically, regardless of what tools were called.
"never"Never replay crashed turns. Notify the user and let them decide.

The side_effect_tools list identifies tools that should not be replayed automatically (e.g. tools that send messages, charge payments, or modify external state). Add any MCP tool names that have external side effects.


Outage Detection

When reeboot receives consecutive provider errors (rate limits, API failures, connectivity issues), it declares a provider outage after outage_threshold failures.

During an outage:

  • New turns are queued rather than executed
  • Reeboot probes the provider at probe_interval to detect recovery
  • When the provider responds, reeboot exits outage mode and processes the queue
{
  "resilience": {
    "outage_threshold": 3,
    "probe_interval": "1h"
  }
}

Scheduler Catchup

When reeboot restarts, it checks for scheduled tasks that were missed during downtime. Tasks within the catchup_window are replayed immediately. Tasks older than the window are skipped (their next natural run is scheduled instead).

{
  "resilience": {
    "scheduler": {
      "catchup_window": "1h"
    }
  }
}

Individual tasks can override the global catchup behaviour by setting a catchup field when created:

  • "always" — always fire regardless of age
  • "never" — never catch up; advance to next natural run

Restart Notification

When reeboot starts and detects a previous run (via a reeboot_state marker in the database), it sends a notification to the default channel announcing the restart. This lets you know if the agent went down unexpectedly.

On first startup, no notification is sent (no previous run marker exists).


Configuration Reference

FieldTypeDefaultDescription
resilience.recovery.modestring"safe_only"Crash recovery mode: "safe_only", "always", or "never".
resilience.recovery.side_effect_toolsstring[][]Tool names considered unsafe to replay automatically.
resilience.outage_thresholdnumber3Consecutive failures before an outage is declared.
resilience.probe_intervalstring"1h"How often to probe the provider during an active outage.
resilience.scheduler.catchup_windowstring"1h"How far back to look for missed scheduled tasks on restart.