Operations & reliability
JARAI’s pipeline is asynchronous and self-healing: transient failures retry, failing providers are routed around, and permanent failures are surfaced as alerts and parked on a dead-letter queue for an operator. This page covers the tools that keep productions flowing and tell you when something needs a human.
- Dashboards → Alert operations — open alerts by type and severity.
- Settings → DLQ management — messages that failed permanently; inspect and resubmit.
- Production recovery — productions stuck mid-pipeline are detected and re-driven automatically.
- Provider health — failing providers are routed around so productions keep moving.
Retry vs dead-letter (how failures are classified)
Every pipeline step classifies its failures:
- Retriable (transient: SQL pool exhaustion, provider 5xx, lock lost, blob timeout): the message returns to the queue and the step status becomes Retrying. No action needed.
- Non-retriable (permanent: contract violation, missing required record, retries exhausted): the message dead-letters immediately and the step status becomes Failed, carrying a structured error envelope (error class/code, function, brief/step, correlation id).
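The classification above can be sketched roughly as follows. This is an illustration only, not JARAI's implementation: the error codes, the `max_attempts` default, the error class names, and the names `classify_failure`, `Outcome`, and `ErrorEnvelope` are all hypothetical. Only the Retrying/Failed statuses and the envelope fields come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StepStatus(Enum):
    RETRYING = "Retrying"
    FAILED = "Failed"


# Hypothetical codes standing in for the transient causes listed above.
TRANSIENT_CODES = {"SQL_POOL_EXHAUSTED", "PROVIDER_5XX", "LOCK_LOST", "BLOB_TIMEOUT"}


@dataclass(frozen=True)
class ErrorEnvelope:
    error_class: str
    error_code: str
    function: str
    step: str
    correlation_id: str


@dataclass(frozen=True)
class Outcome:
    status: StepStatus
    dead_letter: bool
    envelope: Optional[ErrorEnvelope] = None


def classify_failure(error_code, function, step, correlation_id,
                     attempt, max_attempts=5):
    """Decide whether a failed step retries or dead-letters (assumed rules)."""
    transient = error_code in TRANSIENT_CODES
    if transient and attempt < max_attempts:
        # Transient failure with attempts left: back to the queue.
        return Outcome(StepStatus.RETRYING, dead_letter=False)
    # Permanent failure, or a transient one that has run out of retries.
    error_class = "RetriesExhausted" if transient else "Permanent"
    envelope = ErrorEnvelope(error_class, error_code, function, step, correlation_id)
    return Outcome(StepStatus.FAILED, dead_letter=True, envelope=envelope)
```

For example, `classify_failure("PROVIDER_5XX", "RenderFunction", "render", "c-1", attempt=1)` yields a Retrying outcome with no envelope, while an unrecognised code such as `"CONTRACT_VIOLATION"` dead-letters on the first attempt.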
Alerts & escalation
- The Alert operations dashboard (Dashboards → Alert operations) shows open alerts grouped by type and severity, sourced from the AlertQueue.
- The AlertDispatcher routes alerts to their channels; the EscalationFunction raises severity for alerts left unacknowledged past their SLA; a periodic digest summarises activity.
- Acknowledge an alert once you’ve actioned it so escalation stops.
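To make the escalation rule concrete, here is a minimal sketch of how an unacknowledged alert might be bumped one severity level once its SLA has passed. The severity ladder, the `Alert` fields, and the `escalate` function are assumptions for illustration; this page does not specify the real severity names or SLA values.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical severity ladder, lowest to highest.
SEVERITIES = ["Low", "Medium", "High", "Critical"]


@dataclass(frozen=True)
class Alert:
    alert_type: str
    severity: str
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None


def escalate(alert: Alert, now: datetime, sla: timedelta) -> Alert:
    """Return the alert with severity raised one level if it is overdue."""
    if alert.acknowledged_at is not None:
        return alert  # acknowledged alerts stop escalating
    if now - alert.raised_at <= sla:
        return alert  # still within its SLA
    current = SEVERITIES.index(alert.severity)
    bumped = min(current + 1, len(SEVERITIES) - 1)  # cap at the top level
    return replace(alert, severity=SEVERITIES[bumped])
```

In this sketch, acknowledging an alert is what stops escalation, which matches the advice above to acknowledge alerts once they have been actioned.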
Dead-letter queue (DLQ)
Settings → DLQ management lists permanently failed messages. For each message, you can inspect the error envelope and resubmit it once the root cause is fixed.
- Read the error — the envelope tells you the error class/code, function, and the production/step it belongs to.
- Fix the cause — e.g. a missing credential, a bad template/contract, or a provider misconfiguration.
- Resubmit — the message re-enters its topic and the step re-runs from a clean state (idempotency guards prevent double-processing of already-complete steps).
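The resubmit step and the idempotency guard it relies on can be illustrated with a small in-memory sketch. The list-based queue, the message keys (`production_id`, `step`), and the function names are hypothetical; the real DLQ and topics are managed through Settings → DLQ management.

```python
def resubmit(dlq, topic, index):
    """Move one dead-lettered message back onto its topic and return it."""
    message = dlq.pop(index)
    topic.append(message)
    return message


def handle(message, completed, run_step):
    """Run a step unless it is already recorded as complete."""
    key = (message["production_id"], message["step"])
    if key in completed:
        # Idempotency guard: a duplicate delivery does no work.
        return "skipped"
    run_step(message)
    completed.add(key)
    return "processed"
```

Under these assumptions, resubmitting the same message twice is harmless: the second delivery finds the step in `completed` and is skipped instead of running again.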
Stuck-production recovery
The ProductionRecoveryFunction periodically detects productions that have stalled mid-pipeline (e.g. a message lost to an infrastructure blip) and re-drives them. Combined with idempotency guards on every step, this means most stalls self-resolve without an operator touching them.
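As a rough illustration of stall detection, the sketch below flags productions that are still in progress but have made no progress for longer than a threshold. The record fields (`id`, `status`, `last_progress_at`), the `"InProgress"` status value, and the threshold are assumptions; this page does not describe how the ProductionRecoveryFunction decides that a production has stalled.

```python
from datetime import datetime, timedelta


def find_stalled(productions, now, threshold):
    """Return ids of in-progress productions idle for longer than threshold."""
    return [
        p["id"]
        for p in productions
        if p["status"] == "InProgress"
        and now - p["last_progress_at"] > threshold
    ]
```

Each id returned would then be re-driven by re-publishing its next step. Because every step is guarded for idempotency, re-driving a production that was only slow, not stalled, should not repeat completed work.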
Provider health & quality
- ProviderHealthMonitor maintains the Healthy/Degraded/Failing/Suspended badge used by the model-selection chain (see AI providers & models).
- RapidQualityCheck compares each step’s quality score against the account’s quality floor and flags regressions, so quality dips surface as alerts rather than silently shipping.
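The two checks above might look something like this in outline. This is a sketch under stated assumptions: the page confirms the four badge names and that failing providers are routed around, but the rule that Degraded providers remain usable as a fallback, the numeric scores, and the function names `pick_provider` and `below_floor` are hypothetical.

```python
# Assumed preference order; Failing and Suspended providers are never picked.
HEALTH_RANK = {"Healthy": 0, "Degraded": 1}


def pick_provider(chain, health):
    """Pick the best usable provider, keeping chain order among equals."""
    usable = [p for p in chain if health.get(p) in HEALTH_RANK]
    if not usable:
        return None  # nothing usable: the step cannot be routed
    # min() returns the first minimal item, so earlier chain entries win ties.
    return min(usable, key=lambda p: HEALTH_RANK[health[p]])


def below_floor(step_scores, quality_floor):
    """Return the steps whose quality score is under the account's floor."""
    return [step for step, score in step_scores.items() if score < quality_floor]
```

With a chain of `["vendor_a", "vendor_b", "vendor_c"]` where `vendor_a` is Failing and `vendor_b` is Degraded, `pick_provider` returns `vendor_c` if it is Healthy, and falls back to `vendor_b` otherwise.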
Where to look when something’s wrong
| Symptom | Look here |
|---|---|
| A production is “Failed” | Production detail → the failed step’s error; resubmit via DLQ after fixing |
| Many failures from one vendor | Provider health badge; the chain should already be routing around it |
| Productions not starting | Budget/concurrency gates (account limits) and provider health |
| Quality regressions | RapidQualityCheck alerts on the Alert operations dashboard |
© 2026 JARAI STUDIO Ltd. All rights reserved.