Failure Modes
Catalog of architecture-level failure modes across DALP components, documenting degradation behavior, detection mechanisms, and recovery strategies for each failure scenario.
Purpose
Catalogs architecture-level failure modes, their impact, and how DALP components degrade and recover.
- Doc type: Reference
- What you'll find here:
  - Failure modes per component layer
  - Degradation behavior (fail-open vs fail-closed)
  - Detection and recovery mechanisms
  - Impact on user-facing operations
- Related:
  - Observability: monitoring and alerting
  - Database: data layer resilience
  - Signing Flow: transaction durability
Failure mode catalog
Blockchain layer
| Failure | Impact | Behavior | Recovery |
|---|---|---|---|
| RPC node unreachable | Reads and writes through that node fail | Chain Gateway fails over to next node in pool | Automatic (health check + failover) |
| All RPC nodes down | Complete blockchain outage | Transactions queue in Restate; reads return stale data | Manual: restore node connectivity |
| Block reorg | Indexed data may reflect reverted transactions | Reorg detection via block hash comparison (infrastructure in place, not yet active) | Future: automatic rollback and reprocess |
| Gas price spike | Transaction submission may fail or be slow | Automatic gas estimation; transactions retry with updated gas | Automatic retry via nonce manager |
| Nonce conflict | Transaction rejected by network | Nonce manager queues and reorders; Restate retries | Automatic |
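
The failover behavior in the first two rows can be illustrated with a minimal sketch. It is not the Chain Gateway implementation: `call_with_failover`, `AllNodesDown`, and the node names are hypothetical, and the sketch assumes an unreachable node surfaces as a `ConnectionError`.

```python
from typing import Callable, Sequence


class AllNodesDown(Exception):
    """Raised when every node in the pool failed (the 'All RPC nodes down' row)."""


def call_with_failover(nodes: Sequence[str], request: Callable[[str], object]) -> object:
    """Try each node in pool order and return the first successful response."""
    errors = []
    for url in nodes:
        try:
            return request(url)
        except ConnectionError as exc:
            # Assumption: an unreachable node raises ConnectionError.
            errors.append((url, exc))
    # No node answered; the caller decides whether to queue or serve stale data.
    raise AllNodesDown(errors)


def fake_request(url: str) -> str:
    """Hypothetical transport: 'node-a' is down, every other node answers."""
    if url == "node-a":
        raise ConnectionError("unreachable")
    return f"ok from {url}"


result = call_with_failover(["node-a", "node-b"], fake_request)
```

A real gateway would also use health checks to skip nodes that are already known to be down, rather than probing them on every call.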
Execution engine layer
| Failure | Impact | Behavior | Recovery |
|---|---|---|---|
| Restate server crash | In-flight workflows pause | Journaled steps preserved; automatic resume on restart | Automatic (Restate journal replay) |
| Workflow step failure | Single step in multi-step workflow fails | Restate retries with configurable backoff | Automatic retry; manual intervention if retries exhausted |
| Database connection lost | Cannot checkpoint or read state | Restate retries database operations | Automatic retry; manual if persistent |
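
Retry with backoff, as listed for workflow step failures, follows a common pattern that can be sketched generically. This does not use the Restate SDK; `run_with_retry`, `RetriesExhausted`, and the default values are hypothetical, chosen only to show where "manual intervention if retries exhausted" comes in.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised after the last attempt fails; this is where an operator steps in."""


def run_with_retry(
    step: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run a step, waiting exponentially longer between failed attempts."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == max_attempts:
                raise RetriesExhausted(f"step failed after {attempt} attempts") from exc
            sleep(delay)
            delay *= factor
    raise AssertionError("unreachable when max_attempts >= 1")


# Usage: a hypothetical step that fails twice, then succeeds.
calls = {"n": 0}
waits = []


def flaky_step() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"


outcome = run_with_retry(flaky_step, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A durable execution engine additionally journals completed steps, so a retry does not repeat work that already succeeded.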
Indexer layer
| Failure | Impact | Behavior | Recovery |
|---|---|---|---|
| Indexer crash during block processing | Gap in indexed data | Resume from last checkpoint; idempotent event processing | Automatic (checkpoint-based) |
| Event handler failure | Single event type not processed | processedEvents table prevents duplicate processing on retry | Automatic retry |
| RPC rate limit exceeded | Indexer sync slows | Batch size and concurrency respect configured limits | Automatic backoff |
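
Checkpoint-based resume combined with idempotent event processing can be sketched as follows. This is an in-memory illustration, not the indexer's code: `IndexerState`, `process_block`, and `resume` are hypothetical, the `processed_events` set stands in for the processedEvents table, and events are assumed to carry a unique ID.

```python
from dataclasses import dataclass, field


@dataclass
class IndexerState:
    checkpoint: int = 0  # last block that was fully processed
    processed_events: set = field(default_factory=set)  # stand-in for processedEvents
    balances: dict = field(default_factory=dict)


def process_block(state: IndexerState, block_number: int, events: list) -> None:
    """Apply each event at most once, then advance the checkpoint."""
    for event_id, account, amount in events:
        if event_id in state.processed_events:
            continue  # already applied before a crash; skip to stay idempotent
        state.balances[account] = state.balances.get(account, 0) + amount
        state.processed_events.add(event_id)
    state.checkpoint = block_number


def resume(state: IndexerState, chain: dict) -> None:
    """Reprocess every block after the last checkpoint."""
    for block_number in sorted(chain):
        if block_number > state.checkpoint:
            process_block(state, block_number, chain[block_number])


# Usage: simulate a crash halfway through block 2, then resume.
chain = {
    1: [("e1", "alice", 10)],
    2: [("e2", "alice", 5), ("e3", "bob", 7)],
}
state = IndexerState()
process_block(state, 1, chain[1])
state.balances["alice"] += 5       # event e2 applied...
state.processed_events.add("e2")   # ...and recorded, but the checkpoint was not advanced
resume(state, chain)
```

After the resume, `e2` is skipped and only `e3` is applied. In a database-backed indexer, applying the event and recording its ID would need to happen in the same transaction for this guarantee to hold.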
API layer
| Failure | Impact | Behavior | Recovery |
|---|---|---|---|
| API server crash | Requests fail | Load balancer routes to healthy instances | Automatic (horizontal scaling) |
| Authentication service down | No new sessions | Existing sessions continue (cached); new logins fail | Restart auth service |
| Database unreachable | API returns errors | Fail-closed: operations that require data return 503 | Restore database connection |
Custody layer
| Failure | Impact | Behavior | Recovery |
|---|---|---|---|
| Custody provider unreachable | Cannot sign transactions | Transactions queue in Restate; retry on availability | Automatic retry |
| Policy engine blocks transaction | Transaction held for approval | DALP surfaces pending approval in operator interface | Manual approval or policy adjustment |
| MPC signing timeout | Transaction delayed | Restate retries signing request | Automatic retry |
Degradation philosophy
DALP fails closed for security-sensitive operations:
- Compliance checks that cannot complete → transfer blocked rather than allowed by default
- Authentication failures → access denied
- Signing failures → transaction queued, not skipped
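
The fail-closed rule for compliance checks can be expressed in a few lines. This is a sketch of the principle only; `is_transfer_allowed` and the check callables are hypothetical and do not reflect DALP's compliance API.

```python
from typing import Callable


def is_transfer_allowed(check: Callable[[], bool]) -> bool:
    """Allow a transfer only when the check completes and explicitly returns True."""
    try:
        return check() is True
    except Exception:
        # Fail closed: a check that cannot complete is treated as a denial.
        return False


def unavailable_check() -> bool:
    """Hypothetical check whose backing service is unreachable."""
    raise TimeoutError("compliance service did not respond")


allowed_when_passing = is_transfer_allowed(lambda: True)
allowed_when_failing = is_transfer_allowed(lambda: False)
allowed_when_unavailable = is_transfer_allowed(unavailable_check)
```

A fail-open variant would return `True` in the `except` branch, which is exactly what this design avoids for security-sensitive paths.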
Read-only operations degrade gracefully:
- Stale indexed data is served with freshness indicators
- Cached API responses continue serving during database outages
- Dashboard shows last-known state with staleness warnings
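
Graceful degradation for reads can be sketched as a reader that falls back to its last successful result and labels it as stale. The names here (`CachedReader`, `ReadResult`) are hypothetical, and the sketch assumes an outage surfaces as a `ConnectionError`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReadResult:
    value: object
    age_seconds: float  # freshness indicator for the caller or UI
    stale: bool


class CachedReader:
    """Serve live data when possible, otherwise the last-known value marked stale."""

    def __init__(self, fetch: Callable[[], object], clock: Callable[[], float] = time.time):
        self._fetch = fetch
        self._clock = clock
        self._last_value: Optional[object] = None
        self._last_fetched_at: Optional[float] = None

    def read(self) -> ReadResult:
        now = self._clock()
        try:
            value = self._fetch()
        except ConnectionError:
            if self._last_fetched_at is None:
                raise  # nothing cached yet, so there is no last-known state to serve
            return ReadResult(self._last_value, now - self._last_fetched_at, stale=True)
        self._last_value, self._last_fetched_at = value, now
        return ReadResult(value, 0.0, stale=False)


# Usage with a fake clock and a data source that goes down after the first read.
now = {"t": 100.0}
source_up = {"ok": True}


def fetch_balance() -> dict:
    if not source_up["ok"]:
        raise ConnectionError("database unreachable")
    return {"balance": 42}


reader = CachedReader(fetch_balance, clock=lambda: now["t"])
fresh = reader.read()
source_up["ok"] = False
now["t"] = 160.0
degraded = reader.read()
```

Returning the age alongside the value lets the caller decide how to present the staleness warning instead of hiding the outage.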
See also
- Observability for detection and alerting
- Signing Flow for transaction durability guarantees
- Chain Gateway for blockchain failover