# Failure Modes

Source: https://docs.settlemint.com/docs/architecture/operability/failure-modes

Architecture-level failure-mode reference for DALP deployments, covering
how platform components degrade, what operators can detect, and which
recovery path applies when dependencies are unavailable.


DALP failure handling separates three questions you need to answer during an incident: what is affected, whether the platform can continue safely, and who must restore the dependency. Security-sensitive work fails closed when required checks or signatures cannot complete. Durable workflows, idempotent processing, and RPC failover limit the blast radius where the platform has enough state to retry safely.

Use this page as an architecture reference, not as an incident runbook. The catalog helps you connect alerts, logs, workflow state, and high availability plans to the affected component.

Related pages:

* [Observability](/docs/architecture/operability/observability) for metrics, logs, traces, dashboards, and alerts.
* [Database](/docs/architecture/operability/database) for PostgreSQL persistence, backups, and restore planning.
* [High availability](/docs/architecture/self-hosting/high-availability) for deployment patterns, RTO, RPO, and recovery drills.
* [Signing flow](/docs/architecture/flows/signing-flow) for transaction durability and retry behaviour.

## Failure response model [#failure-response-model]

DALP uses four recovery responses depending on the affected layer and whether continuing would be safe:

| Response        | When it applies                                                                                                                   | Operator expectation                                                                   |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| Fail over       | A configured equivalent dependency can handle the request, such as another RPC endpoint or healthy application instance           | Confirm the healthy target is serving traffic and investigate the failed dependency    |
| Retry           | The platform has enough durable state to repeat a workflow step, signing request, transaction submission, or event handler safely | Monitor retry exhaustion, dependency recovery, and duplicate-suppression evidence      |
| Fail closed     | A required identity, compliance, authentication, signing, or data check cannot finish safely                                      | Treat the blocked request as protective behaviour, not a successful business operation |
| Manual recovery | The dependency, policy approval, database, or operating environment must be restored outside the affected workflow                | Follow the deployment runbook and verify state before resuming normal operations       |

<Mermaid
  chart="`flowchart TD
  incident[Component or dependency failure]
  detect[Detect through telemetry\nlogs, metrics, traces, dashboards, alerts]
  classify[Classify affected layer\nchain, workflow, indexer, API, custody, database]
  safe{Can DALP continue safely?}
  failover[Fail over to healthy dependency]
  retry[Retry from durable state]
  closed[Fail closed and block unsafe work]
  manual[Manual recovery by operator or external provider]
  evidence[Verify recovery evidence\ncurrent state, retry outcome, audit trail]

  incident --> detect --> classify --> safe
  safe -->|equivalent healthy dependency exists| failover
  safe -->|durable checkpoint exists| retry
  safe -->|required check cannot finish| closed
  safe -->|dependency must be restored| manual
  failover --> evidence
  retry --> evidence
  closed --> evidence
  manual --> evidence

`"
/>
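
The decision node in the chart can be read as a small classification function. The sketch below is illustrative only: the `Incident` fields and the order in which the conditions are checked are assumptions made for this example, not documented DALP behaviour.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    FAIL_OVER = "fail over"
    RETRY = "retry"
    FAIL_CLOSED = "fail closed"
    MANUAL_RECOVERY = "manual recovery"


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields that mirror the edge labels in the chart.
    required_check_blocked: bool = False
    healthy_equivalent_exists: bool = False
    durable_checkpoint_exists: bool = False


def classify(incident: Incident) -> Response:
    # Assumption: protective blocking is evaluated first, so a blocked
    # check is never masked by an available failover or retry path.
    if incident.required_check_blocked:
        return Response.FAIL_CLOSED
    if incident.healthy_equivalent_exists:
        return Response.FAIL_OVER
    if incident.durable_checkpoint_exists:
        return Response.RETRY
    # Nothing inside the platform can absorb the failure.
    return Response.MANUAL_RECOVERY
```

For example, `classify(Incident(durable_checkpoint_exists=True))` returns `Response.RETRY`, while an incident with no flags set falls through to manual recovery.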

## Failure mode catalog [#failure-mode-catalog]

### Blockchain layer [#blockchain-layer]

| Failure                                           | User-facing impact                                         | DALP behaviour                                                                                       | Recovery path                                                             |
| ------------------------------------------------- | ---------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| One RPC endpoint is unreachable                   | Chain reads or writes may slow down                        | The network transport can use configured fallback RPC URLs with retry and backoff                    | Automatic failover when another configured endpoint is healthy            |
| All configured RPC endpoints are unavailable      | New chain reads and transaction submission cannot complete | Work that depends on chain access waits or fails according to the calling workflow                   | Restore RPC connectivity, then verify queued or retried operations        |
| Block reorganisation                              | Indexed data can temporarily reflect reverted transactions | Indexer tests cover reorg handling and idempotent replay for affected event state                    | Reprocess from the corrected chain state and verify indexed data          |
| Gas price spike or transaction submission failure | Transaction confirmation may be delayed                    | Signing and submission flows estimate gas and retry failed submission steps where safe               | Automatic retry where configured; operator review if retries exhaust      |
| Nonce conflict                                    | A transaction can be rejected by the network               | The signing flow serialises and retries transaction work instead of treating the conflict as success | Retry from the signing workflow and verify the final on-chain transaction |
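
The first two rows describe fallback across configured RPC URLs with retry and backoff. A minimal sketch of that pattern follows. The function name, parameters, and exponential backoff schedule are assumptions for illustration and do not describe the actual DALP transport.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint failed in every round."""


def call_with_fallback(
    endpoints: Sequence[str],
    call: Callable[[str], T],
    rounds: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(rounds):
        # Try each configured endpoint in order before backing off.
        for url in endpoints:
            try:
                return call(url)
            except ConnectionError as exc:
                last_error = exc
        if attempt < rounds - 1:
            # Exponential backoff between rounds: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** attempt)
    raise AllEndpointsFailed(
        f"no endpoint answered after {rounds} rounds"
    ) from last_error
```

When one endpoint is down, the call succeeds against the next one without sleeping. When all endpoints are down, the caller receives `AllEndpointsFailed` and must decide whether to wait or fail, which matches the second row of the table.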

### Workflow and execution layer [#workflow-and-execution-layer]

| Failure                           | User-facing impact                             | DALP behaviour                                                                                  | Recovery path                                                                   |
| --------------------------------- | ---------------------------------------------- | ----------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| Durable workflow runtime restart  | In-flight workflow steps pause                 | Persisted workflow state lets work resume after the runtime is available again                  | Restart the runtime and verify the workflow resumes or reaches a terminal state |
| Workflow step failure             | One multi-step operation is delayed or blocked | The failed step retries according to the workflow policy and retains previously completed steps | Automatic retry first; manual intervention if the step cannot complete          |
| Workflow database connection loss | Workflow state cannot be checkpointed or read  | Workflow operations that need state cannot safely advance                                       | Restore database connectivity, then verify workflow state before retrying       |
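
Resuming from persisted state can be sketched as a runner that skips steps already recorded as complete. This is a simplified illustration: the `checkpoints` dictionary is a hypothetical stand-in for persisted workflow state, and the step names in the usage note are invented.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_workflow(steps: List[Step], checkpoints: Dict[str, bool]) -> None:
    """Run steps in order, skipping any already recorded as complete.

    `checkpoints` stands in for persisted workflow state. In a real
    deployment this would live in the workflow database, not in memory.
    """
    for name, action in steps:
        if checkpoints.get(name):
            continue  # completed before the interruption; do not repeat
        action()  # an exception here leaves the step unrecorded
        checkpoints[name] = True  # record only after the step succeeded
```

If a step named `sign` raises during the first run, a second call with the same `checkpoints` skips the earlier steps and starts again at `sign`. If the checkpoint store itself is unreachable, the runner cannot tell what already ran, which is why the third row treats database loss as a reason not to advance.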

### Indexer layer [#indexer-layer]

| Failure                               | User-facing impact                                    | DALP behaviour                                                                               | Recovery path                                                                     |
| ------------------------------------- | ----------------------------------------------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| Indexer stops during block processing | Read models and dashboards can lag behind chain state | The indexer resumes from persisted progress and avoids duplicate event effects during replay | Restart the indexer and compare indexed state with chain state                    |
| Event handler failure                 | One event family may be stale while others continue   | Handler retries keep duplicate processing from becoming a second business event              | Fix the handler or dependency, then replay and verify the affected records        |
| RPC rate limit during indexing        | Indexing slows down                                   | Network configuration includes retry backoff and rate-limit settings for log fetching        | Reduce concurrency, raise provider limits, or add capacity, then monitor catch-up |
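
Avoiding duplicate event effects during replay usually relies on a stable event identity. The sketch below assumes that the pair of transaction hash and log index identifies an event, and it uses in-memory structures in place of the indexer's persisted state. The event shape is hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: (transaction hash, log index) identifies one on-chain event.
EventKey = Tuple[str, int]
Event = Tuple[str, int, str, int]  # tx hash, log index, account, amount


def apply_events(
    events: Iterable[Event],
    balances: Dict[str, int],
    seen: Set[EventKey],
) -> None:
    for tx_hash, log_index, account, amount in events:
        key = (tx_hash, log_index)
        if key in seen:
            continue  # replayed event: its effect is already recorded
        balances[account] = balances.get(account, 0) + amount
        seen.add(key)
```

Applying the same batch twice leaves `balances` unchanged after the first pass, so a restart that replays a partially processed block does not turn one transfer into two.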

### API and application layer [#api-and-application-layer]

| Failure                                         | User-facing impact                          | DALP behaviour                                                                          | Recovery path                                                                |
| ----------------------------------------------- | ------------------------------------------- | --------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| API instance is unhealthy                       | Requests routed to that instance fail       | Readiness and health endpoints let the platform route traffic only to healthy instances | Restart or replace the unhealthy instance and confirm readiness              |
| Authentication or authorisation cannot complete | Users cannot start new protected operations | Access fails closed rather than granting unauthenticated or unauthorised access         | Restore the identity dependency and confirm the user's effective permissions |
| Database is unreachable                         | API operations that need current data fail  | Data-dependent operations return errors instead of inventing state                      | Restore database connectivity and check the affected operation again         |
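
Failing closed on authorisation means that an error while fetching permissions produces a denial, never a default grant. A minimal sketch, assuming a hypothetical `fetch_permissions` callable that represents the identity dependency:

```python
from typing import Callable, Set


def is_allowed(
    user: str,
    action: str,
    fetch_permissions: Callable[[str], Set[str]],
) -> bool:
    try:
        permissions = fetch_permissions(user)
    except Exception:
        # Identity dependency unavailable: deny rather than guess.
        return False
    return action in permissions
```

A production implementation would probably surface the outage as a distinct error so operators can tell a dependency failure from a genuine denial. The point of the sketch is only that no code path grants access when the check cannot finish.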

### Custody and signing layer [#custody-and-signing-layer]

| Failure                             | User-facing impact                                   | DALP behaviour                                                                        | Recovery path                                                                                    |
| ----------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| Custody provider is unreachable     | Transactions that require a signature cannot proceed | Signing work waits or retries; DALP does not skip the signature requirement           | Restore the provider connection and verify the pending transaction state                         |
| Custody policy blocks a transaction | The transaction remains pending or rejected          | DALP surfaces the policy state instead of bypassing the provider policy               | Approve, reject, or adjust the policy in the custody system according to the operating procedure |
| Signing timeout                     | Transaction submission is delayed                    | The signing flow can retry the signing request where the workflow has preserved state | Confirm whether a signature was produced, then retry or reconcile the transaction                |
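
The recovery path for a signing timeout checks for an existing signature before asking for a new one. The sketch below shows that ordering. Both callables and the `request_id` parameter are hypothetical and do not correspond to a specific custody provider API.

```python
from typing import Callable, Optional


def recover_signing(
    request_id: str,
    lookup_signature: Callable[[str], Optional[str]],
    request_signature: Callable[[str], str],
) -> str:
    # Step 1: confirm whether the timed-out request produced a signature.
    existing = lookup_signature(request_id)
    if existing is not None:
        return existing  # reconcile: reuse what the provider already signed
    # Step 2: nothing was produced, so repeating the request is safe.
    return request_signature(request_id)
```

Checking first keeps a slow but successful signing request from being issued a second time.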

## Degradation principles [#degradation-principles]

DALP favours protective degradation over silent continuation:

* Compliance and eligibility checks that cannot finish block the affected transfer or issuance request.
* Authentication and authorisation failures deny access instead of granting temporary privileges.
* Signing failures keep the transaction pending or failed; they do not create an unsigned shortcut.
* Read models can be stale during indexing or RPC disruption, so operators should compare freshness signals before acting on dashboards.
* External dependencies such as RPC providers, custody providers, identity providers, and database infrastructure must be restored by the deployment operator or provider owner.
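
One way to compare freshness signals before trusting a dashboard is to measure how far the indexed block trails the chain head. This is a sketch under stated assumptions: the function name, its inputs, and the default threshold of five blocks are illustrative and should be set per network.

```python
def is_read_model_fresh(
    chain_head: int,
    indexed_block: int,
    max_lag_blocks: int = 5,
) -> bool:
    lag = chain_head - indexed_block
    # A negative lag means the indexer is ahead of the observed head, for
    # example after a reorg or a lagging RPC node. Treat it as not fresh.
    return 0 <= lag <= max_lag_blocks
```

With the default threshold, `is_read_model_fresh(1_000, 997)` returns `True` and `is_read_model_fresh(1_000, 900)` returns `False`.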

## How to use this page during a review [#how-to-use-this-page-during-a-review]

1. Identify the affected layer from observability evidence.
2. Check whether the expected response is failover, retry, fail-closed blocking, or manual recovery.
3. Follow the matching operational runbook for the deployment environment.
4. Verify recovery with current telemetry, workflow state, indexed data, and audit records.
5. Use the [high availability](/docs/architecture/self-hosting/high-availability) page to compare the measured recovery against the deployment's RTO and RPO targets.
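
The final step compares measured values with targets. The sketch below uses the standard definitions: recovery time runs from outage start to service restoration, and the data-loss window runs from the last recoverable write to outage start. The function and parameter names are assumptions for illustration.

```python
from datetime import datetime, timedelta


def meets_targets(
    outage_start: datetime,
    service_restored: datetime,
    last_recoverable_write: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> bool:
    recovery_time = service_restored - outage_start
    data_loss_window = outage_start - last_recoverable_write
    # Both targets must hold for the recovery to count as within plan.
    return recovery_time <= rto and data_loss_window <= rpo
```

The timestamps would come from the evidence gathered in step 4, such as telemetry for the outage window and backup or replication records for the last recoverable write.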

## See also [#see-also]

* [Observability](/docs/architecture/operability/observability) for detection and alerting.
* [Database](/docs/architecture/operability/database) for persistence, backup, and restore evidence.
* [Execution Engine](/docs/architecture/components/infrastructure/execution-engine) for workflow durability.
* [Chain Gateway](/docs/architecture/components/infrastructure/chain-gateway) for blockchain connectivity and failover.