SettleMint

Failure Modes

Architecture-level failure-mode reference for DALP deployments, covering how platform components degrade, what operators can detect, and which recovery path applies when dependencies are unavailable.

DALP failure handling separates three questions you need to answer during an incident: what is affected, whether the platform can continue safely, and who must restore the dependency. Security-sensitive work fails closed when required checks or signatures cannot complete. Durable workflows, idempotent processing, and RPC failover limit the blast radius where the platform has enough state to retry safely.

Use this page as an architecture reference, not as an incident runbook. The catalog helps you connect alerts, logs, workflow state, and high availability plans to the affected component.

Related pages:

  • Observability for metrics, logs, traces, dashboards, and alerts.
  • Database for PostgreSQL persistence, backups, and restore planning.
  • High availability for deployment patterns, RTO, RPO, and recovery drills.
  • Signing flow for transaction durability and retry behaviour.

Failure response model

DALP uses four recovery responses depending on the affected layer and whether continuing would be safe:

| Response | When it applies | Operator expectation |
| --- | --- | --- |
| Fail over | A configured equivalent dependency can handle the request, such as another RPC endpoint or healthy application instance | Confirm the healthy target is serving traffic and investigate the failed dependency |
| Retry | The platform has enough durable state to repeat a workflow step, signing request, transaction submission, or event handler safely | Monitor retry exhaustion, dependency recovery, and duplicate-suppression evidence |
| Fail closed | A required identity, compliance, authentication, signing, or data check cannot finish safely | Treat the blocked request as protective behaviour, not a successful business operation |
| Manual recovery | The dependency, policy approval, database, or operating environment must be restored outside the affected workflow | Follow the deployment runbook and verify state before resuming normal operations |
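
The four responses can be read as a small decision procedure. The sketch below is illustrative only: the `Incident` fields, the `expected_response` function, and the order in which the conditions are checked are assumptions made for this example, not DALP data structures or documented platform logic.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    FAIL_OVER = "fail over"
    RETRY = "retry"
    FAIL_CLOSED = "fail closed"
    MANUAL_RECOVERY = "manual recovery"


@dataclass(frozen=True)
class Incident:
    # Hypothetical triage inputs gathered from alerts, logs, and workflow state.
    required_check_blocked: bool     # identity, compliance, auth, signing, or data check cannot finish
    equivalent_target_healthy: bool  # e.g. another RPC endpoint or application instance is serving
    durable_state_available: bool    # enough persisted state exists to repeat the step safely


def expected_response(incident: Incident) -> Response:
    """Map triage evidence to the response an operator should expect."""
    # Assumed ordering: a blocked safety check wins over any way of continuing.
    if incident.required_check_blocked:
        return Response.FAIL_CLOSED
    if incident.equivalent_target_healthy:
        return Response.FAIL_OVER
    if incident.durable_state_available:
        return Response.RETRY
    # Nothing inside the workflow can help; the dependency must be restored.
    return Response.MANUAL_RECOVERY
```

Checking the fail-closed condition first mirrors the principle that security-sensitive work stops when a required check cannot complete, even if a retry or failover path exists.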

Failure mode catalog

Blockchain layer

| Failure | User-facing impact | DALP behaviour | Recovery path |
| --- | --- | --- | --- |
| One RPC endpoint is unreachable | Chain reads or writes may slow down | The network transport can use configured fallback RPC URLs with retry and backoff | Automatic failover when another configured endpoint is healthy |
| All configured RPC endpoints are unavailable | New chain reads and transaction submission cannot complete | Work that depends on chain access waits or fails according to the calling workflow | Restore RPC connectivity, then verify queued or retried operations |
| Block reorganisation | Indexed data can temporarily reflect reverted transactions | Indexer tests cover reorg handling and idempotent replay for affected event state | Reprocess from the corrected chain state and verify indexed data |
| Gas price spike or transaction submission failure | Transaction confirmation may be delayed | Signing and submission flows estimate gas and retry failed submission steps where safe | Automatic retry where configured; operator review if retries exhaust |
| Nonce conflict | A transaction can be rejected by the network | The signing flow serialises and retries transaction work instead of treating the conflict as success | Retry from the signing workflow and verify the final on-chain transaction |
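
The first two rows describe fallback across configured RPC URLs with retry and backoff. A minimal sketch of that pattern follows. It is not the DALP transport: `call_with_fallback`, `AllEndpointsUnavailable`, the default attempt count, and the choice of `ConnectionError` as the failure signal are all assumptions for illustration.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsUnavailable(Exception):
    """Raised when every configured endpoint failed in every round."""


def call_with_fallback(
    endpoints: Sequence[str],
    request: Callable[[str], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each endpoint in order; back off exponentially between full rounds."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        for url in endpoints:
            try:
                return request(url)  # first healthy endpoint wins
            except ConnectionError as exc:
                last_error = exc  # remember the failure and try the next URL
        if attempt < attempts - 1:
            # Wait base_delay, then 2x, then 4x ... before the next round.
            sleep(base_delay * 2 ** attempt)
    raise AllEndpointsUnavailable(
        f"no endpoint answered after {attempts} rounds"
    ) from last_error
```

When one endpoint is down, the caller only sees extra latency. When all are down, the exception hands the decision back to the calling workflow, which matches the "waits or fails according to the calling workflow" behaviour in the table.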

Workflow and execution layer

| Failure | User-facing impact | DALP behaviour | Recovery path |
| --- | --- | --- | --- |
| Durable workflow runtime restart | In-flight workflow steps pause | Persisted workflow state lets work resume after the runtime is available again | Restart the runtime and verify the workflow resumes or reaches a terminal state |
| Workflow step failure | One multi-step operation is delayed or blocked | The failed step retries according to the workflow policy and preserves previous completed steps | Automatic retry first; manual intervention if the step cannot complete |
| Workflow database connection loss | Workflow state cannot be checkpointed or read | Workflow operations that need state cannot safely advance | Restore database connectivity, then verify workflow state before retrying |
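
The idea that persisted state preserves completed steps can be shown with a toy runner. This is a sketch under stated assumptions, not the DALP workflow runtime: `run_workflow` is a hypothetical helper, and the `store` dictionary stands in for the workflow database.

```python
from typing import Callable, MutableMapping, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_workflow(
    workflow_id: str,
    steps: Sequence[Step],
    store: MutableMapping[str, list],
) -> list:
    """Run steps in order, skipping any that the store already marks as done.

    Returns the names of the steps executed during this call.
    """
    completed = store.setdefault(workflow_id, [])
    ran = []
    for name, action in steps:
        if name in completed:
            continue  # finished before the restart or failure; do not repeat
        action()  # may raise; earlier checkpoints stay in the store
        completed.append(name)  # checkpoint only after the step succeeded
        ran.append(name)
    return ran
```

If `action()` raises, the exception propagates with the earlier steps still recorded, so a later call with the same `workflow_id` resumes at the failed step. If the store itself is unreachable, nothing can be checkpointed, which is why the table treats database loss as a reason not to advance.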

Indexer layer

| Failure | User-facing impact | DALP behaviour | Recovery path |
| --- | --- | --- | --- |
| Indexer stops during block processing | Read models and dashboards can lag behind chain state | The indexer resumes from persisted progress and avoids duplicate event effects during replay | Restart the indexer and compare indexed state with chain state |
| Event handler failure | One event family may be stale while others continue | Handler retries keep duplicate processing from becoming a second business event | Fix the handler or dependency, then replay and verify the affected records |
| RPC rate limit during indexing | Indexing slows down | Network configuration includes retry backoff and rate-limit settings for log fetching | Reduce concurrency, raise provider limits, or add capacity, then monitor catch-up |
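
Avoiding duplicate event effects during replay usually comes down to keying each event by something the chain makes unique. The sketch below assumes the transaction hash plus log index as that key; the `Event` and `ReadModel` types are hypothetical and are not the DALP indexer schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    block: int
    tx_hash: str
    log_index: int
    holder: str
    amount: int


@dataclass
class ReadModel:
    balances: dict = field(default_factory=dict)
    applied: set = field(default_factory=set)  # keys of events already applied
    last_block: int = 0                        # persisted progress marker

    def apply(self, event: Event) -> bool:
        """Apply an event once; return False if it was already applied."""
        key = (event.tx_hash, event.log_index)
        if key in self.applied:
            return False  # replayed event: no second business effect
        self.balances[event.holder] = self.balances.get(event.holder, 0) + event.amount
        self.applied.add(key)
        self.last_block = max(self.last_block, event.block)
        return True
```

Replaying the same block range after a restart then leaves balances unchanged, which is the property an operator checks when comparing indexed state with chain state.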

API and application layer

| Failure | User-facing impact | DALP behaviour | Recovery path |
| --- | --- | --- | --- |
| API instance is unhealthy | Requests routed to that instance fail | Readiness and health endpoints let the platform route traffic only to healthy instances | Restart or replace the unhealthy instance and confirm readiness |
| Authentication or authorisation cannot complete | Users cannot start new protected operations | Access fails closed rather than granting unauthenticated or unauthorised access | Restore the identity dependency and confirm the user's effective permissions |
| Database is unreachable | API operations that need current data fail | Data-dependent operations return errors instead of inventing state | Restore database connectivity and check the affected operation again |
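
Failing closed on authorisation means an unanswered permission lookup is treated the same as a denial. A minimal sketch, assuming a hypothetical `lookup` callable and `IdentityUnavailable` error rather than any real DALP API:

```python
from typing import Callable, Set


class IdentityUnavailable(Exception):
    """The identity dependency could not answer (hypothetical error type)."""


def is_allowed(user: str, action: str, lookup: Callable[[str], Set[str]]) -> bool:
    """Return True only when the lookup succeeds and grants the action."""
    try:
        permissions = lookup(user)
    except IdentityUnavailable:
        # Fail closed: no answer means no access, never a temporary grant.
        return False
    return action in permissions
```

The important detail is what the function does not do: it never falls back to cached or default permissions when the dependency is down.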

Custody and signing layer

| Failure | User-facing impact | DALP behaviour | Recovery path |
| --- | --- | --- | --- |
| Custody provider is unreachable | Transactions that require a signature cannot proceed | Signing work waits or retries; DALP does not skip the signature requirement | Restore the provider connection and verify the pending transaction state |
| Custody policy blocks a transaction | The transaction remains pending or rejected | DALP surfaces the policy state instead of bypassing the provider policy | Approve, reject, or adjust the policy in the custody system according to the operating procedure |
| Signing timeout | Transaction submission is delayed | The signing flow can retry the signing request where the workflow has preserved state | Confirm whether a signature was produced, then retry or reconcile the transaction |
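
The signing-timeout recovery path, "confirm whether a signature was produced, then retry or reconcile", can be sketched as a check-before-retry loop. The function and its callables (`find_signature`, `request_signature`) are hypothetical stand-ins for a custody provider client, and the use of `TimeoutError` as the timeout signal is an assumption.

```python
from typing import Callable, Optional


def sign_with_reconcile(
    request_id: str,
    find_signature: Callable[[str], Optional[str]],
    request_signature: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Request a signature, checking for an existing one before each attempt."""
    for _ in range(max_attempts):
        existing = find_signature(request_id)
        if existing is not None:
            return existing  # an earlier attempt succeeded but its reply was lost
        try:
            return request_signature(request_id)
        except TimeoutError:
            continue  # reconcile on the next loop iteration before re-requesting
    # Final reconciliation after the last timed-out attempt.
    existing = find_signature(request_id)
    if existing is not None:
        return existing
    raise TimeoutError(f"no signature for {request_id} after {max_attempts} attempts")
```

Looking up the request before re-sending avoids asking the provider to sign the same transaction twice when only the response was lost.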

Degradation principles

DALP favours protective degradation over silent continuation:

  • Compliance and eligibility checks that cannot finish block the affected transfer or issuance request.
  • Authentication and authorisation failures deny access instead of granting temporary privileges.
  • Signing failures keep the transaction pending or failed; they do not create an unsigned shortcut.
  • Read models can be stale during indexing or RPC disruption, so operators should compare freshness signals before acting on dashboards.
  • External dependencies such as RPC providers, custody providers, identity providers, and database infrastructure must be restored by the deployment operator or provider owner.
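
The point about stale read models can be turned into a simple freshness gate. The sketch below is an assumption-laden example: the function name, the block-based lag measure, and the default threshold of 12 blocks are illustrative choices, not DALP settings.

```python
def is_fresh(indexed_block: int, chain_head: int, max_lag_blocks: int = 12) -> bool:
    """Return True when the read model is within the allowed lag of the chain head.

    An indexed block ahead of the reported head is treated as not fresh, since it
    suggests a stale RPC answer or a reorganisation that still needs reprocessing.
    """
    lag = chain_head - indexed_block
    return 0 <= lag <= max_lag_blocks
```

An operator dashboard could show this result next to balances so that decisions are not made on data the indexer has not caught up on.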

How to use this page during a review

  1. Identify the affected layer from observability evidence.
  2. Check whether the expected response is failover, retry, fail-closed blocking, or manual recovery.
  3. Follow the matching operational runbook for the deployment environment.
  4. Verify recovery with current telemetry, workflow state, indexed data, and audit records.
  5. Use the high availability pages to compare the measured recovery against the deployment's RTO and RPO targets.
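
For the last step, the comparison against RTO and RPO targets is plain timestamp arithmetic. The sketch below uses the standard definitions (recovery time is outage start to service restored; the data-loss window is last recoverable write to outage start). The `RecoveryMeasurement` type and `meets_targets` function are hypothetical names for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryMeasurement:
    outage_start: datetime
    service_restored: datetime
    last_recoverable_write: datetime  # newest data that survived the incident


def meets_targets(m: RecoveryMeasurement, rto: timedelta, rpo: timedelta) -> dict:
    """Compare a measured recovery with the deployment's RTO and RPO targets."""
    recovery_time = m.service_restored - m.outage_start
    data_loss_window = m.outage_start - m.last_recoverable_write
    return {
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }
```

For example, a 20-minute outage with two minutes of unrecoverable writes meets a 30-minute RTO but misses a one-minute RPO.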
