SettleMint

High availability

How self-hosted DALP operators choose a high availability and disaster recovery pattern, with recovery metrics, ownership boundaries, and links to the supported deployment scenarios.

Self-hosted DALP deployments need an availability pattern before production workloads go live. Start with a cloud-native, multi-zone design when the operator can use managed Kubernetes, managed PostgreSQL, managed cache, and object storage. Move to hot-warm, hot-cold, or hot-hot only when recovery targets, geographic requirements, or cost constraints justify the heavier operating model.

The cloud-native pattern is the baseline for most self-hosted deployments. Managed services carry common failover mechanics for PostgreSQL, cache, and storage while keeping the monthly operating burden lower than self-managed cross-region patterns.

What the availability pattern covers

A high availability pattern decides how DALP keeps the application, database, cache, object storage, backups, indexing, chain access, and dependent infrastructure usable when an infrastructure component fails.

This overview is for platform operators preparing the deployment, buyers comparing operating models, and security or risk reviewers checking recovery ownership.

The pattern does not replace the operator's incident process, custody procedures, client communication plan, or post-incident validation. Before deployment, assign owners for failover, restore testing, monitoring, and post-incident checks. The selected pattern only works when those duties are staffed and tested.

For the recommended baseline, read cloud-native next.

Responsibility boundary

Availability evidence has three layers. Keep them separate when you review a production deployment.

| Layer | What it covers | What it does not cover |
| --- | --- | --- |
| Platform capability | DALP application services, transaction workflow state, chain gateway configuration, chain indexing, and recovery runbooks | A guaranteed uptime percentage by itself |
| Infrastructure dependency | Kubernetes or OpenShift, PostgreSQL, cache, object storage, RPC endpoints, custody provider access, network routing, and observability systems | Provider SLAs or contractual remedies |
| Contractual SLA | The commercial availability commitment agreed for the deployment and its supporting providers | A technical failover design or restore drill result |

DALP availability depends on the configured infrastructure and provider contracts. Do not turn a deployment pattern, an RTO target, or a successful drill into an SLA unless the contractual SLA, cloud provider SLA, custody provider SLA, and RPC provider SLA support that commitment.

The pattern has two loops. The live-service loop keeps ingress, DALP services, PostgreSQL, Redis, and object storage available during normal operation. The recovery-evidence loop proves that backups, restore access, monitoring, and owner sign-off can recover the service within the selected RTO. In the same loop, backup and replication checks validate the selected RPO.

Recovery metrics

| Metric | Meaning | How to use it |
| --- | --- | --- |
| RTO | Maximum acceptable downtime | Set the target before choosing the pattern |
| RPO | Maximum acceptable data loss | Match the target to database, cache, object storage, and logs |
| RTT | Measured recovery time after testing | Record it during restore drills and compare it with the RTO |

RTO and RPO are targets. RTT is evidence. A pattern is not production-ready until the operator has run a recovery drill, measured the restore time, validated application and data state, and accepted any gap between the target and the tested result.
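
The comparison between target and evidence can be sketched as follows. This is a minimal illustration, not part of DALP: the `DrillResult` type and all of its field names are hypothetical, and it assumes the operator records the RTO target and the measured RTT as durations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillResult:
    """One recovery drill. All field names are illustrative assumptions."""
    rto_target: timedelta        # maximum acceptable downtime (target)
    rtt_measured: timedelta      # recovery time measured during the drill (evidence)
    state_validated: bool        # application and data state checked after restore
    gap_accepted: bool = False   # operator explicitly accepted a target/result gap

    @property
    def gap(self) -> timedelta:
        """Time by which the measured recovery exceeds the target (zero if within it)."""
        return max(self.rtt_measured - self.rto_target, timedelta(0))

    def production_ready(self) -> bool:
        """Ready only if state was validated and any gap is zero or accepted."""
        if not self.state_validated:
            return False
        return self.gap == timedelta(0) or self.gap_accepted


# Example: a 15-minute RTO target and a drill that took 22 minutes.
drill = DrillResult(
    rto_target=timedelta(minutes=15),
    rtt_measured=timedelta(minutes=22),
    state_validated=True,
)
```

Here `drill.gap` is 7 minutes, so `drill.production_ready()` stays false until the operator either improves the measured time or records acceptance of the gap.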

Choose a deployment scenario

Use the scenario table as an operating model filter. The RTO and RPO ranges are planning targets for the selected pattern. The ranges become evidence only after the operator runs the matching restore or failover drill, records the achieved recovery time, and validates the data-loss window from backup age, replication lag, or restored data timestamps.

| Scenario | RTO target | RPO target | Monthly effort | Use when |
| --- | --- | --- | --- | --- |
| Cloud-native | 2 to 15 minutes | Seconds to 1 minute | 8 to 16 hours | Most self-hosted deployments |
| Hot-warm | 30 to 180 minutes | 5 to 60 minutes | 25 to 40 hours | You need geographic redundancy |
| Hot-cold | 8 to 72 hours | 4 to 24 hours | 10 to 20 hours | Cost matters more than fast recovery |
| Hot-hot for consortium networks | 1 to 10 minutes | Seconds to minutes | 40 to 60 hours | Multiple active regions share responsibility |
| Hot-hot for public networks | 1 to 10 minutes | 1 to 5 minutes | 20 to 30 hours | On-chain state can be re-derived after outage |

Start with the cloud-native pattern unless a specific requirement rules it out. Managed services carry the failover mechanics for PostgreSQL, cache, and storage.

Use the alternative patterns when the operator accepts the extra runbook, monitoring, and drill burden. Document that acceptance before production so the recovery pattern, staffing model, and contractual SLA do not drift apart.
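
Using the table as a filter could look like the sketch below. It is an assumption-laden illustration: the `Scenario` type and `candidates` function are hypothetical, only the upper end of each planning range is used, and the consortium hot-hot row is left out because its RPO range ("seconds to minutes") has no numeric upper bound in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    rto_max_minutes: int   # upper end of the RTO planning range
    rpo_max_minutes: int   # upper end of the RPO planning range


# Upper bounds copied from the scenario table above.
SCENARIOS = [
    Scenario("Cloud-native", 15, 1),
    Scenario("Hot-warm", 180, 60),
    Scenario("Hot-cold", 72 * 60, 24 * 60),
    Scenario("Hot-hot for public networks", 10, 5),
]


def candidates(rto_minutes: int, rpo_minutes: int) -> list[str]:
    """Return scenarios whose worst-case planning targets fit the required RTO and RPO."""
    return [
        s.name
        for s in SCENARIOS
        if s.rto_max_minutes <= rto_minutes and s.rpo_max_minutes <= rpo_minutes
    ]
```

For example, `candidates(15, 1)` leaves only the cloud-native pattern. The filter covers recovery targets only; geography, cost, and monthly effort still have to be weighed against the "Use when" column.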

Kubernetes high availability assumptions

Production self-hosting assumes a Kubernetes or OpenShift cluster that can keep workloads scheduled during a zone or node failure. The baseline is at least three availability zones, enough worker capacity to reschedule pods after one zone is unavailable, standard topology labels, pod disruption budgets, and a load balancer or route layer that can send traffic only to healthy pods.
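
The rescheduling requirement can be checked with simple arithmetic. The sketch below is a simplification under stated assumptions: `survives_zone_loss` is a hypothetical helper, and it treats every worker as able to host the same number of pods, whereas real scheduling depends on resource requests, taints, and affinity rules.

```python
def survives_zone_loss(
    workers_per_zone: dict[str, int],
    pods_required: int,
    pods_per_worker: int,
) -> bool:
    """True if the required pods still fit after the largest zone becomes unavailable."""
    if len(workers_per_zone) < 3:
        return False  # baseline from the text: at least three availability zones
    # Worst case: the zone holding the most workers is the one that fails.
    remaining_workers = sum(workers_per_zone.values()) - max(workers_per_zone.values())
    return remaining_workers * pods_per_worker >= pods_required
```

With two workers in each of three zones and ten pods per worker, 40 pods still fit after losing one zone, while 41 do not.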

Control-plane availability belongs to the cluster provider or the operator's Kubernetes platform team. The DALP pattern assumes the control plane remains reachable for scheduling, rollout, and failover work during an incident. If the control plane is self-managed, document the control-plane quorum, backup, restore, and upgrade procedure before production.

Failover triggers must be observable. Treat pod crash loops, node readiness loss, zone unavailability, PostgreSQL primary failover, cache primary failover, object storage access failure, RPC endpoint failure, indexer lag, and queue backlog as conditions that can start the incident runbook.
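
One way to make those conditions observable is to evaluate a snapshot of signals against explicit rules. The sketch below covers only a subset of the conditions listed above; the `Observation` fields and both thresholds are hypothetical values chosen for illustration, not DALP defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Point-in-time snapshot of signals. Field names are illustrative assumptions."""
    crash_looping_pods: int = 0
    not_ready_nodes: int = 0
    unavailable_zones: int = 0
    failed_rpc_endpoints: int = 0
    indexer_lag_blocks: int = 0
    queue_backlog: int = 0


def runbook_triggers(
    obs: Observation,
    max_indexer_lag_blocks: int = 50,   # assumed threshold
    max_queue_backlog: int = 1000,      # assumed threshold
) -> list[str]:
    """Return the names of the conditions that should start the incident runbook."""
    checks = [
        ("pod crash loops", obs.crash_looping_pods > 0),
        ("node readiness loss", obs.not_ready_nodes > 0),
        ("zone unavailability", obs.unavailable_zones > 0),
        ("RPC endpoint failure", obs.failed_rpc_endpoints > 0),
        ("indexer lag", obs.indexer_lag_blocks > max_indexer_lag_blocks),
        ("queue backlog", obs.queue_backlog > max_queue_backlog),
    ]
    return [name for name, fired in checks if fired]
```

Keeping the rules in one list makes it easy to review which conditions have a threshold and which fire on any occurrence.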

Chain access and indexing recovery

DALP uses EVM RPC access and the chain indexer as part of the availability boundary. Chain access and indexed views need separate redundancy checks because RPC endpoints and indexers fail differently from application pods.

| Area | Availability expectation | Recovery expectation |
| --- | --- | --- |
| Blockchain node or RPC access | Configure at least two reachable RPC endpoints or providers for each production network when supported by the network design. Keep provider limits, block-range limits, and authentication material documented. | Fail over reads, writes, subscriptions, and log fetching to a healthy endpoint. Validate transaction broadcast and chain reads after failover. |
| Chain Gateway | Keep network configuration, gas settings, finality depth, and fallback endpoints aligned with the selected EVM network. | Re-test transaction submission and status reads before ending the incident. |
| Chain Indexer | Monitor indexer process health, block lag, reorg handling, and per-chain indexing state. | Restart or redeploy the indexer, replay from the last safe checkpoint, and re-process affected blocks when a reorg invalidates previously indexed logs. |
| Event consumers | Treat provisional, final, retracted, and recalled events as separate operational states. | Reconcile downstream systems against the final indexed state after replay or reorg recovery. |
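
The RPC failover row can be illustrated with a small endpoint selector. This is a sketch, not DALP's Chain Gateway logic: `pick_healthy_endpoint` is a hypothetical helper, the `head_block` probe is injected by the caller (in practice it might wrap an `eth_blockNumber` JSON-RPC call and return `None` on error), and the default lag tolerance is an assumed value.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_endpoint(
    endpoints: Sequence[str],
    head_block: Callable[[str], Optional[int]],
    max_lag_blocks: int = 5,  # assumed tolerance
) -> Optional[str]:
    """Return the first endpoint, in configured order, that is reachable and near the best head."""
    reachable: dict[str, int] = {}
    for url in endpoints:
        head = head_block(url)  # None means the probe failed
        if head is not None:
            reachable[url] = head
    if not reachable:
        return None  # no healthy endpoint: escalate through the incident runbook
    best_head = max(reachable.values())
    in_sync = [
        url
        for url in endpoints
        if url in reachable and best_head - reachable[url] <= max_lag_blocks
    ]
    return in_sync[0] if in_sync else None
```

Comparing each endpoint against the best observed head guards against failing over to a provider that responds but serves stale chain data. Transaction broadcast and chain reads still need to be validated after the switch, as the table states.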

Disaster recovery region coverage

A regional disaster recovery plan needs at least two region roles: one primary service region and one recovery region. The primary region serves production traffic. The recovery region holds the infrastructure, secrets access, database restore path, object storage replication or backup access, RPC configuration, and runbooks needed to resume service.

Hot-warm is active-passive. The primary region serves traffic and the recovery region stays ready for promotion. Failover is a controlled promotion of database, cache, application, ingress, RPC, indexer, and operational ownership to the recovery region.

Hot-cold is restore-based active-passive. The recovery region may not run the full stack until an incident. The runbook must prove that backups, images, secrets, DNS or ingress, and provider access can recreate the service inside the accepted RTO and RPO.

Hot-hot is active-active. More than one region serves traffic at the same time. Use this pattern only when the operator can handle multi-cluster routing, data consistency, indexed-state reconciliation, and provider limits. The operator also needs incident ownership across active regions.

Monitoring and alerting expectations

Monitoring must cover the platform, infrastructure, and external dependencies that make the selected availability pattern work. At minimum, alert on:

  • API availability, request latency, error rate, and authentication failures
  • pod restarts, crash loops, node readiness, zone health, and ingress or route health
  • PostgreSQL failover state, replication lag, connection pressure, backup success, and restore-test age
  • cache primary health, memory pressure, persistence status, and failover state
  • object storage availability, backup write status, replication status, and restore access
  • queue backlog, worker health, transaction workflow age, and stuck execution states
  • RPC endpoint availability, chain head age, block lag, provider errors, and provider rate limiting
  • indexer process health, per-chain indexing lag, replay progress, and reorg or retraction events
  • custody or HSM provider reachability for signing-dependent workflows

Every alert needs an owner, severity, runbook link, escalation path, and test cadence. Dashboards are evidence only when alerts fire, route to the right owner, and drive a tested response.
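
A review of alert definitions against those five attributes can be automated. The sketch below assumes alerts are exported as a mapping from alert name to metadata; the key names in `REQUIRED_FIELDS` are hypothetical and would need to match whatever the monitoring system actually stores.

```python
# Attributes every alert needs, per the text above. Key names are assumptions.
REQUIRED_FIELDS = ("owner", "severity", "runbook", "escalation", "test_cadence")


def missing_alert_fields(alerts: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each alert name to the required fields that are absent or empty."""
    gaps: dict[str, list[str]] = {}
    for name, meta in alerts.items():
        missing = [field for field in REQUIRED_FIELDS if not meta.get(field)]
        if missing:
            gaps[name] = missing
    return gaps


# Example input: one complete alert and one with gaps.
example_alerts = {
    "postgres-replication-lag": {
        "owner": "db-team",
        "severity": "high",
        "runbook": "runbooks/postgres-failover",
        "escalation": "on-call rota",
        "test_cadence": "quarterly",
    },
    "indexer-lag": {"owner": "platform-team", "severity": "medium", "runbook": ""},
}
```

An empty result from `missing_alert_fields(example_alerts)` would mean every alert is fully described; here the `indexer-lag` entry is reported as missing its runbook, escalation path, and test cadence.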

Production checks before go-live

Before treating the deployment as production-ready, confirm each check below and keep the listed evidence:

| Check | Evidence to keep |
| --- | --- |
| Nodes are distributed across at least three availability zones, as required by the self-hosting prerequisites | Cluster topology, scheduling capacity, and pod disruption budget review |
| Managed PostgreSQL high availability or an equivalent PostgreSQL failover design is configured | Provider failover setting, replica status, PITR configuration, and restore-test result |
| Cache redundancy and TLS encryption are configured | Cache topology, persistence mode, certificate path, and failover-test result |
| Object storage backups and restore access are configured | Bucket policy, versioning or lifecycle rule, backup write result, and restore credential test |
| At least one backup restore test has run | Drill timestamp, restored namespace inventory, application health checks, and measured RTT |
| Monitoring alerts cover API availability, database health, cache health, queue lag, storage access, and backup status | Alert list with owner, severity, escalation path, and last test result |
| Incident owners are assigned for failover, restore, validation, and client communication | Runbook owner list and escalation rota |

Next pages

  • Buyers comparing operating models should start with cloud-native for the recommended baseline, then compare hot-warm, hot-cold, and hot-hot when geography, cost, or active-active operation changes the decision.
  • Platform operators preparing a deployment should read self-hosting prerequisites, choose the matching pattern page, and use backup and recovery to plan restore tests and recovery drills.
  • Security and risk reviewers should use the pattern page to confirm the failover boundary, then review backup and recovery for restore ownership, validation, and drill evidence.
