High availability
How self-hosted DALP operators choose a high availability and disaster recovery pattern, with recovery metrics, ownership boundaries, and links to the supported deployment scenarios.
Self-hosted DALP deployments need an availability pattern before production workloads go live. Start with a cloud-native, multi-zone design when the operator can use managed Kubernetes, managed PostgreSQL, managed cache, and object storage. Move to hot-warm, hot-cold, or hot-hot only when recovery targets, geographic requirements, or cost constraints justify a more demanding operating model.
The cloud-native pattern is the baseline for most self-hosted deployments. Managed services carry common failover mechanics for PostgreSQL, cache, and storage while keeping the monthly operating burden lower than self-managed cross-region patterns.
What the availability pattern covers
A high availability pattern decides how DALP keeps the application, database, cache, object storage, backups, indexing, chain access, and dependent infrastructure usable when an infrastructure component fails.
This overview is for platform operators preparing the deployment, buyers comparing operating models, and security or risk reviewers checking recovery ownership.
The pattern does not replace the operator's incident process, custody procedures, client communication plan, or post-incident validation. Before deployment, assign owners for failover, restore testing, monitoring, and post-incident checks. The selected pattern only works when those duties are staffed and tested.
For the recommended baseline, read cloud-native next.
Responsibility boundary
Availability evidence has three layers. Keep them separate when you review a production deployment.
| Layer | What it covers | What it does not cover |
|---|---|---|
| Platform capability | DALP application services, transaction workflow state, chain gateway configuration, chain indexing, and recovery runbooks | A guaranteed uptime percentage by itself |
| Infrastructure dependency | Kubernetes or OpenShift, PostgreSQL, cache, object storage, RPC endpoints, custody provider access, network routing, and observability systems | Provider SLAs or contractual remedies |
| Contractual SLA | The commercial availability commitment agreed for the deployment and its supporting providers | A technical failover design or restore drill result |
DALP availability depends on the configured infrastructure and provider contracts. Do not turn a deployment pattern, an RTO target, or a successful drill into an SLA unless the contractual SLA, cloud provider SLA, custody provider SLA, and RPC provider SLA support that commitment.
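One way to see why a pattern alone cannot support an SLA is to multiply the availability of each dependency. The sketch below assumes a simple series model: every dependency is required and failures are independent. Both function names are hypothetical helpers, not part of DALP, and any input values must come from the actual contracts.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Composite availability when every dependency is required.

    Assumes independent failures (series model). Real dependencies can
    fail together, so treat the result as a planning estimate only.
    """
    return prod(availabilities)


def downtime_minutes_per_30_days(availability: float) -> float:
    """Downtime budget implied by an availability fraction over 30 days."""
    return (1.0 - availability) * 30 * 24 * 60


# Illustrative fractions only; use the figures from the signed provider SLAs.
provider_slas = [0.9995, 0.999, 0.9999]
composite = serial_availability(provider_slas)
budget = downtime_minutes_per_30_days(composite)
```

Under this model the composite figure is always lower than the weakest provider SLA, so a deployment-level commitment cannot be stronger than the contracts beneath it.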
The pattern has two loops. The live-service loop keeps ingress, DALP services, PostgreSQL, Redis, and object storage available during normal operation. The recovery-evidence loop proves that backups, restore access, monitoring, and owner sign-off can recover the service within the selected RTO while backup and replication checks validate the selected RPO.
Recovery metrics
| Metric | Meaning | How to use it |
|---|---|---|
| RTO | Maximum acceptable downtime | Set the target before choosing the pattern |
| RPO | Maximum acceptable data loss | Match the target to database, cache, object storage, and logs |
| RTT | Measured recovery time after testing | Record it during restore drills and compare it with the RTO |
RTO and RPO are targets. RTT is evidence. A pattern is not production-ready until the operator has run a recovery drill, measured the restore time, validated application and data state, and accepted any gap between the target and the tested result.
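The comparison between targets and drill evidence can be sketched as a small check. The `DrillResult` shape and `drill_gaps` function are hypothetical; they only illustrate recording a drill and listing the gaps the operator must close or explicitly accept.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillResult:
    """One recovery drill as recorded by the operator (hypothetical shape)."""
    rto_target: timedelta          # agreed maximum downtime
    rpo_target: timedelta          # agreed maximum data loss
    measured_rtt: timedelta        # recovery time measured during the drill
    measured_data_loss: timedelta  # data-loss window observed after restore
    state_validated: bool          # application and data state were checked


def drill_gaps(result: DrillResult) -> list[str]:
    """Return the gaps between the targets and the tested result."""
    gaps = []
    if result.measured_rtt > result.rto_target:
        over = result.measured_rtt - result.rto_target
        gaps.append(f"RTT exceeds RTO by {over}")
    if result.measured_data_loss > result.rpo_target:
        over = result.measured_data_loss - result.rpo_target
        gaps.append(f"data loss exceeds RPO by {over}")
    if not result.state_validated:
        gaps.append("application and data state not validated")
    return gaps
```

An empty list means the drill met both targets and the restored state was validated. Any entry is a gap that needs a fix or a documented acceptance before go-live.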
Choose a deployment scenario
Use the scenario table as an operating model filter. The RTO and RPO ranges are planning targets for the selected pattern. The ranges become evidence only after the operator runs the matching restore or failover drill, records the achieved recovery time, and validates the data-loss window from backup age, replication lag, or restored data timestamps.
| Scenario | RTO target | RPO target | Monthly effort | Use when |
|---|---|---|---|---|
| Cloud-native | 2 to 15 minutes | Seconds to 1 minute | 8 to 16 hours | Most self-hosted deployments |
| Hot-warm | 30 to 180 minutes | 5 to 60 minutes | 25 to 40 hours | You need geographic redundancy |
| Hot-cold | 8 to 72 hours | 4 to 24 hours | 10 to 20 hours | Cost matters more than fast recovery |
| Hot-hot for consortium networks | 1 to 10 minutes | Seconds to minutes | 40 to 60 hours | Multiple active regions share responsibility |
| Hot-hot for public networks | 1 to 10 minutes | 1 to 5 minutes | 20 to 30 hours | On-chain state can be re-derived after outage |
Start with the cloud-native pattern unless a specific requirement rules it out. Managed services carry the failover mechanics for PostgreSQL, cache, and storage.
Use the alternative patterns when the operator accepts the extra runbook, monitoring, and drill burden. Document that acceptance before production so the recovery pattern, staffing model, and contractual SLA do not drift apart.
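As a minimal sketch of using the scenario table as a filter, the snippet below keeps only the scenarios whose upper RTO bound and upper monthly effort bound fit the operator's requirement. The numbers are the upper bounds from the table. The `Scenario` type and `candidates` function are hypothetical. RPO is left out because one row gives no numeric upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    rto_max_minutes: int   # upper bound of the RTO planning range
    effort_max_hours: int  # upper bound of the monthly effort range


# Upper bounds copied from the scenario table.
SCENARIOS = [
    Scenario("Cloud-native", 15, 16),
    Scenario("Hot-warm", 180, 40),
    Scenario("Hot-cold", 72 * 60, 20),
    Scenario("Hot-hot for consortium networks", 10, 60),
    Scenario("Hot-hot for public networks", 10, 30),
]


def candidates(required_rto_minutes: int, effort_budget_hours: int) -> list[str]:
    """Scenarios whose worst-case planning RTO and effort fit the requirement."""
    return [
        s.name
        for s in SCENARIOS
        if s.rto_max_minutes <= required_rto_minutes
        and s.effort_max_hours <= effort_budget_hours
    ]
```

The result is only a shortlist. Geography, network type, and the "Use when" column still decide the final pattern, and the ranges remain planning targets until a drill confirms them.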
Kubernetes high availability assumptions
Production self-hosting assumes a Kubernetes or OpenShift cluster that can keep workloads scheduled during a zone or node failure. The baseline is at least three availability zones, enough worker capacity to reschedule pods after one zone is unavailable, standard topology labels, pod disruption budgets, and a load balancer or route layer that can send traffic only to healthy pods.
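The capacity assumption can be checked with simple arithmetic: after removing any one zone, the remaining zones must still hold the required workload. The function below is a hypothetical sketch that treats capacity as a single number per zone (for example, allocatable CPU cores). A real review would repeat it for memory and any other constrained resource.

```python
def survives_single_zone_loss(zone_capacity: dict[str, float], required: float) -> bool:
    """True if the workload still fits after any single zone is lost.

    zone_capacity maps a zone name to its allocatable capacity.
    required is the capacity the workloads need to stay scheduled.
    """
    if len(zone_capacity) < 3:
        # The baseline calls for at least three availability zones.
        return False
    total = sum(zone_capacity.values())
    # Losing the largest zone is the worst case, but checking all is cheap.
    return all(total - lost >= required for lost in zone_capacity.values())
```

For example, three zones of 8 cores each can lose one zone and still schedule a 16-core workload, but not a 17-core one.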
Control-plane availability belongs to the cluster provider or the operator's Kubernetes platform team. The DALP pattern assumes the control plane remains reachable for scheduling, rollout, and failover work during an incident. If the control plane is self-managed, document the control-plane quorum, backup, restore, and upgrade procedure before production.
Failover triggers must be observable. Treat pod crash loops, node readiness loss, zone unavailability, PostgreSQL primary failover, cache primary failover, object storage access failure, RPC endpoint failure, indexer lag, and queue backlog as conditions that can start the incident runbook.
Chain access and indexing recovery
DALP uses EVM RPC access and the chain indexer as part of the availability boundary. Chain access and indexed views need separate redundancy checks because RPC endpoints and indexers fail differently from application pods.
| Area | Availability expectation | Recovery expectation |
|---|---|---|
| Blockchain node or RPC access | Configure at least two reachable RPC endpoints or providers for each production network when supported by the network design. Keep provider limits, block-range limits, and authentication material documented. | Fail over reads, writes, subscriptions, and log fetching to a healthy endpoint. Validate transaction broadcast and chain reads after failover. |
| Chain Gateway | Keep network configuration, gas settings, finality depth, and fallback endpoints aligned with the selected EVM network. | Re-test transaction submission and status reads before ending the incident. |
| Chain Indexer | Monitor indexer process health, block lag, reorg handling, and per-chain indexing state. | Restart or redeploy the indexer, replay from the last safe checkpoint, and re-process affected blocks when a reorg invalidates previously indexed logs. |
| Event consumers | Treat provisional, final, retracted, and recalled events as separate operational states. | Reconcile downstream systems against the final indexed state after replay or reorg recovery. |
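The RPC failover expectation can be illustrated with a selection rule: prefer the first configured endpoint that is reachable and not too far behind the highest chain head seen. This is a sketch under assumptions, not DALP's gateway logic. `EndpointStatus`, `pick_endpoint`, and the example URLs are hypothetical, and the status values would come from the operator's own health probes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndpointStatus:
    url: str
    reachable: bool
    head_block: int  # latest block number reported by the endpoint


def pick_endpoint(statuses: list[EndpointStatus], max_block_lag: int) -> Optional[str]:
    """Return the first endpoint, in configured order, that is healthy.

    Healthy means reachable and within max_block_lag blocks of the
    highest head reported by any reachable endpoint.
    """
    live = [s for s in statuses if s.reachable]
    if not live:
        return None  # nothing usable: this should start the incident runbook
    best_head = max(s.head_block for s in live)
    for s in statuses:
        if s.reachable and best_head - s.head_block <= max_block_lag:
            return s.url
    return None
```

Selecting an endpoint is only the first step. The table above still requires validating transaction broadcast and chain reads after the switch.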
Disaster recovery region coverage
A regional disaster recovery plan needs at least two region roles: one primary service region and one recovery region. The primary region serves production traffic. The recovery region holds the infrastructure, secrets access, database restore path, object storage replication or backup access, RPC configuration, and runbooks needed to resume service.
Hot-warm is active-passive. The primary region serves traffic and the recovery region stays ready for promotion. Failover is a controlled promotion of database, cache, application, ingress, RPC, indexer, and operational ownership to the recovery region.
Hot-cold is restore-based active-passive. The recovery region may not run the full stack until an incident. The runbook must prove that backups, images, secrets, DNS or ingress, and provider access can recreate the service inside the accepted RTO and RPO.
Hot-hot is active-active. More than one region serves traffic at the same time. Use this pattern only when the operator can handle multi-cluster routing, data consistency, indexed-state reconciliation, and provider limits. The operator also needs incident ownership across active regions.
Monitoring and alerting expectations
Monitoring must cover the platform, infrastructure, and external dependencies that make the selected availability pattern work. At minimum, alert on:
- API availability, request latency, error rate, and authentication failures
- pod restarts, crash loops, node readiness, zone health, and ingress or route health
- PostgreSQL failover state, replication lag, connection pressure, backup success, and restore-test age
- cache primary health, memory pressure, persistence status, and failover state
- object storage availability, backup write status, replication status, and restore access
- queue backlog, worker health, transaction workflow age, and stuck execution states
- RPC endpoint availability, chain head age, block lag, provider errors, and provider rate limiting
- indexer process health, per-chain indexing lag, replay progress, and reorg or retraction events
- custody or HSM provider reachability for signing-dependent workflows
Every alert needs an owner, severity, runbook link, escalation path, and test cadence. Dashboards are evidence only when alerts fire, route to the right owner, and drive a tested response.
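A review of alert definitions against that rule can be automated. The sketch below assumes alerts are exported as plain dictionaries. The field names are hypothetical and should be mapped to whatever the monitoring system actually stores.

```python
# Hypothetical field names for the five attributes every alert needs.
REQUIRED_FIELDS = (
    "owner",
    "severity",
    "runbook_url",
    "escalation_path",
    "test_cadence_days",
)


def missing_alert_fields(alert: dict) -> list[str]:
    """Return the required fields that are absent or empty on an alert."""
    return [field for field in REQUIRED_FIELDS if not alert.get(field)]


def incomplete_alerts(alerts: dict[str, dict]) -> dict[str, list[str]]:
    """Map each incomplete alert name to the fields it is missing."""
    report = {}
    for name, alert in alerts.items():
        missing = missing_alert_fields(alert)
        if missing:
            report[name] = missing
    return report
```

An empty report shows only that the metadata exists. It does not show that the alert fires, routes correctly, or drives a tested response.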
Production checks before go-live
Before treating the deployment as production-ready, confirm each check below and keep the listed evidence:
| Check | Evidence to keep |
|---|---|
| Nodes are distributed across at least three availability zones, as required by the self-hosting prerequisites | Cluster topology, scheduling capacity, and pod disruption budget review |
| Managed PostgreSQL high availability or an equivalent PostgreSQL failover design is configured | Provider failover setting, replica status, PITR configuration, and restore-test result |
| Cache redundancy and TLS encryption are configured | Cache topology, persistence mode, certificate path, and failover-test result |
| Object storage backups and restore access are configured | Bucket policy, versioning or lifecycle rule, backup write result, and restore credential test |
| At least one backup restore test has run | Drill timestamp, restored namespace inventory, application health checks, and measured RTT |
| Monitoring alerts cover API availability, database health, cache health, queue lag, storage access, and backup status | Alert list with owner, severity, escalation path, and last test result |
| Incident owners are assigned for failover, restore, validation, and client communication | Runbook owner list and escalation rota |
Next pages
- Buyers comparing operating models should start with cloud-native for the recommended baseline, then compare hot-warm, hot-cold, and hot-hot when geography, cost, or active-active operation changes the decision.
- Platform operators preparing a deployment should read self-hosting prerequisites, choose the matching pattern page, and use backup and recovery to plan restore tests and recovery drills.
- Security and risk reviewers should use the pattern page to confirm the failover boundary, then review backup and recovery for restore ownership, validation, and drill evidence.