DALP high availability: choose your HA strategy

How self-hosted DALP operators choose a high availability and disaster recovery pattern, with recovery metrics, ownership boundaries, and links to the supported deployment scenarios.

Self-hosted DALP deployments need an availability pattern before production workloads go live. Start with a cloud-native, multi-zone design when the operator can use managed Kubernetes, managed PostgreSQL, managed cache, and object storage. Move to hot-warm, hot-cold, or hot-hot only when recovery targets, geographic requirements, or cost constraints justify the extra operating model.

The cloud-native pattern is the baseline for most self-hosted deployments. Managed services carry common failover mechanics for PostgreSQL, cache, and storage while keeping the monthly operating burden lower than self-managed cross-region patterns.

What the availability pattern covers

A high availability pattern decides how DALP keeps the application, database, cache, object storage, backups, indexing, chain access, and dependent infrastructure usable when an infrastructure component fails.

This overview is for platform operators preparing the deployment, buyers comparing operating models, and security or risk reviewers checking recovery ownership.

The pattern does not replace the operator's incident process, custody procedures, client communication plan, or post-incident validation. Before deployment, assign owners for failover, restore testing, monitoring, and post-incident checks. The selected pattern only works when those duties are staffed and tested.

For the recommended baseline, read cloud-native next.

Responsibility boundary

Availability evidence has three layers. Keep them separate when you review a production deployment.

Layer	What it covers	What it does not cover
Platform capability	DALP application services, transaction workflow state, chain gateway configuration, chain indexing, and recovery runbooks	A guaranteed uptime percentage by itself
Infrastructure dependency	Kubernetes or OpenShift, PostgreSQL, cache, object storage, RPC endpoints, custody provider access, network routing, and observability systems	Provider SLAs or contractual remedies
Contractual SLA	The commercial availability commitment agreed for the deployment and its supporting providers	A technical failover design or restore drill result

DALP availability depends on the configured infrastructure and provider contracts. Do not turn a deployment pattern, an RTO target, or a successful drill into an SLA unless the contractual SLA, cloud provider SLA, custody provider SLA, and RPC provider SLA support that commitment.
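
As a rough illustration of why a single layer cannot carry an SLA on its own, the sketch below multiplies the availability of every required dependency. It assumes the dependencies are all required at once and fail independently, which is a simplification. The figures are hypothetical placeholders, not real provider SLAs.

```python
from math import prod

def composite_availability(availabilities):
    """Combined availability when every dependency is required and
    failures are assumed independent (a simplifying assumption)."""
    values = list(availabilities)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError("availability must be a fraction between 0 and 1")
    return prod(values)

# Hypothetical figures for illustration only.
dependencies = {
    "cloud_platform": 0.9995,
    "custody_provider": 0.999,
    "rpc_provider": 0.999,
}

combined = composite_availability(dependencies.values())
print(f"Combined availability: {combined:.4%}")
```

Under these assumptions the combined figure is lower than the weakest single dependency, so a commitment stronger than that figure needs support from the contracts, not only from the deployment pattern.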

The pattern has two loops. The live-service loop keeps ingress, DALP services, PostgreSQL, Redis, and object storage available during normal operation. The recovery-evidence loop proves that backups, restore access, monitoring, and owner sign-off can recover the service within the selected RTO while backup and replication checks validate the selected RPO.

Recovery metrics

Metric	Meaning	How to use it
RTO	Maximum acceptable downtime	Set the target before choosing the pattern
RPO	Maximum acceptable data loss	Match the target to database, cache, object storage, and logs
RTT	Measured recovery time after testing	Record it during restore drills and compare it with the RTO

RTO and RPO are targets. RTT is evidence. A pattern is not production-ready until the operator has run a recovery drill, measured the restore time, validated application and data state, and accepted any gap between the target and the tested result.
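
The comparison between targets and drill evidence can be sketched as a small check. This is a minimal example, not a DALP tool: the `DrillResult` type and its field names are hypothetical, and the measured data-loss window is assumed to come from backup age, replication lag, or restored data timestamps.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class DrillResult:
    rto: timedelta          # target: maximum acceptable downtime
    rpo: timedelta          # target: maximum acceptable data loss
    rtt: timedelta          # measured recovery time from the drill
    data_loss: timedelta    # measured data-loss window from the drill
    state_validated: bool   # application and data state checked after restore

def drill_gaps(result):
    """Return the gaps the operator must close or explicitly accept."""
    gaps = []
    if result.rtt > result.rto:
        gaps.append(f"RTT exceeds RTO by {result.rtt - result.rto}")
    if result.data_loss > result.rpo:
        gaps.append(f"data loss exceeds RPO by {result.data_loss - result.rpo}")
    if not result.state_validated:
        gaps.append("application and data state not validated")
    return gaps

# Hypothetical drill: recovery took 20 minutes against a 15 minute RTO.
drill = DrillResult(
    rto=timedelta(minutes=15),
    rpo=timedelta(minutes=1),
    rtt=timedelta(minutes=20),
    data_loss=timedelta(seconds=30),
    state_validated=True,
)
gaps = drill_gaps(drill)
```

An empty list means the drill met both targets and the state was validated. Any entry is a gap that needs an owner decision before go-live.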

Choose a deployment scenario

Use the scenario table as an operating model filter. The RTO and RPO ranges are planning targets for the selected pattern. The ranges become evidence only after the operator runs the matching restore or failover drill, records the achieved recovery time, and validates the data-loss window from backup age, replication lag, or restored data timestamps.

Scenario	RTO target	RPO target	Monthly effort	Use when
Cloud-native	2 to 15 minutes	Seconds to 1 minute	8 to 16 hours	Most self-hosted deployments
Hot-warm	30 to 180 minutes	5 to 60 minutes	25 to 40 hours	You need geographic redundancy
Hot-cold	8 to 72 hours	4 to 24 hours	10 to 20 hours	Cost matters more than fast recovery
Hot-hot for consortium networks	1 to 10 minutes	Seconds to minutes	40 to 60 hours	Multiple active regions share responsibility
Hot-hot for public networks	1 to 10 minutes	1 to 5 minutes	20 to 30 hours	On-chain state can be re-derived after an outage
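
One way to use the table as a first filter is sketched below. It takes the upper bound of each RTO range and each monthly effort range from the table and shortlists the scenarios that fit a required RTO and an effort budget. The `Scenario` type and `shortlist` function are hypothetical names for illustration. The sketch ignores RPO, geography, and the "Use when" guidance, which still need a human decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    rto_max_minutes: int    # upper bound of the RTO target range
    effort_max_hours: int   # upper bound of the monthly effort range

# Upper bounds copied from the scenario table.
SCENARIOS = [
    Scenario("Cloud-native", 15, 16),
    Scenario("Hot-warm", 180, 40),
    Scenario("Hot-cold", 72 * 60, 20),
    Scenario("Hot-hot for consortium networks", 10, 60),
    Scenario("Hot-hot for public networks", 10, 30),
]

def shortlist(required_rto_minutes, effort_budget_hours):
    """Scenarios whose worst-case RTO target and effort fit the constraints."""
    return [
        s.name
        for s in SCENARIOS
        if s.rto_max_minutes <= required_rto_minutes
        and s.effort_max_hours <= effort_budget_hours
    ]

candidates = shortlist(required_rto_minutes=15, effort_budget_hours=30)
print(candidates)
```

The shortlist is a planning aid only. The ranges remain targets until a drill produces measured results for the chosen pattern.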

Start with the cloud-native pattern unless a specific requirement rules it out. Managed services carry the failover mechanics for PostgreSQL, cache, and storage.

Use the alternative patterns when the operator accepts the extra runbook, monitoring, and drill burden. Document that acceptance before production so the recovery pattern, staffing model, and contractual SLA do not drift apart.

Kubernetes high availability assumptions

Production self-hosting assumes a Kubernetes or OpenShift cluster that can keep workloads scheduled during a zone or node failure. The baseline is at least three availability zones, enough worker capacity to reschedule pods after one zone is unavailable, standard topology labels, pod disruption budgets, and a load balancer or route layer that can send traffic only to healthy pods.
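
The capacity assumption can be checked with simple arithmetic: after removing any single zone, the remaining worker capacity must still cover the workload. The sketch below uses an abstract capacity unit (for example, schedulable pod slots). The zone names and numbers are hypothetical.

```python
def survives_zone_loss(zone_capacity, required_capacity):
    """True when at least three zones exist and the capacity left after
    losing any single zone still covers the required workload."""
    if len(zone_capacity) < 3:
        return False
    total = sum(zone_capacity.values())
    return all(
        total - capacity >= required_capacity
        for capacity in zone_capacity.values()
    )

# Hypothetical cluster: three zones with 10 capacity units each.
zones = {"zone-a": 10, "zone-b": 10, "zone-c": 10}

fits = survives_zone_loss(zones, required_capacity=20)
too_tight = survives_zone_loss(zones, required_capacity=21)
```

This check covers raw capacity only. Pod disruption budgets, topology spread, and storage placement still need their own review.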

Control-plane availability belongs to the cluster provider or the operator's Kubernetes platform team. The DALP pattern assumes the control plane remains reachable for scheduling, rollout, and failover work during an incident. If the control plane is self-managed, document the control-plane quorum, backup, restore, and upgrade procedure before production.

Failover triggers must be observable. Treat pod crash loops, node readiness loss, zone unavailability, PostgreSQL primary failover, cache primary failover, object storage access failure, RPC endpoint failure, indexer lag, and queue backlog as conditions that can start the incident runbook.

Chain access and indexing recovery

DALP uses EVM RPC access and the chain indexer as part of the availability boundary. Chain access and indexed views need separate redundancy checks because RPC endpoints and indexers fail differently from application pods.

Area	Availability expectation	Recovery expectation
Blockchain node or RPC access	Configure at least two reachable RPC endpoints or providers for each production network when supported by the network design. Keep provider limits, block-range limits, and authentication material documented.	Fail over reads, writes, subscriptions, and log fetching to a healthy endpoint. Validate transaction broadcast and chain reads after failover.
Chain Gateway	Keep network configuration, gas settings, finality depth, and fallback endpoints aligned with the selected EVM network.	Re-test transaction submission and status reads before ending the incident.
Chain Indexer	Monitor indexer process health, block lag, reorg handling, and per-chain indexing state.	Restart or redeploy the indexer, replay from the last safe checkpoint, and re-process affected blocks when a reorg invalidates previously indexed logs.
Event consumers	Treat provisional, final, retracted, and recalled events as separate operational states.	Reconcile downstream systems against the final indexed state after replay or reorg recovery.
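
The RPC failover expectation in the table can be sketched as a priority-ordered health check. This is a minimal sketch, not DALP's Chain Gateway logic: the `probe` callable, the endpoint URLs, and the chain head age threshold are all assumptions for illustration. The probe is expected to return the chain head age in seconds, or `None` when the endpoint is unreachable.

```python
from typing import Callable, Optional, Sequence

def pick_healthy_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], Optional[float]],
    max_head_age_seconds: float,
) -> Optional[str]:
    """Return the first endpoint, in priority order, that is reachable and
    whose chain head is fresh enough. Return None if none qualifies."""
    for url in endpoints:
        head_age = probe(url)
        if head_age is not None and head_age <= max_head_age_seconds:
            return url
    return None

# Hypothetical probe results: the primary is unreachable, the fallback is fresh.
observed = {
    "https://rpc-primary.example": None,
    "https://rpc-fallback.example": 4.0,
}

selected = pick_healthy_endpoint(
    list(observed),
    probe=observed.get,
    max_head_age_seconds=30.0,
)
```

A `None` result means no configured endpoint is usable, which is itself a condition that should start the incident runbook. After a switch, transaction broadcast and chain reads still need validation against the new endpoint.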

Disaster recovery region coverage

A regional disaster recovery plan needs at least two region roles: one primary service region and one recovery region. The primary region serves production traffic. The recovery region holds the infrastructure, secrets access, database restore path, object storage replication or backup access, RPC configuration, and runbooks needed to resume service.

Hot-warm is active-passive. The primary region serves traffic and the recovery region stays ready for promotion. Failover is a controlled promotion of database, cache, application, ingress, RPC, indexer, and operational ownership to the recovery region.

Hot-cold is restore-based active-passive. The recovery region may not run the full stack until an incident. The runbook must prove that backups, images, secrets, DNS or ingress, and provider access can recreate the service inside the accepted RTO and RPO.

Hot-hot is active-active. More than one region serves traffic at the same time. Use this pattern only when the operator can handle multi-cluster routing, data consistency, indexed-state reconciliation, and provider limits. The operator also needs incident ownership across active regions.

Monitoring and alerting expectations

Monitoring must cover the platform, infrastructure, and external dependencies that make the selected availability pattern work. At minimum, alert on:

API availability, request latency, error rate, and authentication failures
pod restarts, crash loops, node readiness, zone health, and ingress or route health
PostgreSQL failover state, replication lag, connection pressure, backup success, and restore-test age
cache primary health, memory pressure, persistence status, and failover state
object storage availability, backup write status, replication status, and restore access
queue backlog, worker health, transaction workflow age, and stuck execution states
RPC endpoint availability, chain head age, block lag, provider errors, and provider rate limiting
indexer process health, per-chain indexing lag, replay progress, and reorg or retraction events
custody or HSM provider reachability for signing-dependent workflows

Every alert needs an owner, severity, runbook link, escalation path, and test cadence. Dashboards are evidence only when alerts fire, route to the right owner, and drive a tested response.
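
The alert requirements above lend themselves to an automated review. The sketch below flags alerts that lack any of the five required attributes. The dictionary keys and the example alerts are hypothetical and would need mapping to whatever alerting system the operator uses.

```python
REQUIRED_FIELDS = (
    "owner",
    "severity",
    "runbook_link",
    "escalation_path",
    "test_cadence",
)

def incomplete_alerts(alerts):
    """Map each alert name to the required fields that are missing or empty."""
    problems = {}
    for alert in alerts:
        missing = [field for field in REQUIRED_FIELDS if not alert.get(field)]
        if missing:
            problems[alert.get("name", "<unnamed>")] = missing
    return problems

# Hypothetical alert definitions.
alerts = [
    {
        "name": "postgres-replication-lag",
        "owner": "platform-team",
        "severity": "high",
        "runbook_link": "https://runbooks.example/postgres-lag",
        "escalation_path": "on-call -> platform lead",
        "test_cadence": "quarterly",
    },
    {
        "name": "indexer-block-lag",
        "owner": "platform-team",
        "severity": "high",
        "runbook_link": "",
    },
]

problems = incomplete_alerts(alerts)
```

A complete definition is necessary but not sufficient: the alert still has to fire in a test and reach the right owner before it counts as evidence.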

Production checks before go-live

Before treating the deployment as production-ready, confirm each check below and keep the listed evidence:

Check	Evidence to keep
Nodes are distributed across at least three availability zones, as required by the self-hosting prerequisites	Cluster topology, scheduling capacity, and pod disruption budget review
Managed PostgreSQL high availability or an equivalent PostgreSQL failover design is configured	Provider failover setting, replica status, PITR configuration, and restore-test result
Cache redundancy and TLS encryption are configured	Cache topology, persistence mode, certificate path, and failover-test result
Object storage backups and restore access are configured	Bucket policy, versioning or lifecycle rule, backup write result, and restore credential test
At least one backup restore test has run	Drill timestamp, restored namespace inventory, application health checks, and measured RTT
Monitoring alerts cover API availability, database health, cache health, queue lag, storage access, and backup status	Alert list with owner, severity, escalation path, and last test result
Incident owners are assigned for failover, restore, validation, and client communication	Runbook owner list and escalation rota

Buyers comparing operating models should start with cloud-native for the recommended baseline, then compare hot-warm, hot-cold, and hot-hot when geography, cost, or active-active operation changes the decision.
Platform operators preparing a deployment should read self-hosting prerequisites, choose the matching pattern page, and use backup and recovery to plan restore tests and recovery drills.
Security and risk reviewers should use the pattern page to confirm the failover boundary, then review backup and recovery for restore ownership, validation, and drill evidence.
