DALP hot-hot HA: active-active multi-cluster operation

Compare DALP hot-hot and hybrid multi-region deployment patterns for consortium and public EVM networks, including provider patterns, outage behaviour, recovery targets, and when to choose this model.

Hot-hot is DALP's active-active availability pattern. More than one region can serve traffic at the same time, so the design reduces user-facing failover time but raises the operating burden for traffic routing, data consistency, indexed-state reconciliation, and incident ownership.

Choose hot-hot only after comparing the simpler HA patterns: cloud-native, hot-warm, and hot-cold. For many self-hosted deployments, cloud-native multi-zone HA or hot-warm is enough.

Choose the correct hot-hot variant

DALP uses the same active-active idea in two different operating models:

Variant	Use when	Recovery model	Main operator burden
Consortium network	You operate validators or validator-adjacent infrastructure across regions.	Keep validators, RPC access, DALP services, and PostgreSQL topology healthy across clusters.	Consensus participation, validator placement, cross-cluster database design, and regional traffic routing.
Public EVM network	The chain is external and DALP reads on-chain truth through RPC and DIDX indexing.	Shift user traffic to a healthy cluster and rebuild indexed state from the chain when needed.	RPC availability, DIDX sync health, database failover, and reconciliation after cluster failover.

Hybrid multi-region deployments

DALP supports hybrid deployments where core platform services remain in an on-premises Kubernetes or OpenShift estate while blockchain access, RPC nodes, validator-adjacent infrastructure, or DIDX indexing capacity runs in cloud regions. The supported cloud families are AWS, Azure, and GCP when the selected regions provide the required managed Kubernetes, PostgreSQL, Redis, object storage, backup, and observability services. SettleMint confirms the exact provider regions during deployment planning because region availability depends on the selected cloud account and regulatory boundary.

Treat the hybrid split as an operating boundary, not only as a network diagram. Each side needs clear ownership, health checks, credentials, route failover, and recovery evidence.

Surface	What can be split across estates	What must stay aligned
DALP services	dApp, API, workers, ingress, and observability can run in the primary application cluster or on-premises estate.	Chart values, secrets, PostgreSQL connectivity, Redis connectivity, object storage access, and route health.
Blockchain nodes	Consortium deployments can run validators and RPC nodes in separate regions or clusters when the network design supports regional node placement. Public-network deployments use external RPC access instead of operating public-chain validators.	Chain ID, genesis or network configuration, finality assumptions, RPC authentication, provider limits, and failover runbooks.
Chain Indexer	DIDX can run beside the DALP services or in a cloud estate with RPC access. It can rebuild chain-derived state from the chain and PostgreSQL checkpoints.	Per-chain checkpoints, block lag, reorg handling, registered contract coverage, and indexed-state validation after failover.
Data services	PostgreSQL, cache, object storage, and backups can use managed cloud services or approved self-hosted services.	HA mode, replication lag, backup retention, restore access, and tested application reconnection.

Supported cloud provider pattern

Use AWS, Azure, or GCP regions that meet the self-hosting prerequisites. DALP does not require a fixed region list. The deployment must use region pairs or recovery regions approved by the operator, the cloud account, and the data-residency requirement.
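The three-way approval rule can be sketched as a set intersection. This is an illustrative check only: the function names and the region labels (`region-a`, `region-b`, `region-c`) are hypothetical and do not come from DALP or any cloud provider.

```python
def approved_regions(operator: set[str], cloud_account: set[str], residency: set[str]) -> set[str]:
    """Regions acceptable to the operator, the cloud account, and the data-residency rule."""
    return operator & cloud_account & residency


def validate_region_pair(primary: str, recovery: str, allowed: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the pair is acceptable."""
    problems = []
    if primary == recovery:
        problems.append("primary and recovery regions must differ")
    for role, region in (("primary", primary), ("recovery", recovery)):
        if region not in allowed:
            problems.append(f"{role} region {region} is not approved by all three parties")
    return problems


# Hypothetical approval lists for illustration.
allowed = approved_regions(
    operator={"region-a", "region-b", "region-c"},
    cloud_account={"region-a", "region-b"},
    residency={"region-a", "region-b", "region-c"},
)
print(validate_region_pair("region-a", "region-c", allowed))
```

A check like this fits naturally into deployment planning, before any cluster is provisioned.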

Provider family	Cloud services used in the pattern	Region requirement
AWS	EKS or OpenShift, RDS PostgreSQL Multi-AZ, ElastiCache Multi-AZ, S3, CloudWatch, Managed Prometheus, and Managed Grafana.	Choose primary and recovery regions where these services are available and approved for the deployment.
Azure	AKS or OpenShift, Azure Database for PostgreSQL Flexible Server with zone-redundant HA, Azure Cache for Redis, Blob Storage, Azure Monitor, and Managed Grafana.	Choose primary and recovery regions where these services are available and approved for the deployment.
GCP	GKE or OpenShift, Cloud SQL Regional HA, Memorystore Standard tier, Cloud Storage, Cloud Monitoring, and Cloud Logging.	Choose primary and recovery regions where these services are available and approved for the deployment.

Regional cloud outage behaviour

During a cloud-region outage, DALP can continue operating through the remaining healthy region only for the surfaces that have been deployed and tested in that second region. If the cloud-hosted node or indexer is single-region, the on-premises DALP estate can stay up, but chain reads, chain writes, and indexed-state freshness depend on restoring RPC and DIDX access.
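As a minimal sketch of routing to a healthy RPC endpoint, the selection below walks a priority-ordered list and picks the first endpoint that passes its health check and is close enough to the chain head. The `RpcEndpoint` type, the endpoint names, and the `max_blocks_behind` threshold are assumptions for illustration, not DALP configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RpcEndpoint:
    name: str            # hypothetical label for a region or provider
    healthy: bool        # result of the latest health check
    blocks_behind: int   # how far the endpoint trails the best known head


def pick_rpc_endpoint(endpoints: list[RpcEndpoint], max_blocks_behind: int = 5) -> Optional[RpcEndpoint]:
    """Return the first endpoint, in priority order, that is healthy and fresh enough."""
    for endpoint in endpoints:
        if endpoint.healthy and endpoint.blocks_behind <= max_blocks_behind:
            return endpoint
    return None  # no usable endpoint: chain reads and writes wait for recovery


endpoints = [
    RpcEndpoint("region-a", healthy=False, blocks_behind=0),
    RpcEndpoint("region-b", healthy=True, blocks_behind=2),
]
selected = pick_rpc_endpoint(endpoints)
print(selected.name if selected else "no healthy RPC endpoint")
```

Returning `None` rather than a stale endpoint keeps the single-region case explicit: the application estate can stay up, but chain access is unavailable until RPC is restored.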

Surface	Failover behaviour	RTO expectation	RPO expectation
RPC nodes or external RPC	Route DALP to the healthy RPC endpoint or provider region after health checks fail.	1 to 10 minutes	Seconds to minutes for endpoint freshness; 0 for on-chain state because the EVM chain remains authoritative.
Consortium validators	Surviving validators keep the network healthy only when the consensus design tolerates the failed region.	1 to 10 minutes	Seconds to minutes, depending on finality and database replication lag.
DIDX indexer with healthy RPC	The indexer resumes from the last checkpoint and catches up from chain data.	1 to 10 minutes	Seconds to minutes for checkpointed indexed state.
DIDX full rebuild	Rebuild indexed state from the chain when checkpointed state or the indexed database cannot be trusted.	5 to 60 minutes	Not applicable to on-chain truth.
On-premises DALP services	The application estate stays available if its database, cache, routes, and secrets remain healthy.	1 to 10 minutes	Seconds to minutes, depending on database, cache, and route failover state.
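The two DIDX rows in the table describe a choice between resuming from a checkpoint and rebuilding from the chain. A hedged sketch of that decision follows; the enum and the three boolean inputs are hypothetical names, and how trust in a checkpoint or database is established is left to the operator's runbook.

```python
from enum import Enum


class IndexerRecovery(Enum):
    WAIT_FOR_RPC = "wait"
    RESUME_FROM_CHECKPOINT = "resume"
    FULL_REBUILD = "rebuild"


def choose_indexer_recovery(rpc_healthy: bool, checkpoint_trusted: bool, database_trusted: bool) -> IndexerRecovery:
    """Pick the cheapest recovery path that still yields trustworthy indexed state."""
    if not rpc_healthy:
        # Neither path works without chain access.
        return IndexerRecovery.WAIT_FOR_RPC
    if checkpoint_trusted and database_trusted:
        # Fast path: catch up from the last checkpoint.
        return IndexerRecovery.RESUME_FROM_CHECKPOINT
    # Slow path: replay chain data into a fresh index.
    return IndexerRecovery.FULL_REBUILD


print(choose_indexer_recovery(rpc_healthy=True, checkpoint_trusted=False, database_trusted=True).value)
```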

Do not treat a healthy blockchain node as proof that the application estate is healthy. Do not treat a healthy application pod as proof that RPC access, indexing, or database recovery can survive a regional incident. Production evidence needs both views: service health from Kubernetes and DALP observability, plus chain health from RPC, DIDX lag, finality, and reorg signals.
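One way to express the two-view requirement is a readiness function that reports failures from both the service side and the chain side. The field names and the `max_lag_blocks` threshold below are assumptions for illustration; real signals would come from Kubernetes, DALP observability, RPC, and DIDX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceView:
    pods_ready: bool
    database_reachable: bool
    cache_reachable: bool


@dataclass(frozen=True)
class ChainView:
    rpc_reachable: bool
    indexer_lag_blocks: int
    reorg_in_progress: bool


def region_failures(service: ServiceView, chain: ChainView, max_lag_blocks: int = 20) -> list[str]:
    """List every failing signal; a region is ready only when the list is empty."""
    failures = []
    if not service.pods_ready:
        failures.append("pods not ready")
    if not service.database_reachable:
        failures.append("database unreachable")
    if not service.cache_reachable:
        failures.append("cache unreachable")
    if not chain.rpc_reachable:
        failures.append("RPC unreachable")
    if chain.indexer_lag_blocks > max_lag_blocks:
        failures.append("indexer lag above threshold")
    if chain.reorg_in_progress:
        failures.append("reorg in progress")
    return failures


service = ServiceView(pods_ready=True, database_reachable=True, cache_reachable=True)
chain = ChainView(rpc_reachable=True, indexer_lag_blocks=400, reorg_in_progress=False)
print(region_failures(service, chain))
```

In this example every pod is healthy, yet the region is not ready because the indexer trails the chain.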

Consortium networks

In a consortium network, hot-hot means several active regions participate in the operating model. Each region runs DALP services, RPC access, PostgreSQL, and any validator infrastructure required by the target network design.
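Whether the network survives the loss of one region depends on validator placement. The sketch below assumes a BFT-style protocol that needs at least two-thirds of validators online to keep producing blocks (as in QBFT or IBFT 2.0); the source does not name the consensus protocol, so treat the quorum rule and the region names as assumptions.

```python
def quorum_size(total_validators: int) -> int:
    """Smallest validator count that is at least two-thirds of the total (ceiling of 2N/3)."""
    return (2 * total_validators + 2) // 3


def regions_whose_loss_halts_consensus(validators_per_region: dict[str, int]) -> list[str]:
    """Regions that cannot fail without dropping the network below quorum."""
    total = sum(validators_per_region.values())
    needed = quorum_size(total)
    return [
        region
        for region, count in validators_per_region.items()
        if total - count < needed
    ]


# Two regions with two validators each: losing either leaves 2 of 4, below the quorum of 3.
print(regions_whose_loss_halts_consensus({"region-a": 2, "region-b": 2}))
# Four regions with one validator each: any single region can fail.
print(regions_whose_loss_halts_consensus({"region-a": 1, "region-b": 1, "region-c": 1, "region-d": 1}))
```

Under this assumption, an even split across only two regions gives no regional fault tolerance, which is one reason validator placement appears in the operator burden.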

Recovery targets

Metric	Target	Notes
RTO	1 to 10 minutes	Traffic management shifts users away from an unhealthy region.
RPO	Seconds to minutes	Depends on database replication lag and the final failover procedure.
Recovery test time	10 to 60 minutes	Includes health checks, traffic rerouting, and operator validation.

Setup and maintenance

Task	Time estimate	Client role
Four-cluster provisioning	1 to 2 days	Platform engineer
Network connectivity, peering, or VPN	1 to 2 days	Network engineer
CloudNativePG setup across clusters	1 to 2 days	Platform engineer
PostgreSQL distributed topology	2 to 3 days	DBA or platform engineer
Failover automation and testing	2 to 3 days	Platform engineer
End-to-end DR drill	1 to 2 days	Platform team
Initial setup	3 to 5 weeks	2 to 3 client engineers

Activity	Frequency	Time per cycle
Cross-cluster replication monitoring	Daily	30 minutes
Backup verification across clusters	Weekly	2 hours
Helm chart updates across clusters	Monthly	4 to 8 hours
DR drill or failover test	Quarterly	1 to 2 days
Security patching across clusters	Monthly	1 to 2 days
Monthly effort		40 to 60 hours

Plan for 1.5 to 2 FTE platform engineers, DBA support, and a 24/7 on-call rotation. The cost is justified only when active regions and low failover time matter more than operating simplicity.

Public EVM networks

For public EVM networks, DALP does not operate the chain validators. The public chain remains the source of truth. DALP keeps user-facing services available across regions and uses RPC plus DIDX indexing to read and rebuild chain-derived state.

What changes from consortium hot-hot

The operator does not manage validators for the public chain.
Indexed data can be rebuilt by replaying chain data through DIDX.
Regional failover focuses on service health, RPC reachability, DIDX sync, and PostgreSQL availability.
Recovery evidence should include indexed-state checks, not only Kubernetes pod health.
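An indexed-state check can be as simple as comparing the indexer's checkpoint against the chain. The sketch below is an assumption about what such a check might look like, not the DIDX API: the `Checkpoint` type, the status strings, and the `max_lag_blocks` threshold are hypothetical, and the canonical hash would be fetched from an RPC endpoint in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    block_number: int
    block_hash: str


def validate_checkpoint(
    checkpoint: Checkpoint,
    chain_head: int,
    canonical_hash_at_checkpoint: str,
    max_lag_blocks: int = 50,
) -> str:
    """Classify a checkpoint as ok, lagging, reorged, or ahead-of-chain."""
    if checkpoint.block_number > chain_head:
        # The RPC endpoint is behind the indexer, or the checkpoint is corrupt.
        return "ahead-of-chain"
    if checkpoint.block_hash != canonical_hash_at_checkpoint:
        # The indexed block is no longer on the canonical chain.
        return "reorged"
    if chain_head - checkpoint.block_number > max_lag_blocks:
        return "lagging"
    return "ok"


checkpoint = Checkpoint(block_number=1_000, block_hash="0xabc")
print(validate_checkpoint(checkpoint, chain_head=1_010, canonical_hash_at_checkpoint="0xabc"))
```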

Recovery targets

Scenario	RTO	RPO	Notes
Single pod failure	Less than 1 minute	0	Kubernetes reschedules automatically.
Database failover	1 to 5 minutes	Seconds	CloudNativePG or the managed database service promotes a healthy replica.
Cluster failover	1 to 10 minutes	1 to 5 minutes	Traffic shifts to a healthy cluster after health checks fail.
Full re-index required	5 to 60 minutes	Not applicable	Timing depends on chain size, RPC throughput, and DIDX backlog.
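The RTO column above can be turned into a simple pass or fail check for drill measurements. The upper bounds come from the table; the scenario keys and function name are illustrative, and the sketch treats "less than 1 minute" as an inclusive 60-second bound for simplicity.

```python
# Upper RTO bounds in seconds, taken from the recovery targets table.
RTO_UPPER_BOUND_SECONDS = {
    "single_pod_failure": 60,
    "database_failover": 5 * 60,
    "cluster_failover": 10 * 60,
    "full_reindex": 60 * 60,
}


def drill_within_target(scenario: str, measured_seconds: float) -> bool:
    """True when a measured recovery time is within the scenario's upper RTO bound."""
    return measured_seconds <= RTO_UPPER_BOUND_SECONDS[scenario]


# A hypothetical drill measurement: cluster failover completed in 7.5 minutes.
print(drill_within_target("cluster_failover", 7.5 * 60))
```

Recording each drill result against these bounds gives a running record of whether the targets still hold as the chain and the deployment grow.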

Setup and maintenance

Task	Time estimate	Client role
Two-cluster provisioning	1 day	Platform engineer
CloudNativePG setup across clusters	1 day	Platform engineer
DIDX setup	1 to 2 days	Platform engineer
Global traffic management	4 to 8 hours	Platform engineer
Initial setup	1.5 to 2 weeks	1 to 2 client engineers

Activity	Frequency	Time per cycle
Replication-lag monitoring	Daily	15 minutes
DIDX sync verification	Daily	15 minutes
DR drill or failover test	Quarterly	4 to 8 hours
Security patching across clusters	Monthly	4 to 8 hours
Monthly effort		20 to 30 hours

Plan for 0.5 to 1 FTE platform engineer. The model is lighter than consortium hot-hot because the public chain owns consensus, but operators still need clear ownership for RPC health, indexing lag, database promotion, and traffic failover.

Operating checks before production

Before running DALP in hot-hot mode, verify:

traffic management can remove a failed region without sending users to a partially healthy DALP stack;
PostgreSQL promotion, backup restore, and point-in-time recovery are tested for the chosen managed or CloudNativePG topology;
DIDX sync, handler errors, and backfill progress are monitored for every active public-network region;
DR drills include application checks, database checks, chain/RPC checks, and user-facing route checks;
one incident owner can decide when to drain a region, promote a database, or rebuild indexed state.
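The DR drill item names four kinds of check. A minimal sketch of enforcing that coverage is shown below; the check keys are hypothetical labels for the categories listed above, and a missing result counts as a failure.

```python
REQUIRED_DRILL_CHECKS = ("application", "database", "chain_rpc", "user_route")


def missing_or_failed_checks(results: dict[str, bool]) -> list[str]:
    """Return required checks that were skipped or did not pass."""
    return [check for check in REQUIRED_DRILL_CHECKS if not results.get(check, False)]


# A drill that verified pods and the database but never exercised chain or route checks.
drill_results = {"application": True, "database": True}
print(missing_or_failed_checks(drill_results))
```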

Use the observability documentation for DIDX and runtime alerting, and the backup and recovery documentation for restore-test evidence.
