Hot-hot active-active HA
Compare DALP hot-hot and hybrid multi-region deployment patterns for consortium and public EVM networks, including provider patterns, outage behaviour, recovery targets, and when to choose this model.
Hot-hot is DALP's active-active availability pattern. More than one region can serve traffic at the same time, so the design reduces user-facing failover time but raises the operating burden for traffic routing, data consistency, indexed-state reconciliation, and incident ownership.
Choose hot-hot only after comparing the simpler HA patterns: cloud-native, hot-warm, and hot-cold. For many self-hosted deployments, cloud-native multi-zone HA or hot-warm is enough.
Choose the correct hot-hot variant
DALP uses the same active-active idea in two different operating models:
| Variant | Use when | Recovery model | Main operator burden |
|---|---|---|---|
| Consortium network | You operate validators or validator-adjacent infrastructure across regions. | Keep validators, RPC access, DALP services, and PostgreSQL topology healthy across clusters. | Consensus participation, validator placement, cross-cluster database design, and regional traffic routing. |
| Public EVM network | The chain is external and DALP reads on-chain truth through RPC and DIDX indexing. | Shift user traffic to a healthy cluster and rebuild indexed state from the chain when needed. | RPC availability, DIDX sync health, database failover, and reconciliation after cluster failover. |
Hybrid multi-region deployments
DALP supports hybrid deployments where core platform services remain in an on-premises Kubernetes or OpenShift estate while blockchain access, RPC nodes, validator-adjacent infrastructure, or DIDX indexing capacity runs in cloud regions. AWS, Azure, and GCP are supported when the selected regions provide the required managed Kubernetes, PostgreSQL, Redis, object storage, backup, and observability services. SettleMint confirms the exact provider regions during deployment planning because region availability depends on the selected cloud account and regulatory boundary.
Treat the hybrid split as an operating boundary, not only as a network diagram. Each side needs clear ownership, health checks, credentials, route failover, and recovery evidence.
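One way to make the boundary concrete is to record the five items for each side and flag whatever is still missing. The sketch below is illustrative only: the `EstateBoundary` type, its field names, and the sample values are hypothetical and are not part of DALP.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EstateBoundary:
    """Operating facts that one side of the hybrid split should have on record."""
    name: str
    owner: str = ""
    health_check: str = ""
    credentials_ref: str = ""
    route_failover: str = ""
    recovery_evidence: str = ""


def missing_items(estate: EstateBoundary) -> list[str]:
    """Return the boundary items that are still empty for this estate."""
    return [
        f.name
        for f in fields(estate)
        if f.name != "name" and not getattr(estate, f.name).strip()
    ]


# Hypothetical sample data for the two sides of a hybrid deployment.
on_prem = EstateBoundary(
    name="on-premises",
    owner="platform-team",
    health_check="ingress probe",
    credentials_ref="secret store entry",
    route_failover="failover runbook",
    recovery_evidence="latest drill report",
)
cloud = EstateBoundary(name="cloud-region", owner="platform-team", health_check="rpc probe")

for estate in (on_prem, cloud):
    print(estate.name, "missing:", missing_items(estate))
```

An estate with any missing item is a candidate for follow-up before the split is treated as production-ready.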
| Surface | What can be split across estates | What must stay aligned |
|---|---|---|
| DALP services | dApp, API, workers, ingress, and observability can run in the primary application cluster or on-premises estate. | Chart values, secrets, PostgreSQL connectivity, Redis connectivity, object storage access, and route health. |
| Blockchain nodes | Consortium deployments can run validators and RPC nodes in separate regions or clusters when the network design supports regional node placement. Public-network deployments use external RPC access instead of operating public-chain validators. | Chain ID, genesis or network configuration, finality assumptions, RPC authentication, provider limits, and failover runbooks. |
| Chain Indexer (DIDX) | DIDX can run beside the DALP services or in a cloud estate with RPC access. It can rebuild chain-derived state from the chain and PostgreSQL checkpoints. | Per-chain checkpoints, block lag, reorg handling, registered contract coverage, and indexed-state validation after failover. |
| Data services | PostgreSQL, cache, object storage, and backups can use managed cloud services or approved self-hosted services. | HA mode, replication lag, backup retention, restore access, and tested application reconnection. |
Supported cloud provider pattern
Use AWS, Azure, or GCP regions that meet the self-hosting prerequisites. DALP does not require a fixed region list. The deployment must use region pairs or recovery regions that the operator has approved, that are available to the cloud account, and that satisfy the data-residency requirement.
| Provider family | Cloud services used in the pattern | Region requirement |
|---|---|---|
| AWS | EKS or OpenShift, RDS PostgreSQL Multi-AZ, ElastiCache Multi-AZ, S3, CloudWatch, Managed Prometheus, and Managed Grafana. | Choose primary and recovery regions where these services are available and approved for the deployment. |
| Azure | AKS or OpenShift, Azure Database for PostgreSQL Flexible Server with zone-redundant HA, Azure Cache for Redis, Blob Storage, Azure Monitor, and Managed Grafana. | Choose primary and recovery regions where these services are available and approved for the deployment. |
| GCP | GKE or OpenShift, Cloud SQL Regional HA, Memorystore Standard tier, Cloud Storage, Cloud Monitoring, and Cloud Logging. | Choose primary and recovery regions where these services are available and approved for the deployment. |
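The region requirement can be checked mechanically during planning. The following is a minimal sketch, assuming a hand-maintained catalogue of what each candidate region offers; the capability labels, region names, and the `validate_region_pair` helper are hypothetical and do not come from any provider API.

```python
# Capability categories taken from the prerequisites described in this page.
REQUIRED_CAPABILITIES = frozenset(
    {"kubernetes", "postgresql", "redis", "object_storage", "backup", "observability"}
)


def validate_region_pair(
    primary: str,
    recovery: str,
    catalogue: dict[str, set[str]],
    approved: set[str],
) -> list[str]:
    """Return human-readable problems for a primary/recovery region pair."""
    problems = []
    for region in (primary, recovery):
        if region not in approved:
            problems.append(f"{region}: not approved for this deployment")
        gaps = REQUIRED_CAPABILITIES - catalogue.get(region, set())
        if gaps:
            problems.append(f"{region}: missing {', '.join(sorted(gaps))}")
    return problems


# Hypothetical planning data: region-b has no managed Redis offering.
catalogue = {
    "region-a": set(REQUIRED_CAPABILITIES),
    "region-b": set(REQUIRED_CAPABILITIES) - {"redis"},
}
approved = {"region-a", "region-b"}

print(validate_region_pair("region-a", "region-b", catalogue, approved))
```

An empty result means the pair passes this check only; it does not replace confirmation of the exact regions during deployment planning.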
Regional cloud outage behaviour
During a cloud-region outage, DALP can continue operating through the remaining healthy region only for the surfaces that have been deployed and tested in that second region. If the cloud-hosted node or indexer is single-region, the on-premises DALP estate can stay up, but chain reads, chain writes, and indexed-state freshness depend on restoring RPC and DIDX access.
| Surface | Failover behaviour | RTO expectation | RPO expectation |
|---|---|---|---|
| RPC nodes or external RPC | Route DALP to the healthy RPC endpoint or provider region after health checks fail. | 1 to 10 minutes | Seconds to minutes for endpoint freshness; 0 for on-chain state because the EVM chain remains authoritative. |
| Consortium validators | Surviving validators keep the network healthy only when the consensus design tolerates the failed region. | 1 to 10 minutes | Seconds to minutes, depending on finality and database replication lag. |
| DIDX indexer with healthy RPC | The indexer resumes from the last checkpoint and catches up from chain data. | 1 to 10 minutes | Seconds to minutes for checkpointed indexed state. |
| DIDX full rebuild | Rebuild indexed state from the chain when checkpointed state or the indexed database cannot be trusted. | 5 to 60 minutes | Not applicable to on-chain truth. |
| On-premises DALP services | The application estate stays available if its database, cache, routes, and secrets remain healthy. | 1 to 10 minutes | Seconds to minutes, depending on database, cache, and route failover state. |
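The RPC row above can be sketched as a selection rule: prefer endpoints in configured order, but skip any that fail the probe or lag too far behind the best observed block height. This is an illustration under stated assumptions, not DALP's routing logic. The `pick_rpc_endpoint` function, the `max_lag_blocks` threshold, and the endpoint names are hypothetical; in practice the probe might wrap the standard `eth_blockNumber` JSON-RPC call.

```python
from typing import Callable, Optional, Sequence


def pick_rpc_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], Optional[int]],
    max_lag_blocks: int = 5,
) -> Optional[str]:
    """Pick the first endpoint (in priority order) that is live and fresh.

    probe(url) returns the endpoint's latest block height, or None when the
    health check fails.
    """
    heights = {url: probe(url) for url in endpoints}
    live = {url: h for url, h in heights.items() if h is not None}
    if not live:
        return None  # no healthy endpoint: chain reads and writes are unavailable
    head = max(live.values())
    for url in endpoints:
        if url in live and head - live[url] <= max_lag_blocks:
            return url
    return None


# Hypothetical probe results: the primary is down, the secondary is stale.
observed = {"rpc-primary": None, "rpc-secondary": 100, "rpc-tertiary": 120}
print(pick_rpc_endpoint(list(observed), observed.get))
```

Injecting the probe keeps the selection rule testable without network access.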
Do not treat a healthy blockchain node as proof that the application estate is healthy. Do not treat a healthy application pod as proof that RPC access, indexing, or database recovery can survive a regional incident. Production evidence needs both views: service health from Kubernetes and DALP observability, plus chain health from RPC, DIDX lag, finality, and reorg signals.
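The two-view rule can be expressed as a single predicate that fails when either view fails. The sketch below is hypothetical: the field names and the `max_indexer_lag_blocks` threshold are assumptions chosen for illustration, and real signals would come from Kubernetes, DALP observability, RPC, and DIDX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceHealth:
    """Application-side view (Kubernetes and DALP observability)."""
    pods_ready: bool
    database_reachable: bool
    routes_healthy: bool


@dataclass(frozen=True)
class ChainHealth:
    """Chain-side view (RPC, DIDX lag, finality, reorg signals)."""
    rpc_reachable: bool
    indexer_lag_blocks: int
    finality_ok: bool
    unresolved_reorgs: int


def region_is_healthy(
    service: ServiceHealth,
    chain: ChainHealth,
    max_indexer_lag_blocks: int = 20,  # assumed threshold, tune per chain
) -> bool:
    """A region counts as healthy only when both views are healthy."""
    service_ok = (
        service.pods_ready and service.database_reachable and service.routes_healthy
    )
    chain_ok = (
        chain.rpc_reachable
        and chain.indexer_lag_blocks <= max_indexer_lag_blocks
        and chain.finality_ok
        and chain.unresolved_reorgs == 0
    )
    return service_ok and chain_ok


pods_fine = ServiceHealth(pods_ready=True, database_reachable=True, routes_healthy=True)
indexer_behind = ChainHealth(
    rpc_reachable=True, indexer_lag_blocks=500, finality_ok=True, unresolved_reorgs=0
)
print(region_is_healthy(pods_fine, indexer_behind))
```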
Consortium networks
In a consortium network, hot-hot means several active regions participate in the operating model. Each region runs DALP services, RPC access, PostgreSQL, and any validator infrastructure required by the target network design.
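Validator placement decides whether the network survives the loss of a region. The sketch below assumes a BFT-style consensus protocol in which progress needs at least ceil(2n/3) of the n validators online. That quorum rule is an assumption for illustration; this page does not name the consensus protocol, so confirm the rule for the target network. The function names and region labels are hypothetical.

```python
def quorum(total_validators: int) -> int:
    """Assumed BFT-style quorum: ceil(2n/3), computed with integer arithmetic."""
    return (2 * total_validators + 2) // 3


def regions_that_break_quorum(placement: dict[str, int]) -> list[str]:
    """List regions whose loss would leave fewer validators than the quorum.

    placement maps a region name to the number of validators it hosts.
    """
    total = sum(placement.values())
    needed = quorum(total)
    return [region for region, count in placement.items() if total - count < needed]


# Two regions with two validators each: losing either region stalls the network.
print(regions_that_break_quorum({"region-a": 2, "region-b": 2}))

# Four clusters with one validator each: any single cluster can fail.
print(regions_that_break_quorum({"c1": 1, "c2": 1, "c3": 1, "c4": 1}))
```

Under this assumption, an even split across two regions cannot tolerate a regional outage, which is one reason a consortium design may spread validators over more clusters.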
Recovery targets
| Metric | Target | Notes |
|---|---|---|
| RTO | 1 to 10 minutes | Traffic management shifts users away from an unhealthy region. |
| RPO | Seconds to minutes | Depends on database replication lag and the final failover procedure. |
| Recovery test time | 10 to 60 minutes | Includes health checks, traffic rerouting, and operator validation. |
Setup and maintenance
| Task | Time estimate | Client role |
|---|---|---|
| Four-cluster provisioning | 1 to 2 days | Platform engineer |
| Network connectivity, peering, or VPN | 1 to 2 days | Network engineer |
| CloudNativePG setup across clusters | 1 to 2 days | Platform engineer |
| PostgreSQL distributed topology | 2 to 3 days | DBA or platform engineer |
| Failover automation and testing | 2 to 3 days | Platform engineer |
| End-to-end DR drill | 1 to 2 days | Platform team |
| Initial setup | 3 to 5 weeks | 2 to 3 client engineers |
| Activity | Frequency | Time per cycle |
|---|---|---|
| Cross-cluster replication monitoring | Daily | 30 minutes |
| Backup verification across clusters | Weekly | 2 hours |
| Helm chart updates across clusters | Monthly | 4 to 8 hours |
| DR drill or failover test | Quarterly | 1 to 2 days |
| Security patching across clusters | Monthly | 1 to 2 days |
| Monthly effort | | 40 to 60 hours |
Plan for 1.5 to 2 FTE platform engineers, DBA support, and a 24/7 on-call rotation. The cost is justified only when active regions and low failover time matter more than operating simplicity.
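The itemised maintenance activities can be summed as a rough cross-check of the monthly figure. The conversion factors below are assumptions, not values from this page: 30 days and 52/12 weeks per month, 8 working hours per day, and quarterly work spread evenly over three months.

```python
# Assumed conversion factors (not from the source tables).
DAYS_PER_MONTH = 30
WEEKS_PER_MONTH = 52 / 12
HOURS_PER_DAY = 8

# (activity, occurrences per month, low hours per cycle, high hours per cycle)
CONSORTIUM_ACTIVITIES = [
    ("Cross-cluster replication monitoring", DAYS_PER_MONTH, 0.5, 0.5),
    ("Backup verification across clusters", WEEKS_PER_MONTH, 2, 2),
    ("Helm chart updates across clusters", 1, 4, 8),
    ("DR drill or failover test", 1 / 3, 1 * HOURS_PER_DAY, 2 * HOURS_PER_DAY),
    ("Security patching across clusters", 1, 1 * HOURS_PER_DAY, 2 * HOURS_PER_DAY),
]


def monthly_effort_hours(activities):
    """Sum occurrences times hours for the low and high end of each range."""
    low = sum(count * low_h for _, count, low_h, _ in activities)
    high = sum(count * high_h for _, count, _, high_h in activities)
    return low, high


low, high = monthly_effort_hours(CONSORTIUM_ACTIVITIES)
print(f"Itemised monthly effort: {low:.1f} to {high:.1f} hours")
```

With these assumptions the itemised work comes to roughly 38 to 53 hours per month, in the same range as the 40 to 60 hours in the table, which may also cover coordination and incident work that is not itemised.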
Public EVM networks
For public EVM networks, DALP does not operate the chain validators. The public chain remains the source of truth. DALP keeps user-facing services available across regions and uses RPC plus DIDX indexing to read and rebuild chain-derived state.
What changes from consortium hot-hot
- The operator does not manage validators for the public chain.
- Indexed data can be rebuilt by replaying chain data through DIDX.
- Regional failover focuses on service health, RPC reachability, DIDX sync, and PostgreSQL availability.
- Recovery evidence should include indexed-state checks, not only Kubernetes pod health.
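Replaying chain data after a failover usually starts from the newest checkpoint that still matches the chain. The sketch below illustrates that idea with a simple reorg check. It is not DIDX's implementation: the checkpoint shape (block number mapped to block hash), the `find_resume_block` function, and the sample hashes are all hypothetical.

```python
from typing import Callable, Optional


def find_resume_block(
    checkpoints: dict[int, str],
    chain_hash_at: Callable[[int], Optional[str]],
) -> int:
    """Return the first block number that should be (re)indexed.

    checkpoints maps an indexed block number to the block hash recorded at
    indexing time. chain_hash_at(n) returns the canonical hash of block n,
    or None when the chain has no such block.
    """
    for number in sorted(checkpoints, reverse=True):
        if chain_hash_at(number) == checkpoints[number]:
            return number + 1  # newest checkpoint the chain still agrees with
    return 0  # no trusted checkpoint: rebuild from the first block


# Hypothetical data: block 3 was reorganised after it had been indexed.
canonical_chain = {1: "0xaa", 2: "0xbb", 3: "0xc2", 4: "0xd2"}
stored_checkpoints = {1: "0xaa", 2: "0xbb", 3: "0xcc"}

print(find_resume_block(stored_checkpoints, canonical_chain.get))
```

In this sample the stored hash for block 3 no longer matches the chain, so indexing resumes at block 3 rather than block 4. Returning 0 corresponds to the full re-index case.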
Recovery targets
| Scenario | RTO | RPO | Notes |
|---|---|---|---|
| Single pod failure | Less than 1 minute | 0 | Kubernetes reschedules automatically. |
| Database failover | 1 to 5 minutes | Seconds | CloudNativePG or the managed database service promotes a healthy replica. |
| Cluster failover | 1 to 10 minutes | 1 to 5 minutes | Traffic shifts to a healthy cluster after health checks fail. |
| Full re-index required | 5 to 60 minutes | Not applicable | Timing depends on chain size, RPC throughput, and DIDX backlog. |
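A back-of-the-envelope estimate shows how chain size and throughput drive the re-index window. The figures below are hypothetical inputs, not DALP measurements, and the helper assumes a steady replay rate with no RPC throttling.

```python
def estimate_reindex_minutes(blocks_to_replay: int, blocks_per_second: float) -> float:
    """Estimate replay time in minutes at a steady indexing rate."""
    if blocks_per_second <= 0:
        raise ValueError("blocks_per_second must be positive")
    return blocks_to_replay / blocks_per_second / 60


# Hypothetical example: 1.2 million blocks replayed at 500 blocks per second.
print(estimate_reindex_minutes(1_200_000, 500))  # 40.0
```

Measure the sustained replay rate during a drill rather than relying on a peak figure, because RPC provider limits can lower the effective throughput.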
Setup and maintenance
| Task | Time estimate | Client role |
|---|---|---|
| Two-cluster provisioning | 1 day | Platform engineer |
| CloudNativePG setup across clusters | 1 day | Platform engineer |
| DIDX setup | 1 to 2 days | Platform engineer |
| Global traffic management | 4 to 8 hours | Platform engineer |
| Initial setup | 1.5 to 2 weeks | 1 to 2 client engineers |
| Activity | Frequency | Time per cycle |
|---|---|---|
| Replication-lag monitoring | Daily | 15 minutes |
| DIDX sync verification | Daily | 15 minutes |
| DR drill or failover test | Quarterly | 4 to 8 hours |
| Security patching across clusters | Monthly | 4 to 8 hours |
| Monthly effort | | 20 to 30 hours |
Plan for 0.5 to 1 FTE platform engineer. The model is lighter than consortium hot-hot because the public chain owns consensus, but operators still need clear ownership for RPC health, indexing lag, database promotion, and traffic failover.
Operating checks before production
Before running DALP in hot-hot mode, verify:
- traffic management can remove a failed region without sending users to a partially healthy DALP stack;
- PostgreSQL promotion, backup restore, and point-in-time recovery are tested for the chosen managed or CloudNativePG topology;
- DIDX sync, handler errors, and backfill progress are monitored for every active public-network region;
- DR drills include application checks, database checks, chain/RPC checks, and user-facing route checks;
- one incident owner can decide when to drain a region, promote a database, or rebuild indexed state.
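The first check in the list above can be sketched as a weighting rule: a region receives traffic only when every component it depends on is healthy. The `traffic_weights` function, component names, and region labels are hypothetical, and a real traffic manager would apply its own health-check and weighting configuration.

```python
def traffic_weights(regions: dict[str, dict[str, bool]]) -> dict[str, float]:
    """Split traffic evenly across fully healthy regions and drain the rest.

    regions maps a region name to per-component health flags.
    """
    healthy = [
        name for name, checks in regions.items() if checks and all(checks.values())
    ]
    if not healthy:
        return {name: 0.0 for name in regions}  # nothing is safe to route to
    share = 1.0 / len(healthy)
    return {name: (share if name in healthy else 0.0) for name in regions}


# Hypothetical state: region-b pods are up, but its indexer is unhealthy.
state = {
    "region-a": {"api": True, "postgres": True, "rpc": True, "didx": True},
    "region-b": {"api": True, "postgres": True, "rpc": True, "didx": False},
}
print(traffic_weights(state))
```

Treating a partially healthy region as drained keeps users off a stack that would serve stale or incomplete indexed state.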
Use the observability guide for DIDX and runtime alerting, and the backup and recovery guide for restore-test evidence.