SettleMint

Hot-hot active-active HA

Compare DALP hot-hot and hybrid multi-region deployment patterns for consortium and public EVM networks, including provider patterns, outage behaviour, recovery targets, and when to choose this model.

Hot-hot is DALP's active-active availability pattern. More than one region can serve traffic at the same time, so the design reduces user-facing failover time but raises the operating burden for traffic routing, data consistency, indexed-state reconciliation, and incident ownership.

Choose hot-hot only after comparing the simpler HA patterns: cloud-native, hot-warm, and hot-cold. For many self-hosted deployments, cloud-native multi-zone HA or hot-warm is enough.

Choose the correct hot-hot variant

DALP uses the same active-active idea in two different operating models:

| Variant | Use when | Recovery model | Main operator burden |
| --- | --- | --- | --- |
| Consortium network | You operate validators or validator-adjacent infrastructure across regions. | Keep validators, RPC access, DALP services, and PostgreSQL topology healthy across clusters. | Consensus participation, validator placement, cross-cluster database design, and regional traffic routing. |
| Public EVM network | The chain is external and DALP reads on-chain truth through RPC and DIDX indexing. | Shift user traffic to a healthy cluster and rebuild indexed state from the chain when needed. | RPC availability, DIDX sync health, database failover, and reconciliation after cluster failover. |

Hybrid multi-region deployments

DALP supports hybrid deployments where core platform services remain in an on-premises Kubernetes or OpenShift estate while blockchain access, RPC nodes, validator-adjacent infrastructure, or DIDX indexing capacity runs in cloud regions. The supported cloud families are AWS, Azure, and GCP, provided the selected regions offer the required managed Kubernetes, PostgreSQL, Redis, object storage, backup, and observability services. SettleMint confirms the exact provider regions during deployment planning, because region availability depends on the selected cloud account and regulatory boundary.


Treat the hybrid split as an operating boundary, not only as a network diagram. Each side needs clear ownership, health checks, credentials, route failover, and recovery evidence.

| Surface | What can be split across estates | What must stay aligned |
| --- | --- | --- |
| DALP services | dApp, API, workers, ingress, and observability can run in the primary application cluster or on-premises estate. | Chart values, secrets, PostgreSQL connectivity, Redis connectivity, object storage access, and route health. |
| Blockchain nodes | Consortium deployments can run validators and RPC nodes in separate regions or clusters when the network design supports regional node placement. Public-network deployments use external RPC access instead of operating public-chain validators. | Chain ID, genesis or network configuration, finality assumptions, RPC authentication, provider limits, and failover runbooks. |
| Chain Indexer | DIDX can run beside the DALP services or in a cloud estate with RPC access. It can rebuild chain-derived state from the chain and PostgreSQL checkpoints. | Per-chain checkpoints, block lag, reorg handling, registered contract coverage, and indexed-state validation after failover. |
| Data services | PostgreSQL, cache, object storage, and backups can use managed cloud services or approved self-hosted services. | HA mode, replication lag, backup retention, restore access, and tested application reconnection. |
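
The "What must stay aligned" column lends itself to an automated drift check between estates. The sketch below compares two flat configuration maps; the key names (`chain_id`, `network_config_hash`, `finality_blocks`, `rpc_auth_mode`) are hypothetical stand-ins, not DALP chart values, and a real check would read them from whatever each estate actually deploys.

```python
from typing import Mapping, Sequence

# Hypothetical keys standing in for the "What must stay aligned" column;
# they are not DALP chart value names.
ALIGNED_KEYS: Sequence[str] = (
    "chain_id",
    "network_config_hash",  # e.g. a hash of the genesis or network configuration
    "finality_blocks",
    "rpc_auth_mode",
)


def misaligned_keys(
    estate_a: Mapping[str, object],
    estate_b: Mapping[str, object],
    keys: Sequence[str] = ALIGNED_KEYS,
) -> list[str]:
    """Return keys that are missing from either estate or differ between them."""
    problems = []
    for key in keys:
        if key not in estate_a or key not in estate_b:
            problems.append(key)
        elif estate_a[key] != estate_b[key]:
            problems.append(key)
    return problems


on_prem = {"chain_id": 1337, "network_config_hash": "9f2c", "finality_blocks": 1, "rpc_auth_mode": "token"}
cloud = {**on_prem, "chain_id": 1338}
print(misaligned_keys(on_prem, cloud))  # ['chain_id']
```

A missing key is reported the same way as a mismatch, so an estate that silently drops a setting fails the check instead of passing it.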

Supported cloud provider pattern

Use AWS, Azure, or GCP regions that meet the self-hosting prerequisites. DALP does not require a fixed region list. The deployment must use region pairs or recovery regions that the operator has approved, that are available in the cloud account, and that satisfy the data-residency requirements.

| Provider family | Cloud services used in the pattern | Region requirement |
| --- | --- | --- |
| AWS | EKS or OpenShift, RDS PostgreSQL Multi-AZ, ElastiCache Multi-AZ, S3, CloudWatch, Managed Prometheus, and Managed Grafana. | Choose primary and recovery regions where these services are available and approved for the deployment. |
| Azure | AKS or OpenShift, Azure Database for PostgreSQL Flexible Server with zone-redundant HA, Azure Cache for Redis, Blob Storage, Azure Monitor, and Managed Grafana. | Choose primary and recovery regions where these services are available and approved for the deployment. |
| GCP | GKE or OpenShift, Cloud SQL Regional HA, Memorystore Standard tier, Cloud Storage, Cloud Monitoring, and Cloud Logging. | Choose primary and recovery regions where these services are available and approved for the deployment. |
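
Region approval can be expressed as a small validation step run during deployment planning. This is a sketch under stated assumptions: the `Region` record, the service labels, and the `jurisdiction` field are hypothetical, and service availability must come from the selected cloud account rather than a hard-coded list. The required service families mirror the prerequisites listed for hybrid deployments.

```python
from dataclasses import dataclass

# Service families a region must offer (managed Kubernetes, PostgreSQL, Redis,
# object storage, backup, observability); the labels themselves are hypothetical.
REQUIRED_SERVICES = frozenset(
    {"kubernetes", "postgresql", "redis", "object_storage", "backup", "observability"}
)


@dataclass(frozen=True)
class Region:
    provider: str          # "aws", "azure", or "gcp"
    name: str              # provider region identifier
    services: frozenset    # services confirmed available in the selected cloud account
    jurisdiction: str      # hypothetical label checked against the data-residency rule


def region_pair_issues(primary: Region, recovery: Region,
                       approved: set[str], allowed_jurisdictions: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the pair passes."""
    issues = []
    if (primary.provider, primary.name) == (recovery.provider, recovery.name):
        issues.append("primary and recovery must be different regions")
    for role, region in (("primary", primary), ("recovery", recovery)):
        missing = REQUIRED_SERVICES - region.services
        if missing:
            issues.append(f"{role} {region.name} is missing {', '.join(sorted(missing))}")
        if region.name not in approved:
            issues.append(f"{role} {region.name} is not approved by the operator")
        if region.jurisdiction not in allowed_jurisdictions:
            issues.append(f"{role} {region.name} violates the data-residency requirement")
    return issues


primary = Region("aws", "eu-west-1", REQUIRED_SERVICES, "EU")
recovery = Region("aws", "eu-central-1", REQUIRED_SERVICES - {"redis"}, "EU")
print(region_pair_issues(primary, recovery, {"eu-west-1", "eu-central-1"}, {"EU"}))
```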

Regional cloud outage behaviour

During a cloud-region outage, DALP can continue operating through the remaining healthy region only for the surfaces that have been deployed and tested in that second region. If the cloud-hosted node or indexer is single-region, the on-premises DALP estate can stay up, but chain reads, chain writes, and indexed-state freshness depend on restoring RPC and DIDX access.

| Surface | Failover behaviour | RTO expectation | RPO expectation |
| --- | --- | --- | --- |
| RPC nodes or external RPC | Route DALP to the healthy RPC endpoint or provider region after health checks fail. | 1 to 10 minutes | Seconds to minutes for endpoint freshness; 0 for on-chain state because the EVM chain remains authoritative. |
| Consortium validators | Surviving validators keep the network healthy only when the consensus design tolerates the failed region. | 1 to 10 minutes | Seconds to minutes, depending on finality and database replication lag. |
| DIDX indexer with healthy RPC | The indexer resumes from the last checkpoint and catches up from chain data. | 1 to 10 minutes | Seconds to minutes for checkpointed indexed state. |
| DIDX full rebuild | Rebuild indexed state from the chain when checkpointed state or the indexed database cannot be trusted. | 5 to 60 minutes | Not applicable to on-chain truth. |
| On-premises DALP services | The application estate stays available if its database, cache, routes, and secrets remain healthy. | 1 to 10 minutes | Seconds to minutes, depending on database, cache, and route failover state. |
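
The RPC row implies a selection rule: after a health check fails, route to an endpoint that is reachable, on the expected chain, and reasonably fresh. A minimal sketch, assuming a caller-supplied `probe` that calls the standard Ethereum JSON-RPC methods `eth_chainId` and `eth_blockNumber`; the endpoint URLs and lag threshold are hypothetical. Freshness is measured against the freshest healthy endpoint, so if every endpoint is equally stale this check alone will not notice.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class RpcProbe:
    reachable: bool
    chain_id: Optional[int] = None      # result of eth_chainId
    block_number: Optional[int] = None  # result of eth_blockNumber


def pick_rpc_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], RpcProbe],
    expected_chain_id: int,
    max_lag_blocks: int,
) -> Optional[str]:
    """Return the first endpoint, in preference order, that is reachable, on the
    expected chain, and within max_lag_blocks of the freshest healthy endpoint."""
    healthy = {}
    for url in endpoints:
        result = probe(url)
        if result.reachable and result.chain_id == expected_chain_id and result.block_number is not None:
            healthy[url] = result.block_number
    if not healthy:
        return None  # no usable RPC: escalate instead of routing blind
    head = max(healthy.values())
    for url in endpoints:
        if url in healthy and head - healthy[url] <= max_lag_blocks:
            return url
    return None


observed = {
    "https://rpc.primary.example": RpcProbe(reachable=False),
    "https://rpc.recovery.example": RpcProbe(True, chain_id=1337, block_number=10_480),
    "https://rpc.provider.example": RpcProbe(True, chain_id=1337, block_number=10_500),
}
print(pick_rpc_endpoint(list(observed), observed.__getitem__, 1337, max_lag_blocks=5))
```

Checking the chain ID as well as reachability guards against a misconfigured endpoint that answers quickly but serves a different network.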

Do not treat a healthy blockchain node as proof that the application estate is healthy. Do not treat a healthy application pod as proof that RPC access, indexing, or database recovery can survive a regional incident. Production evidence needs both views: service health from Kubernetes and DALP observability, plus chain health from RPC, DIDX lag, finality, and reorg signals.
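
One way to encode the two-view rule is to refuse a healthy verdict unless both the service signals and the chain signals pass. The field names and the DIDX lag threshold below are hypothetical; in practice they would be fed from Kubernetes readiness, DALP observability, RPC probes, and DIDX metrics.

```python
from dataclasses import dataclass


@dataclass
class RegionSignals:
    # Service view: Kubernetes and DALP observability
    dalp_pods_ready: bool
    database_writable: bool
    cache_reachable: bool
    route_healthy: bool
    # Chain view: RPC, DIDX lag, and reorg signals
    rpc_reachable: bool
    didx_lag_blocks: int
    reorg_alert: bool


def region_verdict(s: RegionSignals, max_didx_lag_blocks: int) -> tuple[bool, list[str]]:
    """A region counts as healthy only when both views pass."""
    service_checks = {
        "DALP pods not ready": s.dalp_pods_ready,
        "database not writable": s.database_writable,
        "cache unreachable": s.cache_reachable,
        "route unhealthy": s.route_healthy,
    }
    failures = [reason for reason, ok in service_checks.items() if not ok]
    if not s.rpc_reachable:
        failures.append("RPC unreachable")
    if s.didx_lag_blocks > max_didx_lag_blocks:
        failures.append(f"DIDX lag {s.didx_lag_blocks} blocks")
    if s.reorg_alert:
        failures.append("reorg alert open")
    return (not failures, failures)


signals = RegionSignals(True, True, True, True, rpc_reachable=True, didx_lag_blocks=240, reorg_alert=False)
print(region_verdict(signals, max_didx_lag_blocks=50))
```

Here every pod is ready, yet the region is still reported unhealthy because indexed state is too far behind the chain.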

Consortium networks

In a consortium network, hot-hot means several active regions participate in the operating model. Each region runs DALP services, RPC access, PostgreSQL, and any validator infrastructure required by the target network design.
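
Whether surviving validators can keep the network healthy after a regional outage is arithmetic over validator placement. The sketch assumes a BFT-style protocol such as QBFT or IBFT 2.0, where block production needs ceil(2n/3) of n validators online; confirm the rule for your target network before relying on it. Region names are hypothetical.

```python
def bft_quorum(validators: int) -> int:
    """Assumed BFT-style quorum: ceil(2n/3) validators must be online."""
    return (2 * validators + 2) // 3  # integer form of ceil(2n/3)


def regions_whose_loss_halts_consensus(placement: dict[str, int]) -> list[str]:
    """Return regions whose validators, if lost together, leave fewer than a quorum."""
    total = sum(placement.values())
    quorum = bft_quorum(total)
    return [region for region, count in placement.items() if total - count < quorum]


print(regions_whose_loss_halts_consensus({"region-a": 2, "region-b": 2}))
print(regions_whose_loss_halts_consensus({"region-a": 2, "region-b": 2, "region-c": 2}))
```

Under this assumption, two regions with two validators each cannot survive losing either region, while three regions with two validators each can lose any one of them.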


Recovery targets

| Metric | Target | Notes |
| --- | --- | --- |
| RTO | 1 to 10 minutes | Traffic management shifts users away from an unhealthy region. |
| RPO | Seconds to minutes | Depends on database replication lag and the final failover procedure. |
| Recovery test time | 10 to 60 minutes | Includes health checks, traffic rerouting, and operator validation. |

Setup and maintenance

| Task | Time estimate | Client role |
| --- | --- | --- |
| Four-cluster provisioning | 1 to 2 days | Platform engineer |
| Network connectivity, peering, or VPN | 1 to 2 days | Network engineer |
| CloudNativePG setup across clusters | 1 to 2 days | Platform engineer |
| PostgreSQL distributed topology | 2 to 3 days | DBA or platform engineer |
| Failover automation and testing | 2 to 3 days | Platform engineer |
| End-to-end DR drill | 1 to 2 days | Platform team |
| Initial setup | 3 to 5 weeks | 2 to 3 client engineers |

| Activity | Frequency | Time per cycle |
| --- | --- | --- |
| Cross-cluster replication monitoring | Daily | 30 minutes |
| Backup verification across clusters | Weekly | 2 hours |
| Helm chart updates across clusters | Monthly | 4 to 8 hours |
| DR drill or failover test | Quarterly | 1 to 2 days |
| Security patching across clusters | Monthly | 1 to 2 days |
| Monthly effort | | 40 to 60 hours |
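
To see which recurring activities dominate, the cadence table can be converted into hours per month. The conversion factors are assumptions, not part of this page: 8-hour working days, 21 working days per month for daily tasks, 52/12 weeks per month, and one third of a quarterly task per month. Under these assumptions the itemized rows alone come to roughly 34 to 49 hours, so keep the documented 40 to 60 hour range as the planning figure and use the sketch only to explore the effect of changing a cadence.

```python
# Assumed conversion factors; the page does not define them.
HOURS_PER_DAY = 8
CYCLES_PER_MONTH = {
    "daily": 21,          # working days per month
    "weekly": 52 / 12,
    "monthly": 1,
    "quarterly": 1 / 3,
}


def monthly_hours(rows):
    """rows: (frequency, low_hours_per_cycle, high_hours_per_cycle) tuples."""
    low = sum(CYCLES_PER_MONTH[freq] * lo for freq, lo, _ in rows)
    high = sum(CYCLES_PER_MONTH[freq] * hi for freq, _, hi in rows)
    return low, high


consortium_rows = [
    ("daily", 0.5, 0.5),                                  # cross-cluster replication monitoring
    ("weekly", 2, 2),                                     # backup verification across clusters
    ("monthly", 4, 8),                                    # Helm chart updates across clusters
    ("quarterly", 1 * HOURS_PER_DAY, 2 * HOURS_PER_DAY),  # DR drill or failover test
    ("monthly", 1 * HOURS_PER_DAY, 2 * HOURS_PER_DAY),    # security patching across clusters
]
low, high = monthly_hours(consortium_rows)
print(f"itemized rows: {low:.1f} to {high:.1f} hours per month")
```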

Plan for 1.5 to 2 FTE platform engineers, DBA support, and a 24/7 on-call rotation. The cost is justified only when active regions and low failover time matter more than operating simplicity.

Public EVM networks

For public EVM networks, DALP does not operate the chain validators. The public chain remains the source of truth. DALP keeps user-facing services available across regions and uses RPC plus DIDX indexing to read and rebuild chain-derived state.


What changes from consortium hot-hot

  • The operator does not manage validators for the public chain.
  • Indexed data can be rebuilt by replaying chain data through DIDX.
  • Regional failover focuses on service health, RPC reachability, DIDX sync, and PostgreSQL availability.
  • Recovery evidence should include indexed-state checks, not only Kubernetes pod health.
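
The difference between resuming from a checkpoint and rebuilding indexed state can be written as a start-up decision. This sketch is hypothetical and is not DIDX's implementation: `canonical_hash_at` stands for a lookup such as the block hash returned by `eth_getBlockByNumber`, `reorg_depth` is an assumed replay window, `start_block` is an assumed first block of interest, and any checkpoint whose block is no longer canonical is treated as untrusted, which is stricter than an indexer that rewinds past the fork.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Checkpoint:
    chain_id: int
    block_number: int
    block_hash: str


def plan_indexer_start(
    checkpoint: Optional[Checkpoint],
    expected_chain_id: int,
    canonical_hash_at: Callable[[int], Optional[str]],
    reorg_depth: int,
    start_block: int,
) -> tuple[str, int]:
    """Return ("resume", block) or ("rebuild", block) for a hypothetical indexer."""
    if checkpoint is None or checkpoint.chain_id != expected_chain_id:
        return ("rebuild", start_block)
    if canonical_hash_at(checkpoint.block_number) != checkpoint.block_hash:
        # The checkpointed block is no longer on the canonical chain, so this
        # simplified policy stops trusting the indexed state.
        return ("rebuild", start_block)
    # Replay the last reorg_depth blocks so a late reorg cannot leave stale rows;
    # this assumes event handlers are idempotent.
    return ("resume", max(start_block, checkpoint.block_number - reorg_depth))


chain = {9_990: "0xaaa", 10_000: "0xbbb"}
cp = Checkpoint(chain_id=1, block_number=10_000, block_hash="0xbbb")
print(plan_indexer_start(cp, 1, chain.get, reorg_depth=12, start_block=5_000))
```

Because the public chain stays authoritative, the rebuild branch loses no on-chain data; it only costs re-indexing time.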

Recovery targets

| Scenario | RTO | RPO | Notes |
| --- | --- | --- | --- |
| Single pod failure | Less than 1 minute | 0 | Kubernetes reschedules automatically. |
| Database failover | 1 to 5 minutes | Seconds | CloudNativePG or the managed database service promotes a healthy replica. |
| Cluster failover | 1 to 10 minutes | 1 to 5 minutes | Traffic shifts to a healthy cluster after health checks fail. |
| Full re-index required | 5 to 60 minutes | Not applicable | Timing depends on chain size, RPC throughput, and DIDX backlog. |
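
For the full re-index row, a back-of-envelope estimate treats the backlog as a race between the indexer and the chain head. A sketch with hypothetical inputs: the indexing rate is whatever your RPC provider and DIDX sustain in practice, and the indexer only catches up if that rate exceeds block production.

```python
def catch_up_minutes(blocks_behind: int, index_blocks_per_second: float,
                     chain_blocks_per_second: float) -> float:
    """Time for an indexer to reach the head while the chain keeps producing blocks."""
    net_rate = index_blocks_per_second - chain_blocks_per_second
    if net_rate <= 0:
        raise ValueError("indexer never catches up: indexing rate must exceed block production")
    return blocks_behind / net_rate / 60


# Hypothetical inputs: 2,000,000 blocks to replay, a measured 1,000 blocks/s
# through the RPC provider, and a chain producing one block every 2 seconds.
print(round(catch_up_minutes(2_000_000, 1_000, 0.5), 1))
```

With these sample inputs the estimate is about 33 minutes; halving the sustained indexing rate roughly doubles it, which is why RPC throughput limits matter as much as chain size.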

Setup and maintenance

| Task | Time estimate | Client role |
| --- | --- | --- |
| Two-cluster provisioning | 1 day | Platform engineer |
| CloudNativePG setup across clusters | 1 day | Platform engineer |
| DIDX setup | 1 to 2 days | Platform engineer |
| Global traffic management | 4 to 8 hours | Platform engineer |
| Initial setup | 1.5 to 2 weeks | 1 to 2 client engineers |

| Activity | Frequency | Time per cycle |
| --- | --- | --- |
| Replication-lag monitoring | Daily | 15 minutes |
| DIDX sync verification | Daily | 15 minutes |
| DR drill or failover test | Quarterly | 4 to 8 hours |
| Security patching across clusters | Monthly | 4 to 8 hours |
| Monthly effort | | 20 to 30 hours |

Plan for 0.5 to 1 FTE platform engineer. The model is lighter than consortium hot-hot because the public chain owns consensus, but operators still need clear ownership for RPC health, indexing lag, database promotion, and traffic failover.

Operating checks before production

Before running DALP in hot-hot mode, verify:

  • traffic management can remove a failed region without sending users to a partially healthy DALP stack;
  • PostgreSQL promotion, backup restore, and point-in-time recovery are tested for the chosen managed or CloudNativePG topology;
  • DIDX sync, handler errors, and backfill progress are monitored for every active public-network region;
  • DR drills include application checks, database checks, chain/RPC checks, and user-facing route checks;
  • one incident owner can decide when to drain a region, promote a database, or rebuild indexed state.
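
The checklist above can double as a release gate that blocks a hot-hot rollout until each item has recorded evidence. The check names and evidence paths below are hypothetical; the point of the sketch is that a missing drill record fails the gate rather than being assumed.

```python
# Keys paraphrase the pre-production checks; the names are hypothetical.
REQUIRED_EVIDENCE = {
    "traffic_drain": "failed region removed without routing users to a partially healthy stack",
    "postgres_recovery": "promotion, backup restore, and point-in-time recovery tested",
    "didx_monitoring": "DIDX sync, handler errors, and backfill progress monitored per region",
    "dr_drill_scope": "drill covered application, database, chain/RPC, and user-facing routes",
    "incident_owner": "named owner who can drain a region, promote a database, or rebuild",
}


def missing_evidence(evidence: dict[str, str]) -> list[str]:
    """Return check names with no recorded evidence (for example a drill report link)."""
    return [name for name in REQUIRED_EVIDENCE if not evidence.get(name, "").strip()]


evidence = {
    "traffic_drain": "drills/route-drain.md",
    "postgres_recovery": "drills/pitr-restore.md",
    "didx_monitoring": "dashboards/didx-sync",
    "dr_drill_scope": "   ",  # blank entries count as missing
}
print(missing_evidence(evidence))
```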

See the Observability page for DIDX and runtime alerting, and the Backup and recovery page for restore-test evidence.
