SettleMint

Hot-warm active-standby

Run one active DALP cluster with a warm standby cluster, replicated PostgreSQL data, staged validator operations, and a manual regional failover runbook.

Related pages: HA overview, Cloud-native HA, Hot-cold HA, Backup and recovery, Self-hosting prerequisites


A hot-warm deployment runs one active DALP cluster and keeps a second cluster ready for promotion. Use this pattern when one region serves production traffic and another region must be able to take over without a full rebuild. The operating model assumes manual failover, so recovery is measured in tens of minutes to a few hours rather than seconds.

Architecture

The active cluster handles user traffic, RPC traffic, and validator operations. The standby cluster keeps infrastructure, secrets, images, and route configuration ready. Standby workloads do not serve production writes until failover. PostgreSQL replication moves application state from the active region to the standby region. Object storage, backups, and observability follow the choices in the self-hosting prerequisites.

Quickstart

Run these checks before you call a standby cluster warm. Replace the namespace and labels with the values used in your installation.

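# Kubernetes contexts for the two clusters and the shared namespace.
export ACTIVE_CONTEXT=dalp-active
export STANDBY_CONTEXT=dalp-standby
export DALP_NAMESPACE=dalp

# Workloads on the active cluster.
kubectl --context "$ACTIVE_CONTEXT" -n "$DALP_NAMESPACE" get pods
# Workloads staged on the standby cluster.
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get pods
# Secrets pre-staged on the standby cluster (lists names, not values).
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get secrets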

A healthy pre-failover result shows the active cluster serving workloads and the standby cluster holding the resources needed for promotion.

# Active cluster
NAME                              READY   STATUS    RESTARTS   AGE
dalp-dapp-6fdbf8f6f8-abc12        1/1     Running   0          3d
dalp-dapi-7f9d7bcb7c-def34        1/1     Running   0          3d
postgresql-primary-1              1/1     Running   0          3d

# Standby cluster
NAME                              READY   STATUS    RESTARTS   AGE
dalp-dapp-6b87c7d45b-ghi56        1/1     Running   0          3d
dalp-dapi-66c76f8b74-jkl78        1/1     Running   0          3d
postgresql-replica-1              1/1     Running   0          3d

Treat this quickstart as an operator readiness check, not as a failover command. The standby workloads can be healthy without serving production writes. Actual failover changes database leadership, validator activity, and traffic routing. Run failover only through a controlled incident procedure.
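The pod listing can also be checked mechanically instead of by eye. The following is a minimal sketch, not a DALP-provided tool: the function names `count_not_ready` and `check_standby_ready` are hypothetical, and the check assumes every pod in the namespace is a long-running workload (a completed job pod would be counted as not ready).

```shell
#!/usr/bin/env bash
# Count pods whose READY column is not n/n or whose STATUS is not Running.
# Reads `kubectl get pods --no-headers` output on stdin.
count_not_ready() {
  awk '{
    split($2, r, "/")
    if (r[1] != r[2] || $3 != "Running") bad++
  } END { print bad + 0 }'
}

# Hypothetical wrapper that uses the variables exported in the quickstart.
check_standby_ready() {
  local bad
  bad=$(kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" \
    get pods --no-headers | count_not_ready)
  if [ "$bad" -eq 0 ]; then
    echo "standby ready"
  else
    echo "standby has $bad pod(s) not ready"
    return 1
  fi
}
```

Call `check_standby_ready` from a scheduled job or a pre-drill checklist. A zero exit status only says the standby pods are ready; it says nothing about replication lag or routing.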

When hot-warm fits

Use hot-warm when all of these conditions are true:

  • You need geographic recovery for a regional outage.
  • You accept an RTO of 30 to 180 minutes.
  • You accept an RPO of 5 to 60 minutes, depending on replication lag and backup posture.
  • You can keep trained operators available for manual promotion and validation.
  • You can pre-stage validator operations, DNS or traffic-manager changes, secrets, and observability in the standby region.

Do not use hot-warm as a substitute for multi-AZ high availability inside one region. Use cloud-native HA when the failure target is a node, availability zone, or managed database failover inside one region.

Recovery metrics

| Metric | Target | What drives the number |
| --- | --- | --- |
| RTO | 30 to 180 minutes | Operator availability, replica promotion, workload start time, DNS or traffic-manager change, and validation time |
| RPO | 5 to 60 minutes | PostgreSQL replication lag, backup frequency, and object-storage replication posture |
| RTT | 1 to 6 hours | Failover execution, application validation, reconciliation checks, and rollback decision time |

The platform does not automate failover in this pattern. Your runbook, staffing model, monitoring, and drills determine whether the deployment meets these targets in production.
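Because the RTO is the sum of its drivers, a drill can be scored by adding up the measured step durations and comparing the total with the target. The sketch below illustrates that arithmetic; the function name `rto_check` and the step timings are hypothetical, not measured DALP values.

```shell
#!/usr/bin/env bash
# Sum per-step durations (minutes) and compare the total with an RTO target.
# Usage: rto_check TARGET_MINUTES STEP_MINUTES...
rto_check() {
  local target=$1 total=0 m
  shift
  for m in "$@"; do
    total=$((total + m))
  done
  if [ "$total" -le "$target" ]; then
    echo "within target: ${total}/${target} min"
  else
    echo "over target: ${total}/${target} min"
    return 1
  fi
}

# Hypothetical drill timings: operator response, replica promotion,
# workload start, traffic switch, validation.
rto_check 180 20 15 25 30 45
```

With these example timings the steps add up to 135 minutes against a 180-minute target. Replace them with the durations recorded in your own drills.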

Production requirements

| Requirement | Production expectation |
| --- | --- |
| Two clusters | Run separate Kubernetes or OpenShift clusters in separate failure domains. Keep cluster versions, namespaces, ingress, and chart configuration aligned. |
| PostgreSQL replication | Use managed cross-region replication or CloudNativePG replication, depending on whether PostgreSQL is managed or self-hosted. Monitor lag continuously. |
| Backups | Use provider-managed backups or Velero and CloudNativePG backups. Verify restore, not only backup creation. |
| Object storage | Use managed object storage or RustFS with S3-compatible configuration. Align retention and replication with the recovery point objective. |
| Secrets and keys | Pre-stage required Kubernetes secrets and key material in the standby cluster through your approved secret-management process. Do not start duplicate signing or validator operations against production traffic. |
| Traffic management | Keep DNS, load balancer, or global traffic-manager changes documented and rehearsed. Set TTLs that match the expected failover window. |
| Observability | Collect metrics, logs, traces, and alerts from both clusters. Alert on standby health, replication lag, backup failures, and expired certificates. |
| Operator runbook | Keep a dated runbook that names the decision owner, promotion steps, validation checks, rollback conditions, and communication path. |
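Replication lag is the requirement that most directly bounds the RPO, so it helps to turn it into a pass or fail signal. The sketch below assumes a self-hosted replica reachable with `psql` inside the `postgresql-replica-1` pod from the quickstart output; the `postgres` user, the `MAX_LAG_SECONDS` variable, and the function names are assumptions. `pg_last_xact_replay_timestamp()` is a built-in PostgreSQL function, but the value it yields also grows while the primary is idle, so treat it as an upper bound rather than an exact lag.

```shell
#!/usr/bin/env bash
# Seconds since the last replayed transaction on a replica, or -1 when the
# server is not in recovery (the function then returns NULL).
LAG_SQL="SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int, -1);"

# Classify a lag value (seconds) against a threshold (seconds).
classify_lag() {
  local lag=$1 max=$2
  if [ "$lag" -lt 0 ]; then
    echo "unknown"
  elif [ "$lag" -le "$max" ]; then
    echo "ok"
  else
    echo "breach"
  fi
}

# Hypothetical wrapper: pod name, database user, and in-pod psql access
# are assumptions about the installation.
standby_lag_status() {
  local lag
  lag=$(kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" \
    exec postgresql-replica-1 -- psql -U postgres -Atc "$LAG_SQL")
  classify_lag "$lag" "${MAX_LAG_SECONDS:-300}"
}
```

The default threshold of 300 seconds matches the lower end of the 5 to 60 minute RPO range. For managed PostgreSQL, use the provider's replica-lag metric instead of the in-pod query.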

Manual failover sequence

  1. Declare the incident and freeze production writes if the active region still accepts traffic.
  2. Confirm the latest usable PostgreSQL replica or backup in the standby region.
  3. Promote the standby PostgreSQL replica according to your managed database or CloudNativePG procedure.
  4. Start the standby DALP workloads that depend on the promoted database.
  5. Enable standby validator operations and confirm the former active validators cannot produce duplicate signatures.
  6. Switch DNS, load balancer, or global traffic-manager routing to the standby cluster.
  7. Validate dApp routes, API routes, RPC access, validator health, observability, and audit evidence.
  8. Keep the former active region isolated until reconciliation confirms whether it can return as standby.

Each step needs a named owner and a stop condition. If PostgreSQL promotion, route validation, or validator health fails, stop the failover and follow the backup-recovery runbook instead of continuing with a partially promoted region.
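The stop-condition rule above can be sketched as a small step runner that executes steps in order and halts at the first failure. This is an illustration of the control flow only: `run_failover_steps` and the `name::command` argument format are hypothetical, and the real commands behind each step come from your own runbook.

```shell
#!/usr/bin/env bash
# Run named steps in order and stop at the first failure.
# Each argument has the form "name::command".
run_failover_steps() {
  local entry name cmd
  for entry in "$@"; do
    name=${entry%%::*}   # text before the first "::"
    cmd=${entry#*::}     # text after the first "::"
    echo "START $name"
    if ! bash -c "$cmd"; then
      echo "STOP $name failed; switch to the backup-recovery runbook"
      return 1
    fi
    echo "DONE $name"
  done
  echo "ALL STEPS DONE"
}

# Example with placeholder commands (replace each with a runbook script):
#   run_failover_steps \
#     "promote-postgres::./promote.sh" \
#     "start-workloads::./start-standby.sh" \
#     "switch-traffic::./switch-dns.sh"
```

A runner like this does not replace the named owner for each step. It only ensures that a failed promotion or validation cannot be skipped silently.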

Operational checks

| Check | Minimum frequency | Evidence to keep |
| --- | --- | --- |
| Replication lag | Daily, plus alerting | Current lag, threshold, and last healthy timestamp |
| Standby workload readiness | Daily | Pod readiness, image versions, required secrets, and pending configuration drift |
| Backup restore test | Weekly for critical data | Restore timestamp, restored object count or database checkpoint, and validation result |
| Certificate and DNS review | Monthly | Expiry dates, DNS TTLs, and active routing target |
| Failover drill | Quarterly | RTO, RPO, failed steps, owner, and remediation items |
| Security patching | Monthly | Patched cluster versions, operator versions, and workload image versions |

A hot-warm design loses value when the standby region drifts. Treat drift as a production incident when drift blocks promotion, breaks recovery evidence, or leaves keys, secrets, certificates, or routes stale.
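Image-version drift is one of the easier kinds of drift to detect automatically. The following sketch compares deployment images between the two clusters. It assumes the workloads are Kubernetes Deployments with matching names in both clusters; `list_images` and `image_drift` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print "deployment image..." lines for one cluster context, sorted by name.
list_images() {
  kubectl --context "$1" -n "$DALP_NAMESPACE" get deployments \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.template.spec.containers[*].image}{"\n"}{end}' \
    | sort
}

# Compare two saved listings; print the differences and return 1 on drift.
image_drift() {
  if diff "$1" "$2" > /dev/null; then
    echo "no drift"
  else
    echo "drift"
    diff "$1" "$2"
    return 1
  fi
}

# Example:
#   list_images "$ACTIVE_CONTEXT"  > /tmp/active-images.txt
#   list_images "$STANDBY_CONTEXT" > /tmp/standby-images.txt
#   image_drift /tmp/active-images.txt /tmp/standby-images.txt
```

Keeping the two listing files also gives you dated evidence for the daily standby readiness check.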

Compliance and audit notes

Availability and processing integrity depend on the failover record, not only on infrastructure. Keep evidence for the incident decision, replica or backup timestamp, data-loss assessment, operator actions, route switch, validation checks, and reconciliation result.

If your deployment handles regulated workloads, align the runbook with the availability, confidentiality, and processing-integrity controls that apply to the environment. Do not document an RTO or RPO commitment externally unless the same target is backed by drills and operational evidence.

Next steps

  • Use self-hosting prerequisites to choose managed or self-hosted PostgreSQL, object storage, observability, and backup services.
  • Use backup and recovery to define restore validation and recovery evidence.
  • Use hot-hot HA only when you need concurrent active regions and can operate the added consensus and traffic-routing complexity.
