DALP hot-warm HA: active-standby with manual failover

Run one active DALP cluster with a warm standby cluster, replicated PostgreSQL data, staged validator operations, and a manual regional failover runbook.

A hot-warm deployment runs one active DALP cluster and keeps a second cluster ready for promotion. Use this pattern when one region serves production traffic and another region needs recovery without a full rebuild. The operating model assumes a manual failover window of roughly 30 to 180 minutes, matching the RTO target listed under Recovery metrics.

Architecture

The active cluster handles user traffic, RPC traffic, and validator operations. The standby cluster keeps infrastructure, secrets, images, and route configuration ready. Standby workloads do not serve production writes until failover. PostgreSQL replication moves application state from the active region to the standby region. Object storage, backups, and observability follow the choices in the self-hosting prerequisites.

Quickstart

Run these checks before you call a standby cluster warm. Replace the namespace and labels with the values used in your installation.

export ACTIVE_CONTEXT=dalp-active
export STANDBY_CONTEXT=dalp-standby
export DALP_NAMESPACE=dalp

kubectl --context "$ACTIVE_CONTEXT" -n "$DALP_NAMESPACE" get pods
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get pods
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get secrets

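A healthy pre-failover result shows the active cluster serving workloads and the standby cluster holding the resources needed for promotion. In the example pod listings that follow, the first is from the active cluster and the second is from the standby cluster. The secrets listing is omitted.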

NAME                              READY   STATUS    RESTARTS   AGE
dalp-dapp-6fdbf8f6f8-abc12        1/1     Running   0          3d
dalp-dapi-7f9d7bcb7c-def34        1/1     Running   0          3d
postgresql-primary-1              1/1     Running   0          3d

NAME                              READY   STATUS    RESTARTS   AGE
dalp-dapp-6b87c7d45b-ghi56        1/1     Running   0          3d
dalp-dapi-66c76f8b74-jkl78        1/1     Running   0          3d
postgresql-replica-1              1/1     Running   0          3d

Treat this quickstart as an operator readiness check, not as a failover command. The standby workloads can be healthy without serving production writes. Actual failover changes database leadership, validator activity, and traffic routing. Run failover only through a controlled incident procedure.
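
The readiness check above can be scripted so that it produces a single number to alert on. The following is a minimal sketch, not part of DALP: `count_unready_pods` is a hypothetical helper, and it assumes the default `kubectl get pods --no-headers` column layout (NAME, READY, STATUS, ...).

```shell
# count_unready_pods: reads `kubectl get pods --no-headers` output on stdin
# and prints how many pods are not fully ready or not in the Running state.
# Hypothetical helper; assumes the default column layout NAME READY STATUS ...
count_unready_pods() {
  awk '
    NF == 0 { next }                   # skip blank lines
    {
      split($2, ready, "/")            # READY column looks like "1/1"
      if (ready[1] != ready[2] || $3 != "Running") bad++
    }
    END { print bad + 0 }
  '
}

# Example on a sample listing, so no cluster is needed to try it.
sample='dalp-dapp-6b87c7d45b-ghi56        1/1     Running   0          3d
dalp-dapi-66c76f8b74-jkl78        0/1     Pending   0          3d
postgresql-replica-1              1/1     Running   0          3d'
printf '%s\n' "$sample" | count_unready_pods
```

Against a real cluster, pipe the listing into the helper: kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get pods --no-headers | count_unready_pods. Pods from completed jobs are counted as unready by this sketch, so filter them out first if your namespace contains them.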

When hot-warm fits

Use hot-warm when all of these conditions are true:

You need geographic recovery for a regional outage.
You accept an RTO of 30 to 180 minutes.
You accept an RPO of 5 to 60 minutes, depending on replication lag and backup posture.
You can keep trained operators available for manual promotion and validation.
You can pre-stage validator operations, DNS or traffic-manager changes, secrets, and observability in the standby region.

Do not use hot-warm as a substitute for multi-AZ high availability inside one region. Use cloud-native HA when the failure target is a node, availability zone, or managed database failover inside one region.

Recovery metrics

Metric	Target	What drives the number
RTO	30 to 180 minutes	Operator availability, replica promotion, workload start time, DNS or traffic-manager change, and validation time
RPO	5 to 60 minutes	PostgreSQL replication lag, backup frequency, and object-storage replication posture
RTT	1 to 6 hours	Failover execution, application validation, reconciliation checks, and rollback decision time

DALP does not automate failover in this pattern. Your runbook, staffing model, monitoring, and drills determine whether the deployment meets these targets in production.
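
As a worked example of how the RTO drivers add up, the sketch below sums per-step durations from a drill and compares the total with the upper RTO target. The step durations are hypothetical placeholders, not measured DALP figures, and `rto_total` is an illustrative helper name.

```shell
# rto_total: sums per-step durations (in minutes) passed as arguments.
# The body runs in a subshell so its variables do not leak into the caller.
rto_total() (
  total=0
  for minutes in "$@"; do
    total=$((total + minutes))
  done
  echo "$total"
)

# Hypothetical drill timings in minutes: operator response, replica promotion,
# workload start, routing change, validation.
RTO_TARGET_MINUTES=180
measured=$(rto_total 15 20 25 10 40)
if [ "$measured" -le "$RTO_TARGET_MINUTES" ]; then
  echo "RTO drill result: ${measured} min (within ${RTO_TARGET_MINUTES} min target)"
else
  echo "RTO drill result: ${measured} min (exceeds ${RTO_TARGET_MINUTES} min target)"
fi
```

With these placeholder values the total is 110 minutes. Replacing them with timings recorded in your own drills shows which step consumes most of the budget.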

Production requirements

Requirement	Production expectation
Two clusters	Run separate Kubernetes or OpenShift clusters in separate failure domains. Keep cluster versions, namespaces, ingress, and chart configuration aligned.
PostgreSQL replication	Use managed cross-region replication or CloudNativePG replication, depending on whether PostgreSQL is managed or self-hosted. Monitor lag continuously.
Backups	Use provider-managed backups, or Velero together with CloudNativePG backups when you self-host. Verify restore, not only backup creation.
Object storage	Use managed object storage or RustFS with S3-compatible configuration. Align retention and replication with the recovery point objective.
Secrets and keys	Pre-stage required Kubernetes secrets and key material in the standby cluster through your approved secret-management process. Do not start duplicate signing or validator operations against production traffic.
Traffic management	Keep DNS, load balancer, or global traffic-manager changes documented and rehearsed. Set TTLs short enough that clients pick up the new route within the expected failover window.
Observability	Collect metrics, logs, traces, and alerts from both clusters. Alert on standby health, replication lag, backup failures, and expired certificates.
Operator runbook	Keep a dated runbook that names the decision owner, promotion steps, validation checks, rollback conditions, and communication path.
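
For the PostgreSQL replication requirement, one way to turn "monitor lag continuously" into a check is to compare the standby's replay delay with the RPO. This is a sketch under assumptions: `lag_within_rpo` and the `STANDBY_PGURL` variable are hypothetical names, and the query in the comment uses PostgreSQL's `pg_last_xact_replay_timestamp()`, which overstates lag when the primary is idle. Managed services and CloudNativePG expose their own lag metrics that may be a better source.

```shell
# lag_within_rpo LAG_SECONDS RPO_MINUTES
# Returns 0 when the lag is within the RPO, 1 when it exceeds it,
# and 2 when no lag value is available (treated as unknown, not healthy).
lag_within_rpo() {
  [ -n "$1" ] || return 2
  awk -v lag="$1" -v rpo_min="$2" 'BEGIN { exit !(lag + 0 <= rpo_min * 60) }'
}

# On a real standby the lag could be read like this (STANDBY_PGURL is an
# assumed connection string, not a DALP setting):
#   lag=$(psql "$STANDBY_PGURL" -At -c \
#     "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")

# Example with literal values, so no database is needed:
# 95.4 seconds of lag against a 5-minute RPO.
if lag_within_rpo 95.4 5; then
  echo "lag OK"
else
  echo "lag exceeds RPO or is unknown"
fi
```

Returning a distinct code for a missing value keeps an unreachable or misconfigured standby from being reported as healthy.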

Manual failover sequence

Declare the incident and freeze production writes if the active region still accepts traffic.
Confirm the latest usable PostgreSQL replica or backup in the standby region.
Promote the standby PostgreSQL replica according to your managed database or CloudNativePG procedure.
Start the standby DALP workloads that depend on the promoted database.
Enable standby validator operations and confirm the former active validators cannot produce duplicate signatures.
Switch DNS, load balancer, or global traffic-manager routing to the standby cluster.
Validate dApp routes, API routes, RPC access, validator health, observability, and audit evidence.
Keep the former active region isolated until reconciliation confirms whether it can return as standby.

Each step needs a named owner and a stop condition. If PostgreSQL promotion, route validation, or validator health fails, stop the failover and follow the backup-recovery runbook instead of continuing with a partially promoted region.
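
The stop-condition rule can be sketched as a small runner that executes steps in order and halts at the first failure. All step names below are hypothetical placeholders with empty bodies. The real promotion, validator, and routing commands depend on your database provider and traffic manager and are not shown here.

```shell
# run_failover_steps: runs each named step function in order and stops at the
# first failure, so a partially promoted region is never pushed further.
run_failover_steps() {
  for step in "$@"; do
    echo "START $step"
    if ! "$step"; then
      echo "STOP  $step failed; follow the backup-recovery runbook"
      return 1
    fi
    echo "DONE  $step"
  done
}

# Hypothetical placeholder steps; replace each body with your real procedure.
freeze_writes()      { true; }
confirm_replica()    { true; }
promote_postgresql() { false; }   # simulates a failed promotion
start_workloads()    { true; }

run_failover_steps freeze_writes confirm_replica promote_postgresql start_workloads \
  || echo "failover halted"
```

In this example the simulated promotion failure stops the run before `start_workloads` is attempted. The START, DONE, and STOP lines also give a simple ordered record of which steps ran.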

Operational checks

Check	Minimum frequency	Evidence to keep
Replication lag	Daily, plus alerting	Current lag, threshold, and last healthy timestamp
Standby workload readiness	Daily	Pod readiness, image versions, required secrets, and pending configuration drift
Backup restore test	Weekly for critical data	Restore timestamp, restored object count or database checkpoint, and validation result
Certificate and DNS review	Monthly	Expiry dates, DNS TTLs, and active routing target
Failover drill	Quarterly	RTO, RPO, failed steps, owner, and remediation items
Security patching	Monthly	Patched cluster versions, operator versions, and workload image versions

A hot-warm design loses value when the standby region drifts. Treat drift as a production incident when drift blocks promotion, breaks recovery evidence, or leaves keys, secrets, certificates, or routes stale.
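
Image-version drift is one of the easier kinds of drift to detect mechanically. The sketch below compares two "name=image" lists, one per cluster. `image_drift`, the file names, and the image tags are hypothetical. The kubectl command in the comment is one possible way to produce the lists and assumes your workloads are Deployments.

```shell
# image_drift ACTIVE_FILE STANDBY_FILE
# Each file holds one "name=image" line per workload. Prints the differing
# lines and returns 1 when drift is found, 0 when the lists match.
image_drift() {
  tmpdir=$(mktemp -d) || return 2
  sort "$1" > "$tmpdir/active"
  sort "$2" > "$tmpdir/standby"
  diff "$tmpdir/active" "$tmpdir/standby"
  rc=$?
  rm -rf "$tmpdir"
  return "$rc"
}

# One possible way to produce each list from a cluster:
#   kubectl --context "$ACTIVE_CONTEXT" -n "$DALP_NAMESPACE" get deployments \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"="}{.spec.template.spec.containers[*].image}{"\n"}{end}'

# Example with hypothetical image tags, so no cluster is needed.
demo=$(mktemp -d)
printf 'dalp-dapp=example/dapp:1.4.0\ndalp-dapi=example/dapi:1.4.0\n' > "$demo/active.txt"
printf 'dalp-dapp=example/dapp:1.4.0\ndalp-dapi=example/dapi:1.3.9\n' > "$demo/standby.txt"
if image_drift "$demo/active.txt" "$demo/standby.txt"; then
  echo "no image drift"
else
  echo "image drift detected"
fi
rm -rf "$demo"
```

The same comparison works for any evidence that can be exported as sorted text, such as secret names or certificate expiry dates.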

Compliance and audit notes

Availability and processing integrity depend on the failover record, not only on infrastructure. Keep evidence for the incident decision, replica or backup timestamp, data-loss assessment, operator actions, route switch, validation checks, and reconciliation result.

If your deployment handles regulated workloads, align the runbook with the availability, confidentiality, and processing-integrity controls that apply to the environment. Do not document an RTO or RPO commitment externally unless the same target is backed by drills and operational evidence.
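
The evidence items listed in this section are easier to keep when operators record them as they go. The following is a minimal sketch of an append-only, timestamped log. `record_evidence`, the step labels, and the sample details are hypothetical, and a regulated environment will likely need a tamper-evident store rather than a local file.

```shell
# record_evidence FILE STEP DETAIL
# Appends one tab-separated line: UTC timestamp, operator, step, detail.
record_evidence() {
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER:-unknown}" "$2" "$3" >> "$1"
}

# Example entries with hypothetical content.
evidence_file=$(mktemp)
record_evidence "$evidence_file" "incident-decision" "failover approved by decision owner"
record_evidence "$evidence_file" "route-switch" "traffic manager pointed at standby"
cat "$evidence_file"
rm -f "$evidence_file"
```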

Next steps

Use self-hosting prerequisites to choose managed or self-hosted PostgreSQL, object storage, observability, and backup services.
Use backup and recovery to define restore validation and recovery evidence.
Use hot-hot HA only when you need concurrent active regions and can operate the added consensus and traffic-routing complexity.