Hot-warm active-standby
Run one active DALP cluster with a warm standby cluster, replicated PostgreSQL data, staged validator operations, and a manual regional failover runbook.
Related pages: HA overview, Cloud-native HA, Hot-cold HA, Backup and recovery, Self-hosting prerequisites
A hot-warm deployment runs one active DALP cluster and keeps a second cluster ready for promotion. Use this pattern when one region serves production traffic and another region needs recovery without a full rebuild. The operating model assumes a manual failover window of 30 to 180 minutes, not an automatic switchover.
Architecture
The active cluster handles user traffic, RPC traffic, and validator operations. The standby cluster keeps infrastructure, secrets, images, and route configuration ready. Standby workloads do not serve production writes until failover. PostgreSQL replication moves application state from the active region to the standby region. Object storage, backups, and observability follow the choices in the self-hosting prerequisites.
Quickstart
Run these checks before you call a standby cluster warm. Replace the namespace and labels with the values used in your installation.
export ACTIVE_CONTEXT=dalp-active
export STANDBY_CONTEXT=dalp-standby
export DALP_NAMESPACE=dalp
kubectl --context "$ACTIVE_CONTEXT" -n "$DALP_NAMESPACE" get pods
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get pods
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get secrets
A healthy pre-failover result shows the active cluster serving workloads and the standby cluster holding the resources needed for promotion.
NAME READY STATUS RESTARTS AGE
dalp-dapp-6fdbf8f6f8-abc12 1/1 Running 0 3d
dalp-dapi-7f9d7bcb7c-def34 1/1 Running 0 3d
postgresql-primary-1 1/1 Running 0 3d

NAME READY STATUS RESTARTS AGE
dalp-dapp-6b87c7d45b-ghi56 1/1 Running 0 3d
dalp-dapi-66c76f8b74-jkl78 1/1 Running 0 3d
postgresql-replica-1 1/1 Running 0 3d
Treat this quickstart as an operator readiness check, not as a failover command. The standby workloads can be healthy without serving production writes. Actual failover changes database leadership, validator activity, and traffic routing. Run failover only through a controlled incident procedure.
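The `get secrets` check is easier to act on when it is compared against the list of secrets the standby needs for promotion. The following is a minimal sketch, not part of the DALP tooling: the function name `missing_secrets` and the secret names in the commented invocation are hypothetical placeholders for your own installation.

```shell
# missing_secrets EXPECTED...
# Reads the actual secret names on stdin (one per line) and prints every
# expected name that is absent. Prints nothing when all are present.
missing_secrets() {
  local actual expected
  actual="$(cat)"
  for expected in "$@"; do
    if ! printf '%s\n' "$actual" | grep -qxF -- "$expected"; then
      echo "$expected"
    fi
  done
}

# Example invocation. The secret names are hypothetical; substitute the
# names your chart configuration requires.
# kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get secrets -o name \
#   | sed 's|^secret/||' \
#   | missing_secrets dalp-db-credentials dalp-tls
```

Empty output means every expected secret exists in the standby namespace. It does not prove that the secret contents are current, so keep the review of key material in your secret-management process.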
When hot-warm fits
Use hot-warm when all of these conditions are true:
- You need geographic recovery for a regional outage.
- You accept an RTO of 30 to 180 minutes.
- You accept an RPO of 5 to 60 minutes, depending on replication lag and backup posture.
- You can keep trained operators available for manual promotion and validation.
- You can pre-stage validator operations, DNS or traffic-manager changes, secrets, and observability in the standby region.
Do not use hot-warm as a substitute for multi-AZ high availability inside one region. Use cloud-native HA when the failure you need to survive is a node failure, an availability-zone outage, or a managed database failover inside one region.
Recovery metrics
| Metric | Target | What drives the number |
|---|---|---|
| RTO | 30 to 180 minutes | Operator availability, replica promotion, workload start time, DNS or traffic-manager change, and validation time |
| RPO | 5 to 60 minutes | PostgreSQL replication lag, backup frequency, and object-storage replication posture |
| RTT | 1 to 6 hours | Failover execution, application validation, reconciliation checks, and rollback decision time |
The platform does not automate this failover. Your runbook, staffing model, monitoring, and drills determine whether the deployment meets these targets in production.
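As a worked example of how a drill turns into RTO and RPO numbers, the sketch below computes both from three timestamps and compares them with the upper bounds in the table. The timestamps are invented for illustration, and the function names are assumptions rather than DALP commands. RPO is approximated here as the gap between the last transaction replayed on the standby and the start of the outage.

```shell
# minutes_between START_EPOCH END_EPOCH: elapsed whole minutes, rounded up.
minutes_between() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# meets_target MEASURED_MINUTES MAX_MINUTES: succeeds when within the target.
meets_target() {
  [ "$1" -le "$2" ]
}

# Hypothetical drill timestamps in epoch seconds.
last_replayed_tx=1699999400   # last transaction replayed on the standby
outage_start=1700000000       # active region stopped serving traffic
traffic_validated=1700005400  # standby validated and serving traffic

rto=$(minutes_between "$outage_start" "$traffic_validated")
rpo=$(minutes_between "$last_replayed_tx" "$outage_start")
echo "RTO=${rto}m RPO=${rpo}m"   # prints: RTO=90m RPO=10m

meets_target "$rto" 180 && echo "RTO within the 180-minute upper bound"
meets_target "$rpo" 60 && echo "RPO within the 60-minute upper bound"
```

Recording the same three timestamps in every drill keeps the results comparable from quarter to quarter.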
Production requirements
| Requirement | Production expectation |
|---|---|
| Two clusters | Run separate Kubernetes or OpenShift clusters in separate failure domains. Keep cluster versions, namespaces, ingress, and chart configuration aligned. |
| PostgreSQL replication | Use managed cross-region replication or CloudNativePG replication, depending on whether PostgreSQL is managed or self-hosted. Monitor lag continuously. |
| Backups | Use provider-managed backups or Velero and CloudNativePG backups. Verify restore, not only backup creation. |
| Object storage | Use managed object storage or RustFS with S3-compatible configuration. Align retention and replication with the recovery point objective. |
| Secrets and keys | Pre-stage required Kubernetes secrets and key material in the standby cluster through your approved secret-management process. Do not start duplicate signing or validator operations against production traffic. |
| Traffic management | Keep DNS, load balancer, or global traffic-manager changes documented and rehearsed. Set TTLs that match the expected failover window. |
| Observability | Collect metrics, logs, traces, and alerts from both clusters. Alert on standby health, replication lag, backup failures, and expired certificates. |
| Operator runbook | Keep a dated runbook that names the decision owner, promotion steps, validation checks, rollback conditions, and communication path. |
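For the replication-lag requirement, a threshold check can look like the sketch below. It assumes a self-hosted replica reachable through `kubectl exec`, a `postgres` database user, and a 300-second threshold. All three are illustrative assumptions, and managed databases expose lag through provider metrics instead. `pg_last_xact_replay_timestamp()` is a standard PostgreSQL function. It returns NULL when nothing has been replayed, and the measured gap also grows while the primary is idle, so treat the number as an upper estimate.

```shell
# lag_status LAG_SECONDS THRESHOLD_SECONDS
# Prints OK or ALERT and returns non-zero on ALERT. An empty or non-numeric
# lag value is treated as an alert so that an unknown state fails closed.
lag_status() {
  local lag="$1" threshold="$2"
  case "$lag" in
    ''|*[!0-9]*)
      echo "ALERT lag unknown"
      return 1
      ;;
  esac
  if [ "$lag" -gt "$threshold" ]; then
    echo "ALERT lag=${lag}s threshold=${threshold}s"
    return 1
  fi
  echo "OK lag=${lag}s threshold=${threshold}s"
}

# Example invocation against the standby replica (pod name taken from the
# quickstart output; user and threshold are assumptions):
# lag_seconds="$(kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" \
#   exec postgresql-replica-1 -- psql -U postgres -Atc \
#   "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int")"
# lag_status "$lag_seconds" 300
```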
Manual failover sequence
1. Declare the incident and freeze production writes if the active region still accepts traffic.
2. Confirm the latest usable PostgreSQL replica or backup in the standby region.
3. Promote the standby PostgreSQL replica according to your managed database or CloudNativePG procedure.
4. Start the standby DALP workloads that depend on the promoted database.
5. Enable standby validator operations and confirm the former active validators cannot produce duplicate signatures.
6. Switch DNS, load balancer, or global traffic-manager routing to the standby cluster.
7. Validate dApp routes, API routes, RPC access, validator health, observability, and audit evidence.
8. Keep the former active region isolated until reconciliation confirms whether it can return as standby.
Each step needs a named owner and a stop condition. If PostgreSQL promotion, route validation, or validator health fails, stop the failover and follow the backup-recovery runbook instead of continuing with a partially promoted region.
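The owner and stop-condition rule can be expressed as a small wrapper that runs the technical steps in order and halts at the first failure. This is a sketch of the control flow only. Every step function and owner label below is a hypothetical placeholder, and each placeholder fails closed until you replace its body with the command from your own runbook.

```shell
# Hypothetical step functions. Each one fails until it is implemented, so
# an unfinished script cannot report a successful failover.
promote_postgresql()        { echo "promote_postgresql: not implemented" >&2; return 1; }
start_standby_workloads()   { echo "start_standby_workloads: not implemented" >&2; return 1; }
enable_standby_validators() { echo "enable_standby_validators: not implemented" >&2; return 1; }
switch_traffic()            { echo "switch_traffic: not implemented" >&2; return 1; }
validate_standby()          { echo "validate_standby: not implemented" >&2; return 1; }

# run_step NAME OWNER COMMAND...
# Logs a UTC timestamp, runs the command, and returns non-zero on failure.
run_step() {
  local name="$1" owner="$2"
  shift 2
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) START ${name} owner=${owner}"
  if "$@"; then
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) DONE ${name}"
  else
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) STOP ${name}: follow the backup-recovery runbook" >&2
    return 1
  fi
}

# run_failover: the && chain stops at the first failed step.
run_failover() {
  run_step promote-postgresql dba-on-call promote_postgresql &&
    run_step start-workloads platform-on-call start_standby_workloads &&
    run_step enable-validators validator-owner enable_standby_validators &&
    run_step switch-traffic network-on-call switch_traffic &&
    run_step validate incident-lead validate_standby
}
```

Incident declaration, replica selection, and isolation of the former active region are left out of the wrapper because they are operator decisions rather than commands.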
Operational checks
| Check | Minimum frequency | Evidence to keep |
|---|---|---|
| Replication lag | Daily, plus alerting | Current lag, threshold, and last healthy timestamp |
| Standby workload readiness | Daily | Pod readiness, image versions, required secrets, and pending configuration drift |
| Backup restore test | Weekly for critical data | Restore timestamp, restored object count or database checkpoint, and validation result |
| Certificate and DNS review | Monthly | Expiry dates, DNS TTLs, and active routing target |
| Failover drill | Quarterly | RTO, RPO, failed steps, owner, and remediation items |
| Security patching | Monthly | Patched cluster versions, operator versions, and workload image versions |
A hot-warm design loses value when the standby region drifts. Treat drift as a production incident when drift blocks promotion, breaks recovery evidence, or leaves keys, secrets, certificates, or routes stale.
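One way to catch image drift during the daily standby readiness check is to list each Deployment's container images in both clusters and compare the two lists. The sketch below assumes the DALP workloads run as Deployments with the same names in both clusters. The function names are illustrative, and the comparison covers images only, not secrets or chart values.

```shell
# list_images CONTEXT NAMESPACE
# Prints "deployment<TAB>images" for every Deployment in the namespace.
list_images() {
  kubectl --context "$1" -n "$2" get deployments \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
}

# image_drift ACTIVE_LIST STANDBY_LIST
# Compares two files produced by list_images. Returns non-zero and prints
# the differing lines ("<" active only, ">" standby only) when they differ.
image_drift() {
  local a b status
  a="$(mktemp)"
  b="$(mktemp)"
  sort "$1" >"$a"
  sort "$2" >"$b"
  if diff "$a" "$b" >/dev/null; then
    status=0
    echo "no drift"
  else
    status=1
    echo "drift detected:"
    diff "$a" "$b" | grep '^[<>]'
  fi
  rm -f "$a" "$b"
  return "$status"
}

# Example invocation:
# list_images "$ACTIVE_CONTEXT" "$DALP_NAMESPACE" >active-images.txt
# list_images "$STANDBY_CONTEXT" "$DALP_NAMESPACE" >standby-images.txt
# image_drift active-images.txt standby-images.txt
```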
Compliance and audit notes
Availability and processing integrity depend on the failover record, not only on infrastructure. Keep evidence for the incident decision, replica or backup timestamp, data-loss assessment, operator actions, route switch, validation checks, and reconciliation result.
If your deployment handles regulated workloads, align the runbook with the availability, confidentiality, and processing-integrity controls that apply to the environment. Do not document an RTO or RPO commitment externally unless the same target is backed by drills and operational evidence.
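Evidence is easier to keep when it is captured at the moment of each action. The following sketch appends one tab-separated line per action to a log file. The function name, the file name, and the field layout are assumptions, and a regulated environment may require a tamper-evident store rather than a local file.

```shell
# record_evidence LOG_FILE STEP DETAIL
# Appends: UTC timestamp, operator, step, detail (tab-separated).
record_evidence() {
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER:-unknown}" "$2" "$3" >>"$1"
}

# Example invocation with hypothetical values:
# record_evidence failover-evidence.tsv incident-decision "declared by incident lead"
# record_evidence failover-evidence.tsv route-switch "traffic manager now targets standby"
```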
Next steps
- Use self-hosting prerequisites to choose managed or self-hosted PostgreSQL, object storage, observability, and backup services.
- Use backup and recovery to define restore validation and recovery evidence.
- Use hot-hot HA only when you need concurrent active regions and can operate the added consensus and traffic-routing complexity.