# Hot-warm active-standby

Source: https://docs.settlemint.com/docs/architecture/self-hosting/high-availability/hot-warm
Run one active DALP cluster with a warm standby cluster, replicated PostgreSQL data, staged validator operations, and a manual regional failover runbook.


Related pages: [HA overview](/docs/architecture/self-hosting/high-availability), [Cloud-native HA](/docs/architecture/self-hosting/high-availability/cloud-native), [Hot-cold HA](/docs/architecture/self-hosting/high-availability/hot-cold), [Backup and recovery](/docs/architecture/self-hosting/high-availability/backup-recovery), [Self-hosting prerequisites](/docs/architecture/self-hosting/prerequisites)

***

A hot-warm deployment runs one active DALP cluster and keeps a second cluster ready for promotion. Use this pattern when one region serves production traffic and another region needs recovery without a full rebuild. The operating model assumes a manual failover window of 30 to 180 minutes, not seconds.

## Architecture [#architecture]

<Mermaid
  chart="`flowchart TB
  gslb(Traffic management)
  subgraph active[&#x22;Active cluster A&#x22;]
    av(Live validators)
    arpc(Active RPC or dApp routes)
    apg(Primary PostgreSQL)
    av --> arpc
  end
  subgraph standby[&#x22;Warm standby cluster B&#x22;]
    wv(Warm validators with staged operations)
    wrpc(Warm RPC or dApp routes)
    rpg(Replica PostgreSQL)
    wv --> wrpc
  end
  gslb -->|Production traffic| active
  apg -->|Continuous replication| rpg
  subgraph failover[&#x22;Manual failover sequence&#x22;]
    f1(1. Stop writes in cluster A)
    f2(2. Promote PostgreSQL replica)
    f3(3. Start standby DALP workloads)
    f4(4. Switch DNS or traffic)
    f5(5. Validate quorum and routes)
    f1 --> f2 --> f3 --> f4 --> f5
  end
`"
/>

The active cluster handles user traffic, RPC traffic, and validator operations. The standby cluster keeps infrastructure, secrets, images, and route configuration ready. Standby workloads do not serve production writes until failover. PostgreSQL replication moves application state from the active region to the standby region. Object storage, backups, and observability follow the choices in the self-hosting prerequisites.

## Quickstart [#quickstart]

Run these checks before you call a standby cluster warm. Replace the namespace and labels with the values used in your installation.

```bash
export ACTIVE_CONTEXT=dalp-active
export STANDBY_CONTEXT=dalp-standby
export DALP_NAMESPACE=dalp

kubectl --context "$ACTIVE_CONTEXT" -n "$DALP_NAMESPACE" get pods
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get pods
kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get secrets
```

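A healthy pre-failover result shows the active cluster serving workloads and the standby cluster holding the resources needed for promotion. In the sample output below, the first listing comes from the active cluster and the second from the standby cluster.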

```text
NAME                              READY   STATUS    RESTARTS   AGE
dalp-dapp-6fdbf8f6f8-abc12        1/1     Running   0          3d
dalp-dapi-7f9d7bcb7c-def34        1/1     Running   0          3d
postgresql-primary-1              1/1     Running   0          3d
```

```text
NAME                              READY   STATUS    RESTARTS   AGE
dalp-dapp-6b87c7d45b-ghi56        1/1     Running   0          3d
dalp-dapi-66c76f8b74-jkl78        1/1     Running   0          3d
postgresql-replica-1              1/1     Running   0          3d
```

Treat this quickstart as an operator readiness check, not as a failover command. The standby workloads can be healthy without serving production writes. Actual failover changes database leadership, validator activity, and traffic routing. Run failover only through a controlled incident procedure.
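The pod listings above can be turned into a pass/fail signal. The following is a minimal sketch and not part of DALP: the helper name `pods_not_ready` is hypothetical, and it assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS). It lists any pod whose ready count differs from its container count or whose status is not `Running`, so completed Job pods would also appear.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name of every pod that is not fully ready and Running.
pods_not_ready() {
  awk 'NF >= 3 {
    split($2, ready, "/")   # READY column, for example "1/1"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Example against a live cluster (commented out so the sketch has no side effects):
# kubectl --context "$STANDBY_CONTEXT" -n "$DALP_NAMESPACE" get pods --no-headers | pods_not_ready
```

Empty output means every listed pod is ready. It says nothing about database leadership or traffic routing.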

## When hot-warm fits [#when-hot-warm-fits]

Use hot-warm when all of these conditions are true:

* You need geographic recovery for a regional outage.
* You accept an RTO of 30 to 180 minutes.
* You accept an RPO of 5 to 60 minutes, depending on replication lag and backup posture.
* You can keep trained operators available for manual promotion and validation.
* You can pre-stage validator operations, DNS or traffic-manager changes, secrets, and observability in the standby region.

Do not use hot-warm as a substitute for multi-AZ high availability inside one region. Use [cloud-native HA](/docs/architecture/self-hosting/high-availability/cloud-native) when the failure target is a node, availability zone, or managed database failover inside one region.

## Recovery metrics [#recovery-metrics]

| Metric | Target            | What drives the number                                                                                            |
| ------ | ----------------- | ----------------------------------------------------------------------------------------------------------------- |
| RTO    | 30 to 180 minutes | Operator availability, replica promotion, workload start time, DNS or traffic-manager change, and validation time |
| RPO    | 5 to 60 minutes   | PostgreSQL replication lag, backup frequency, and object-storage replication posture                              |
| RTT    | 1 to 6 hours      | Failover execution, application validation, reconciliation checks, and rollback decision time                     |

The platform does not automate failover in this pattern. Your runbook, staffing model, monitoring, and drills determine whether the deployment meets these targets in production.
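To compare a drill against these targets, the arithmetic can be sketched as below. The function names and the practice of recording epoch-second timestamps are assumptions for illustration. The ceilings are the upper bounds from the table above, and the timestamps in the example are made up.

```bash
#!/usr/bin/env bash
# Upper bounds from the recovery metrics table, in minutes.
RTO_MAX_MINUTES=180
RPO_MAX_MINUTES=60

# elapsed_minutes START_EPOCH END_EPOCH: whole minutes between two
# epoch-second timestamps, rounded up.
elapsed_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# within_target MEASURED_MINUTES MAX_MINUTES: succeeds when the measurement
# does not exceed the ceiling.
within_target() {
  [ "$1" -le "$2" ]
}

# Hypothetical drill: incident declared at 1700000000, validation finished
# 2700 seconds later.
rto="$(elapsed_minutes 1700000000 1700002700)"
if within_target "$rto" "$RTO_MAX_MINUTES"; then
  echo "RTO ${rto} min: within target"
else
  echo "RTO ${rto} min: target missed"
fi
```

The same helpers apply to RPO if you pass the timestamp of the last replicated transaction and the time the incident was declared, then compare against `RPO_MAX_MINUTES`.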

## Production requirements [#production-requirements]

| Requirement            | Production expectation                                                                                                                                                                                            |
| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Two clusters           | Run separate Kubernetes or OpenShift clusters in separate failure domains. Keep cluster versions, namespaces, ingress, and chart configuration aligned.                                                           |
| PostgreSQL replication | Use managed cross-region replication or CloudNativePG replication, depending on whether PostgreSQL is managed or self-hosted. Monitor lag continuously.                                                           |
| Backups                | Use provider-managed backups or Velero and CloudNativePG backups. Verify restore, not only backup creation.                                                                                                       |
| Object storage         | Use managed object storage or RustFS with S3-compatible configuration. Align retention and replication with the recovery point objective.                                                                         |
| Secrets and keys       | Pre-stage required Kubernetes secrets and key material in the standby cluster through your approved secret-management process. Do not start duplicate signing or validator operations against production traffic. |
| Traffic management     | Keep DNS, load balancer, or global traffic-manager changes documented and rehearsed. Set TTLs that match the expected failover window.                                                                            |
| Observability          | Collect metrics, logs, traces, and alerts from both clusters. Alert on standby health, replication lag, backup failures, and expired certificates.                                                                |
| Operator runbook       | Keep a dated runbook that names the decision owner, promotion steps, validation checks, rollback conditions, and communication path.                                                                              |
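The lag-monitoring expectation can be checked from the standby side. The sketch below assumes a PostgreSQL replica that you can reach directly, with a connection string in a hypothetical `STANDBY_DSN` variable. Managed services usually expose lag through their own metrics instead. It relies on the built-in `pg_last_xact_replay_timestamp()` function, which reports the commit time of the last replayed transaction, so the value also grows when the primary is idle. The 300-second threshold is an assumption that mirrors the lower RPO bound, not a DALP default.

```bash
#!/usr/bin/env bash
# Hypothetical threshold in seconds; align it with your RPO.
LAG_THRESHOLD_SECONDS="${LAG_THRESHOLD_SECONDS:-300}"

# SQL to run on the replica: seconds since the last replayed transaction.
LAG_QUERY="SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int;"

# lag_status LAG_SECONDS: prints "ok" or "breach" and returns non-zero on breach.
lag_status() {
  if [ "$1" -le "$LAG_THRESHOLD_SECONDS" ]; then
    echo "ok"
  else
    echo "breach"
    return 1
  fi
}

# Example against a live replica (commented out; STANDBY_DSN is an assumption):
# lag="$(psql "$STANDBY_DSN" -At -c "$LAG_QUERY")"
# lag_status "$lag"
```

A scheduled job can call `lag_status` and raise an alert on a non-zero exit code.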

## Manual failover sequence [#manual-failover-sequence]

1. Declare the incident and freeze production writes if the active region still accepts traffic.
2. Confirm the latest usable PostgreSQL replica or backup in the standby region, and record its timestamp for the data-loss assessment.
3. Promote the standby PostgreSQL replica according to your managed database or CloudNativePG procedure.
4. Start the standby DALP workloads that depend on the promoted database.
5. Enable standby validator operations and confirm the former active validators cannot produce duplicate signatures.
6. Switch DNS, load balancer, or global traffic-manager routing to the standby cluster.
7. Validate dApp routes, API routes, RPC access, validator health, observability, and audit evidence.
8. Keep the former active region isolated until reconciliation confirms whether it can return as standby.

Each step needs a named owner and a stop condition. If PostgreSQL promotion, route validation, or validator health fails, stop the failover and follow the backup-recovery runbook instead of continuing with a partially promoted region.
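The stop-on-failure rule can be sketched as a small wrapper that records each step and refuses to continue after the first failure. Everything here is illustrative: `run_step`, the `FAILOVER_LOG` path, and the step commands in the commented example are hypothetical placeholders for your own runbook commands.

```bash
#!/usr/bin/env bash
# Hypothetical evidence log; each step appends one UTC-stamped line.
FAILOVER_LOG="${FAILOVER_LOG:-failover-evidence.log}"

# run_step NAME COMMAND [ARGS...]: runs one runbook step, records the outcome,
# and returns non-zero on failure so an && chain stops there.
run_step() {
  local name="$1"
  shift
  local stamp
  stamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"   # time the step started
  if "$@"; then
    echo "$stamp OK $name" >> "$FAILOVER_LOG"
  else
    echo "$stamp FAILED $name" >> "$FAILOVER_LOG"
    return 1
  fi
}

# Example chain (commented out). The three functions are placeholders that you
# would implement with your database, Kubernetes, and DNS tooling:
# run_step "promote-postgresql" promote_replica &&
#   run_step "start-standby-workloads" start_workloads &&
#   run_step "switch-traffic" switch_traffic ||
#   echo "Failover stopped; follow the backup-recovery runbook." >&2
```

The log doubles as part of the operator-action evidence, although it does not replace a named owner for each step.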

## Operational checks [#operational-checks]

| Check                      | Minimum frequency        | Evidence to keep                                                                       |
| -------------------------- | ------------------------ | -------------------------------------------------------------------------------------- |
| Replication lag            | Daily, plus alerting     | Current lag, threshold, and last healthy timestamp                                     |
| Standby workload readiness | Daily                    | Pod readiness, image versions, required secrets, and pending configuration drift       |
| Backup restore test        | Weekly for critical data | Restore timestamp, restored object count or database checkpoint, and validation result |
| Certificate and DNS review | Monthly                  | Expiry dates, DNS TTLs, and active routing target                                      |
| Failover drill             | Quarterly                | RTO, RPO, failed steps, owner, and remediation items                                   |
| Security patching          | Monthly                  | Patched cluster versions, operator versions, and workload image versions               |

A hot-warm design loses value when the standby region drifts. Treat drift as a production incident when drift blocks promotion, breaks recovery evidence, or leaves keys, secrets, certificates, or routes stale.
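Image-version drift is one of the easier signals to automate. The sketch below compares two image inventories and prints the lines that appear in only one of them. The `image_drift` name and the inventory files are hypothetical. The commented `kubectl` command shows one way to produce an inventory, assuming the workloads run as Deployments.

```bash
#!/usr/bin/env bash
# image_drift ACTIVE_FILE STANDBY_FILE: each file holds unique "name<TAB>image"
# lines. Prints lines present in only one file and returns non-zero on drift.
image_drift() {
  local drift
  drift="$(sort "$1" "$2" | uniq -u)"
  if [ -n "$drift" ]; then
    printf '%s\n' "$drift"
    return 1
  fi
}

# One way to build an inventory per cluster (commented out):
# kubectl --context "$ACTIVE_CONTEXT" -n "$DALP_NAMESPACE" get deployments \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}' \
#   > active-images.txt
```

Running the same inventory command with `$STANDBY_CONTEXT` and passing both files to `image_drift` gives a daily check whose output can be kept as readiness evidence.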

## Compliance and audit notes [#compliance-and-audit-notes]

Evidence of availability and processing integrity comes from the failover record, not only from infrastructure. Keep evidence for the incident decision, replica or backup timestamp, data-loss assessment, operator actions, route switch, validation checks, and reconciliation result.

If your deployment handles regulated workloads, align the runbook with the availability, confidentiality, and processing-integrity controls that apply to the environment. Do not document an RTO or RPO commitment externally unless the same target is backed by drills and operational evidence.

## Next steps [#next-steps]

* Use [self-hosting prerequisites](/docs/architecture/self-hosting/prerequisites) to choose managed or self-hosted PostgreSQL, object storage, observability, and backup services.
* Use [backup and recovery](/docs/architecture/self-hosting/high-availability/backup-recovery) to define restore validation and recovery evidence.
* Use [hot-hot HA](/docs/architecture/self-hosting/high-availability/hot-hot) only when you need concurrent active regions and can operate the added consensus and traffic-routing complexity.