Hot-cold backup recovery
Use hot-cold disaster recovery when a self-hosted DALP deployment can accept restore-based recovery, backup-dependent RPO, and multi-hour RTO in exchange for a lower standby cost.
Related pages: High availability overview, Hot-warm active-standby, Backup and recovery, and Self-hosting prerequisites
Hot-cold recovery keeps one DALP environment active. After an incident, operators rebuild the DALP service in a recovery environment from backups, infrastructure-as-code, and a tested restore runbook. Choose hot-cold only when the deployment can tolerate a backup-dependent recovery point and recovery measured in hours.
Hot-cold is a restore pattern, not a live failover pattern. Do not use it for production financial workloads that need near-zero data loss, automatic failover, or recovery measured in minutes.
When hot-cold fits
Hot-cold fits environments where cost matters more than fast recovery. Use it only when all of these conditions hold:
- the environment is development, staging, sandbox, or non-critical production-adjacent
- a multi-hour outage is acceptable during a regional incident
- application state can be restored from PostgreSQL backups and reconciled against the chain
- the operator can rebuild the DALP namespace, secrets, ingress, database, object storage access, monitoring, and RPC configuration from versioned runbooks
Use cloud-native HA for the default self-hosted production baseline. Use hot-warm active-standby when the recovery region must already contain a warm DALP stack and database replica.
Architecture
The active environment serves traffic and writes application state. Backup jobs preserve PostgreSQL data, Kubernetes resources, object storage data, observability data, and configuration history. During an incident, the operator provisions or activates the recovery environment, restores the latest approved backup set, points DALP services at the restored dependencies, and validates chain-facing workflows before reopening service.
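The "latest approved backup set" step can be sketched as a small selection function. This is a minimal illustration, not DALP tooling: the `BackupSet` record, its `approved` flag, and the `select_backup_set` function are hypothetical names, and the sketch assumes the operator already has a catalogue of backups with completion timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class BackupSet:
    """Hypothetical catalogue entry for one backup set (not a DALP type)."""
    backup_id: str
    completed_at: datetime  # when the backup finished successfully
    approved: bool          # passed the operator's own verification


def select_backup_set(
    backups: Sequence[BackupSet],
    incident_at: datetime,
    rpo_target: timedelta,
) -> Optional[Tuple[BackupSet, bool]]:
    """Pick the newest approved backup completed before the incident.

    Returns (backup, meets_target), or None when no approved backup exists.
    meets_target is False when the backup is older than the recovery
    point target, so the operator can escalate instead of silently
    accepting more data loss.
    """
    candidates = [
        b for b in backups if b.approved and b.completed_at <= incident_at
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda b: b.completed_at)
    meets_target = (incident_at - newest.completed_at) <= rpo_target
    return newest, meets_target
```

Returning the flag alongside the backup, rather than rejecting an out-of-target backup outright, keeps the decision with the incident owner: an older backup may still be the only viable restore point.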
Recovery metrics
Hot-cold targets are deployment-specific. Treat the numbers below as planning ranges that must be proven with drills.
| Metric | Planning range | What drives the result |
|---|---|---|
| RTO | 8 to 72 hours | Cluster provisioning, restore speed, image availability, secrets access, DNS or ingress changes, validation |
| RPO | 4 to 24 hours | PostgreSQL backup frequency, WAL retention, object-storage replication, and the last successful backup check |
| RTT | 12 to 96 hours | Full restore, chain reconciliation, indexer catch-up, downstream checks, and incident closure evidence |
Do not publish an RTO or RPO commitment from this pattern alone. The commitment belongs to the deployment, its providers, and the tested operating procedure.
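A drill only proves a target if the achieved values are computed the same way every time. The sketch below shows one way to derive achieved RPO and RTO from three drill timestamps and compare them with the upper bounds of the planning ranges in the table. The field names are hypothetical, and the sketch assumes RTO is measured from incident declaration to service reopening; a deployment may define its measurement points differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict

# Upper bounds of the planning ranges in the table above, in hours.
PLANNING_UPPER_BOUND_HOURS = {"RTO": 72, "RPO": 24}


@dataclass(frozen=True)
class DrillTimestamps:
    """Hypothetical record of the three moments a drill must capture."""
    restore_point_at: datetime      # point in time the data was restored to
    incident_declared_at: datetime  # start of the outage for this drill
    service_reopened_at: datetime   # validation passed, service reopened


def achieved_metrics(drill: DrillTimestamps) -> Dict[str, timedelta]:
    """Compute achieved RPO (data window lost) and RTO (time to reopen)."""
    return {
        "RPO": drill.incident_declared_at - drill.restore_point_at,
        "RTO": drill.service_reopened_at - drill.incident_declared_at,
    }


def within_planning_range(metrics: Dict[str, timedelta]) -> Dict[str, bool]:
    """Flag each metric that stayed at or under its planning upper bound."""
    return {
        name: value <= timedelta(hours=PLANNING_UPPER_BOUND_HOURS[name])
        for name, value in metrics.items()
    }
```

A drill that lands outside the planning range is a finding to record, not a number to adjust.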
Restore sequence
- Declare the incident and stop writes to the affected environment if it is still reachable.
- Select the latest backup set that satisfies the deployment recovery point target.
- Provision or activate the recovery Kubernetes or OpenShift environment.
- Restore PostgreSQL to the selected point in time.
- Restore Kubernetes resources, secrets, configuration, object storage access, and observability components needed by DALP.
- Start DALP services against the restored database and dependency configuration.
- Reconnect RPC endpoints, custody or signing dependencies, and monitoring routes.
- Validate login, API availability, asset reads, transaction submission, event indexing, and audit exports.
- Record the achieved RTO and RPO, then update the runbook if any manual step was missing or slower than expected.
If the environment cannot be rebuilt from Git, backups, and the approved secret-management process, the hot-cold plan is not ready.
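The restore sequence is ordered, and the final step asks for timing evidence. As a rough sketch of how a runbook driver could enforce both, the code below runs steps in order, stops at the first failure, and records how long each completed step took. The step names and the `run_runbook` function are illustrative assumptions; real steps would call the operator's own provisioning and restore tooling.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


@dataclass
class RunbookResult:
    completed: List[Tuple[str, float]] = field(default_factory=list)  # (name, seconds)
    failed_step: Optional[str] = None


def run_runbook(
    steps: Sequence[Step],
    clock: Callable[[], float] = time.monotonic,
) -> RunbookResult:
    """Run steps in order, timing each one, and stop at the first failure.

    Later steps are never attempted after a failure, so services are not
    started against a database that did not restore cleanly.
    """
    result = RunbookResult()
    for name, action in steps:
        started = clock()
        succeeded = action()
        elapsed = clock() - started
        if not succeeded:
            result.failed_step = name
            break
        result.completed.append((name, elapsed))
    return result
```

The per-step durations feed directly into the runbook review: a step that dominates the achieved RTO is the first candidate for automation or pre-provisioning.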
What must be backed up
| Surface | Recovery expectation |
|---|---|
| PostgreSQL | Restore application state to a selected point in time through managed PITR or WAL-backed backups. |
| Kubernetes resources | Recreate namespace resources, services, ingress, config maps, secret references, and persistent data. |
| Object storage | Recover files, exported artefacts, backup payloads, and storage configuration needed by the services. |
| Configuration | Reapply Helm values, environment configuration, ingress settings, and network-specific RPC settings. |
| Observability | Preserve enough logs, metrics, traces, and alert state to review the incident and prove recovery. |
For backup scheduling, PITR expectations, and restore-test checks, use Backup and recovery.
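A backup set is only usable if every surface in the table is represented. The following sketch checks a backup manifest for missing surfaces. The manifest shape (a mapping from surface key to a list of artefact identifiers) and the surface keys are assumptions made for illustration; they do not describe a DALP file format.

```python
from typing import List, Mapping, Sequence

# One key per row of the table above; the key spellings are hypothetical.
REQUIRED_SURFACES = (
    "postgresql",
    "kubernetes_resources",
    "object_storage",
    "configuration",
    "observability",
)


def missing_surfaces(manifest: Mapping[str, Sequence[str]]) -> List[str]:
    """Return required surfaces with no artefact listed in the manifest.

    A surface counts as missing when its key is absent or its artefact
    list is empty.
    """
    return [name for name in REQUIRED_SURFACES if not manifest.get(name)]
```

An empty result means each surface has at least one artefact recorded; it does not prove that the artefacts restore correctly, which only a restore test can show.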
Operator checks
Before approving hot-cold for an environment, confirm that all of these checks already pass:
- the backup job writes to storage outside the failed cluster or failed region
- at least one full restore test has succeeded in an isolated environment
- database restore, namespace restore, object-storage access, and service startup are written as repeatable runbook steps
- secrets and key material can be restored through the approved secret-management process without copying secrets into documentation
- DNS, ingress, TLS certificates, RPC endpoints, and custody or signing provider access are included in the recovery checklist
- monitoring alerts cover backup failure, restore-test age, pod availability, database health, storage access, and RPC availability
- the incident owner knows when to keep the environment offline for reconciliation instead of reopening service quickly
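The restore-test age check in the list above is easy to express as an alert condition. The sketch below is one possible form, assuming the operator records the timestamp of the last successful restore test. The function name and the `max_age` parameter are hypothetical, and the acceptable age is left to the deployment rather than fixed here.

```python
from datetime import datetime, timedelta
from typing import Optional


def restore_test_alert(
    last_success_at: Optional[datetime],
    now: datetime,
    max_age: timedelta,
) -> bool:
    """Return True when the restore-test evidence is missing or too old.

    last_success_at is None when no full restore test has ever succeeded,
    which always alerts because the hot-cold plan is then unproven.
    """
    if last_success_at is None:
        return True
    return (now - last_success_at) > max_age
```

For example, with a deployment-chosen `max_age` of 30 days, a test that succeeded 10 days ago does not alert, while one from 31 days ago does.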
Comparison with other patterns
| Pattern | Best fit | Recovery posture |
|---|---|---|
| Cloud-native | Standard self-hosted production baseline | Multi-zone application placement with managed-service HA. |
| Hot-warm | Regional recovery with a ready standby | Manual promotion of a warm region and replicated data. |
| Hot-cold | Lower-cost recovery for tolerant workloads | Rebuild from backups, runbooks, and provider configuration. |
| Hot-hot | Active-active regional service requirements | Multiple active regions with stronger consistency controls. |
Related pages
Hot-warm active-standby
Run one active DALP cluster with a warm standby cluster, replicated PostgreSQL data, staged validator operations, and a manual regional failover runbook.
Hot-hot active-active HA
Compare DALP hot-hot and hybrid multi-region deployment patterns for consortium and public EVM networks, including provider patterns, outage behaviour, recovery targets, and when to choose this model.