DALP hot-cold high availability: backup-based recovery

Use hot-cold disaster recovery when a self-hosted DALP deployment can accept restore-based recovery, a backup-dependent recovery point objective (RPO), and a multi-hour recovery time objective (RTO) in exchange for a lower standby cost.

Hot-cold recovery keeps one DALP environment active. When that environment fails, operators rebuild the DALP service in a recovery environment from backups, infrastructure-as-code, and a tested restore runbook. Choose hot-cold only when the deployment can tolerate a backup-dependent recovery point and a recovery time measured in hours.

Hot-cold is a restore pattern, not a live failover pattern. Do not use it for production financial workloads that need near-zero data loss, automatic failover, or recovery measured in minutes.

When hot-cold fits

Hot-cold fits environments where cost matters more than fast recovery. Use it only when these conditions are true:

- the environment is a development, staging, sandbox, or non-critical production-adjacent environment
- a multi-hour outage is acceptable during a regional incident
- application state can be restored from PostgreSQL backups and reconciled against the chain
- the operator can rebuild the DALP namespace, secrets, ingress, database, object storage access, monitoring, and RPC configuration from versioned runbooks

Use cloud-native HA for the default self-hosted production baseline. Use hot-warm active-standby when the recovery region must already contain a warm DALP stack and database replica.

Architecture


The active environment serves traffic and writes application state. Backup jobs preserve PostgreSQL data, Kubernetes resources, object storage data, observability data, and configuration history. During an incident, the operator provisions or activates the recovery environment, restores the latest approved backup set, points DALP services at the restored dependencies, and validates chain-facing workflows before reopening service.
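As an illustration of "restores the latest approved backup set", the sketch below picks the newest approved backup taken before the incident and reports whether it falls inside a recovery point target. The `BackupSet` record, its fields, and the sample labels are hypothetical and are not part of DALP; a real deployment would read this information from its backup tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class BackupSet:
    """Hypothetical record for one backup set; not a DALP type."""
    label: str
    completed_at: datetime  # when the backup finished, in UTC
    approved: bool          # True once the deployment's backup check passed


def latest_approved(backups: list[BackupSet], incident_at: datetime) -> Optional[BackupSet]:
    """Return the newest approved backup completed before the incident, if any."""
    candidates = [b for b in backups if b.approved and b.completed_at <= incident_at]
    return max(candidates, key=lambda b: b.completed_at, default=None)


def meets_recovery_point(backup: BackupSet, incident_at: datetime, target: timedelta) -> bool:
    """True when restoring this backup loses no more than the target window."""
    return incident_at - backup.completed_at <= target


# Illustrative data only: two approved nightly backups and one unverified ad hoc backup.
incident = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
backups = [
    BackupSet("nightly-01-09", datetime(2024, 1, 9, 2, 0, tzinfo=timezone.utc), True),
    BackupSet("nightly-01-10", datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc), True),
    BackupSet("adhoc-01-10", datetime(2024, 1, 10, 11, 0, tzinfo=timezone.utc), False),
]
chosen = latest_approved(backups, incident)
```

The unapproved backup is skipped even though it is newer, which mirrors the requirement to restore only a backup set that has passed verification.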

Recovery metrics

Hot-cold targets are deployment-specific. Treat the numbers below as planning ranges that must be proven with drills.

Metric	Planning range	What drives the result
RTO	8 to 72 hours	Cluster provisioning, restore speed, image availability, secrets access, DNS or ingress changes, validation
RPO	4 to 24 hours	PostgreSQL backup frequency, WAL retention, object-storage replication, and the last successful backup check
RTT	12 to 96 hours	Full restore, chain reconciliation, indexer catch-up, downstream checks, and incident closure evidence

Do not publish an RTO or RPO commitment from this pattern alone. The commitment belongs to the deployment, its providers, and the tested operating procedure.
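One way to prove the ranges with drills is to compute the achieved values from drill timestamps and compare them with the planning ranges in the table. The following is a minimal sketch under stated assumptions: the function names and the three timestamps are hypothetical, and achieved RPO is approximated as the gap between the last restorable backup and the start of the outage.

```python
from datetime import datetime, timedelta

# Planning ranges in hours, copied from the table above.
PLANNING_RANGE_HOURS = {"RTO": (8, 72), "RPO": (4, 24)}


def achieved_metrics(
    last_backup_at: datetime,
    outage_start: datetime,
    service_restored: datetime,
) -> dict[str, timedelta]:
    """Derive achieved RTO and RPO from three drill timestamps."""
    return {
        "RTO": service_restored - outage_start,  # time without service
        "RPO": outage_start - last_backup_at,    # data written after the last backup
    }


def classify(metric: str, value: timedelta) -> str:
    """Place an achieved value relative to its planning range."""
    low, high = PLANNING_RANGE_HOURS[metric]
    hours = value.total_seconds() / 3600
    if hours > high:
        return "exceeds planning range"
    if hours < low:
        return "better than planning range"
    return "within planning range"


# Illustrative drill: backup at 00:00, outage at 06:00, service back at 02:00 the next day.
drill = achieved_metrics(
    last_backup_at=datetime(2024, 1, 1, 0, 0),
    outage_start=datetime(2024, 1, 1, 6, 0),
    service_restored=datetime(2024, 1, 2, 2, 0),
)
report = {name: classify(name, value) for name, value in drill.items()}
```

A drill that exceeds the range is a signal to revisit the runbook or the backup schedule before any target is communicated.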

Restore sequence

1. Declare the incident and stop writes to the affected environment when it is still reachable.
2. Select the latest backup set that satisfies the deployment recovery point target.
3. Provision or activate the recovery Kubernetes or OpenShift environment.
4. Restore PostgreSQL to the selected point in time.
5. Restore Kubernetes resources, secrets, configuration, object storage access, and observability components needed by DALP.
6. Start DALP services against the restored database and dependency configuration.
7. Reconnect RPC endpoints, custody or signing dependencies, and monitoring routes.
8. Validate login, API availability, asset reads, transaction submission, event indexing, and audit exports.
9. Record the achieved RTO and RPO, then update the runbook if any manual step was missing or slower than expected.
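The sequence above can be treated as an ordered list of steps that stops at the first failure and records how long each step took, which helps with the final step of updating the runbook. This is a minimal sketch, not DALP tooling: `run_runbook`, `StepResult`, and the placeholder step functions are hypothetical, and real steps would call the deployment's own restore commands.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_runbook(steps: list[tuple[str, Callable[[], None]]]) -> list[StepResult]:
    """Run steps in order; stop at the first failure and keep per-step timings."""
    results: list[StepResult] = []
    for name, action in steps:
        started = time.monotonic()
        try:
            action()
        except Exception as exc:  # a failed step halts the restore for review
            results.append(StepResult(name, False, time.monotonic() - started, str(exc)))
            break
        results.append(StepResult(name, True, time.monotonic() - started))
    return results


# Placeholder actions standing in for real restore commands (assumptions).
def stop_writes() -> None:
    pass


def restore_postgresql() -> None:
    raise RuntimeError("backup archive not reachable")


def start_services() -> None:
    pass


results = run_runbook([
    ("stop writes", stop_writes),
    ("restore PostgreSQL", restore_postgresql),
    ("start DALP services", start_services),
])
```

Stopping at the first failure keeps later steps, such as starting services, from running against a partially restored database.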

If the environment cannot be rebuilt from Git, backups, and the approved secret-management process, the hot-cold plan is not ready.

What must be backed up

Surface	Recovery expectation
PostgreSQL	Restore application state to a selected point in time through managed PITR or WAL-backed backups.
Kubernetes resources	Recreate namespace resources, services, ingress, config maps, secrets references, and persistent data.
Object storage	Recover files, exported artefacts, backup payloads, and storage configuration needed by the services.
Configuration	Reapply Helm values, environment configuration, ingress settings, and network-specific RPC settings.
Observability	Preserve enough logs, metrics, traces, and alert state to review the incident and prove recovery.
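A backup set is only usable for a full restore if it covers every surface in the table. The sketch below checks a backup manifest for missing surfaces. The manifest shape (a mapping from surface name to an artefact reference), the surface keys, and the sample references are assumptions made for illustration.

```python
# Surface names mirror the rows of the table above; the keys are assumptions.
REQUIRED_SURFACES = (
    "postgresql",
    "kubernetes-resources",
    "object-storage",
    "configuration",
    "observability",
)


def missing_surfaces(manifest: dict[str, str]) -> list[str]:
    """Return required surfaces that have no artefact reference in the manifest."""
    return [name for name in REQUIRED_SURFACES if not manifest.get(name)]


# Hypothetical manifest: an empty reference counts as missing.
manifest = {
    "postgresql": "backups/pg/base-2024-01-10",
    "kubernetes-resources": "backups/k8s/2024-01-10.tar.gz",
    "object-storage": "",
    "configuration": "git:deploy-config@a1b2c3d",
}
gaps = missing_surfaces(manifest)
```

Running a check like this after each backup job makes an incomplete backup set visible before an incident rather than during one.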

For backup scheduling, PITR expectations, and restore-test checks, see Backup and recovery.

Operator checks

Before approving hot-cold for an environment, confirm these checks are already true:

- the backup job writes to storage outside the failed cluster or failed region
- at least one full restore test has succeeded in an isolated environment
- database restore, namespace restore, object-storage access, and service startup are written as repeatable runbook steps
- secrets and key material can be restored through the approved secret-management process without copying secrets into documentation
- DNS, ingress, TLS certificates, RPC endpoints, and custody or signing provider access are included in the recovery checklist
- monitoring alerts cover backup failure, restore-test age, pod availability, database health, storage access, and RPC availability
- the incident owner knows when to keep the environment offline for reconciliation instead of reopening service quickly
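The checks above can be recorded as an explicit approval gate that lists every reason an environment is not yet ready. The sketch below is one possible shape: the check names are shorthand for the items in the list, and the maximum restore-test age is a hypothetical policy value that each deployment would set for itself.

```python
from datetime import datetime, timedelta

# Shorthand keys for the checks listed above (names are assumptions).
REQUIRED_CHECKS = (
    "backup_stored_outside_failure_domain",
    "full_restore_test_passed",
    "runbook_steps_repeatable",
    "secrets_restorable_via_approved_process",
    "network_and_provider_access_in_checklist",
    "monitoring_alerts_configured",
    "incident_owner_briefed_on_reconciliation",
)


def blocking_reasons(
    checks: dict[str, bool],
    last_restore_test: datetime,
    now: datetime,
    max_test_age: timedelta,
) -> list[str]:
    """Return the reasons hot-cold cannot be approved; an empty list means ready."""
    reasons = [
        f"check not confirmed: {name}"
        for name in REQUIRED_CHECKS
        if not checks.get(name, False)
    ]
    if now - last_restore_test > max_test_age:
        reasons.append("restore test is older than the allowed age")
    return reasons


# Illustrative input: every check confirmed, but the last restore test is stale.
confirmed = {name: True for name in REQUIRED_CHECKS}
reasons = blocking_reasons(
    confirmed,
    last_restore_test=datetime(2024, 1, 1),
    now=datetime(2024, 6, 1),
    max_test_age=timedelta(days=90),  # hypothetical policy value
)
```

Returning the full list of reasons, rather than a single boolean, gives the approver a concrete set of items to close.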

Comparison with other patterns

Pattern	Best fit	Recovery posture
Cloud-native	Standard self-hosted production baseline	Multi-zone application placement with managed-service HA.
Hot-warm	Regional recovery with a ready standby	Manual promotion of a warm region and replicated data.
Hot-cold	Lower-cost recovery for tolerant workloads	Rebuild from backups, runbooks, and provider configuration.
Hot-hot	Active-active regional service requirements	Multiple active regions with stronger consistency controls.
