Hot-cold backup recovery

Use hot-cold disaster recovery when a self-hosted DALP deployment can accept restore-based recovery, a backup-dependent recovery point objective (RPO), and a multi-hour recovery time objective (RTO) in exchange for a lower standby cost.

Related pages: High availability overview, Hot-warm active-standby, Backup and recovery, and Self-hosting prerequisites


Hot-cold recovery keeps one DALP environment active. Operators rebuild the DALP service in a recovery environment from backups, infrastructure-as-code, and a tested restore runbook. Choose hot-cold only when the deployment can tolerate a backup-dependent recovery point and recovery measured in hours.

Hot-cold is a restore pattern, not a live failover pattern. Do not use it for production financial workloads that need near-zero data loss, automatic failover, or recovery measured in minutes.

When hot-cold fits

Hot-cold fits environments where cost matters more than fast recovery. Use it only when these conditions are true:

  • the environment is a development, staging, sandbox, or non-critical production-adjacent environment
  • a multi-hour outage is acceptable during a regional incident
  • application state can be restored from PostgreSQL backups and reconciled against the chain
  • the operator can rebuild the DALP namespace, secrets, ingress, database, object storage access, monitoring, and RPC configuration from versioned runbooks

Use cloud-native HA for the default self-hosted production baseline. Use hot-warm active-standby when the recovery region must already contain a warm DALP stack and database replica.

Architecture

The active environment serves traffic and writes application state. Backup jobs preserve PostgreSQL data, Kubernetes resources, object storage data, observability data, and configuration history. During an incident, the operator provisions or activates the recovery environment, restores the latest approved backup set, points DALP services at the restored dependencies, and validates chain-facing workflows before reopening service.

Recovery metrics

Hot-cold targets are deployment-specific. Treat the numbers below as planning ranges that must be proven with drills.

| Metric | Planning range | What drives the result |
| --- | --- | --- |
| RTO | 8 to 72 hours | Cluster provisioning, restore speed, image availability, secrets access, DNS or ingress changes, validation |
| RPO | 4 to 24 hours | PostgreSQL backup frequency, WAL retention, object-storage replication, and the last successful backup check |
| RTT | 12 to 96 hours | Full restore, chain reconciliation, indexer catch-up, downstream checks, and incident closure evidence |

Do not publish an RTO or RPO commitment from this pattern alone. The commitment belongs to the deployment, its providers, and the tested operating procedure.
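A drill proves a target only when the achieved values are measured from real timestamps. The following is a minimal sketch of that arithmetic, not part of DALP: the function names, the drill timestamps, and the 6-hour and 12-hour targets are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def achieved_rpo(recovery_point: datetime, incident_start: datetime) -> timedelta:
    """Data-loss window: writes made after recovery_point were not restored."""
    return incident_start - recovery_point


def achieved_rto(incident_start: datetime, service_reopened: datetime) -> timedelta:
    """Outage duration from incident declaration to validated reopening."""
    return service_reopened - incident_start


def meets_target(achieved: timedelta, target: timedelta) -> bool:
    """A target is met only if the achieved value does not exceed it."""
    return achieved <= target


# Hypothetical drill timestamps (UTC), for illustration only.
incident_start = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
recovery_point = datetime(2024, 5, 1, 4, 30, tzinfo=timezone.utc)
service_reopened = datetime(2024, 5, 2, 6, 0, tzinfo=timezone.utc)

rpo = achieved_rpo(recovery_point, incident_start)    # 5 hours 30 minutes
rto = achieved_rto(incident_start, service_reopened)  # 20 hours

# Hypothetical deployment targets: 6 h RPO, 12 h RTO.
print("RPO", rpo, "meets target:", meets_target(rpo, timedelta(hours=6)))
print("RTO", rto, "meets target:", meets_target(rto, timedelta(hours=12)))
```

In this made-up drill the recovery point target is met and the recovery time target is missed, which is the kind of result that should feed back into the runbook before any commitment is stated.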

Restore sequence

  1. Declare the incident and stop writes to the affected environment if it is still reachable.
  2. Select the latest backup set that satisfies the deployment recovery point target.
  3. Provision or activate the recovery Kubernetes or OpenShift environment.
  4. Restore PostgreSQL to the selected point in time.
  5. Restore Kubernetes resources, secrets, configuration, object storage access, and observability components needed by DALP.
  6. Start DALP services against the restored database and dependency configuration.
  7. Reconnect RPC endpoints, custody or signing dependencies, and monitoring routes.
  8. Validate login, API availability, asset reads, transaction submission, event indexing, and audit exports.
  9. Record the achieved RTO and RPO, then update the runbook if any manual step was missing or slower than expected.

If the environment cannot be rebuilt from Git, backups, and the approved secret-management process, the hot-cold plan is not ready.
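Step 2 of the sequence, selecting the backup set, can be sketched as a small selection function. This is an illustrative sketch under stated assumptions: the `BackupSet` record, its `approved` flag, and the backup names are hypothetical and do not describe a DALP API or any specific backup tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class BackupSet:
    """Hypothetical catalogue entry for one backup set."""
    name: str
    completed_at: datetime
    approved: bool  # True if the set passed the last backup check


def select_backup_set(
    backups: Iterable[BackupSet],
    incident_start: datetime,
    rpo_target: timedelta,
) -> Tuple[Optional[BackupSet], bool]:
    """Return the latest approved set taken before the incident, plus a flag
    saying whether restoring it keeps data loss within rpo_target."""
    candidates = [
        b for b in backups if b.approved and b.completed_at <= incident_start
    ]
    if not candidates:
        return None, False
    latest = max(candidates, key=lambda b: b.completed_at)
    within_target = incident_start - latest.completed_at <= rpo_target
    return latest, within_target


# Hypothetical backup catalogue (UTC).
day = datetime(2024, 5, 1, tzinfo=timezone.utc)
backups = [
    BackupSet("pg-0000", day, approved=True),
    BackupSet("pg-0600", day + timedelta(hours=6), approved=True),
    BackupSet("pg-0900", day + timedelta(hours=9), approved=False),  # failed check
]
incident_start = day + timedelta(hours=10)

chosen, within = select_backup_set(backups, incident_start, timedelta(hours=6))
print(chosen.name if chosen else "no usable backup", within)
```

The function returns the newest usable set even when it misses the target, because during a real incident the operator still has to restore something. The flag makes the miss explicit so it can be recorded with the achieved RPO.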

What must be backed up

| Surface | Recovery expectation |
| --- | --- |
| PostgreSQL | Restore application state to a selected point in time through managed PITR or WAL-backed backups. |
| Kubernetes resources | Recreate namespace resources, services, ingress, config maps, secrets references, and persistent data. |
| Object storage | Recover files, exported artefacts, backup payloads, and storage configuration needed by the services. |
| Configuration | Reapply Helm values, environment configuration, ingress settings, and network-specific RPC settings. |
| Observability | Preserve enough logs, metrics, traces, and alert state to review the incident and prove recovery. |

For backup scheduling, PITR expectations, and restore-test checks, use Backup and recovery.
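One way to catch an incomplete backup set before an incident is to compare its manifest against the surfaces listed above. The sketch below assumes a hypothetical manifest format, a mapping from surface name to artefact location. The surface keys and example paths are invented for illustration.

```python
from typing import List, Mapping

# Surface names mirror the table above; the key spelling is an assumption.
REQUIRED_SURFACES = frozenset({
    "postgresql",
    "kubernetes_resources",
    "object_storage",
    "configuration",
    "observability",
})


def missing_surfaces(manifest: Mapping[str, str]) -> List[str]:
    """Return required surfaces that have no entry in the backup manifest."""
    return sorted(REQUIRED_SURFACES - set(manifest))


# Hypothetical manifest for one backup set (paths are placeholders).
manifest = {
    "postgresql": "backups/example/postgres/base.tar",
    "kubernetes_resources": "backups/example/k8s/resources.yaml",
    "configuration": "backups/example/config/values.yaml",
}

print(missing_surfaces(manifest))
```

A non-empty result means the set cannot support a full restore on its own and should not be treated as an approved backup set.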

Operator checks

Before approving hot-cold for an environment, confirm these checks are already true:

  • the backup job writes to storage outside the failed cluster or failed region
  • at least one full restore test has succeeded in an isolated environment
  • database restore, namespace restore, object-storage access, and service startup are written as repeatable runbook steps
  • secrets and key material can be restored through the approved secret-management process without copying secrets into documentation
  • DNS, ingress, TLS certificates, RPC endpoints, and custody or signing provider access are included in the recovery checklist
  • monitoring alerts cover backup failure, restore-test age, pod availability, database health, storage access, and RPC availability
  • the incident owner knows when to keep the environment offline for reconciliation instead of reopening service quickly
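The checks above can be tracked as a simple readiness gate that also flags a stale restore test. This is a minimal sketch: the check names, the 90-day maximum restore-test age, and the dates are hypothetical and should be replaced by the deployment's own policy.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Mapping, Optional


def failed_checks(
    checks: Mapping[str, bool],
    last_restore_test: Optional[datetime],
    now: datetime,
    max_restore_test_age: timedelta,
) -> List[str]:
    """Return every reason the environment is not ready for hot-cold."""
    failures = [name for name, ok in checks.items() if not ok]
    if last_restore_test is None:
        failures.append("no full restore test on record")
    elif now - last_restore_test > max_restore_test_age:
        failures.append("restore test is older than the allowed age")
    return failures


# Hypothetical review state for one environment.
checks = {
    "backup_storage_outside_failure_domain": True,
    "restore_steps_written_as_runbook": True,
    "secrets_restorable_via_approved_process": True,
    "dns_ingress_tls_rpc_custody_in_checklist": False,
    "alerts_cover_backup_and_restore_health": True,
    "incident_owner_knows_when_to_stay_offline": True,
}
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_restore_test = datetime(2024, 1, 15, tzinfo=timezone.utc)

failures = failed_checks(checks, last_restore_test, now, timedelta(days=90))
print("ready" if not failures else failures)
```

An empty list is the only passing result. Any entry, including an outdated restore test, means the hot-cold plan should not be approved yet.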

Comparison with other patterns

| Pattern | Best fit | Recovery posture |
| --- | --- | --- |
| Cloud-native | Standard self-hosted production baseline | Multi-zone application placement with managed-service HA. |
| Hot-warm | Regional recovery with a ready standby | Manual promotion of a warm region and replicated data. |
| Hot-cold | Lower-cost recovery for tolerant workloads | Rebuild from backups, runbooks, and provider configuration. |
| Hot-hot | Active-active regional service requirements | Multiple active regions with stronger consistency controls. |
