Backup and recovery

Backup scope, recovery dependencies, PostgreSQL point-in-time recovery, namespace snapshots, monitoring signals, and disaster recovery drills for self-hosted DALP deployments.

Self-hosted DALP recovery depends on five surfaces: the database, Kubernetes resources, object storage, observability data, and configuration history. Platform operators verify the full set before treating an environment as production ready.

This page is a recovery reference for self-hosted deployments, not an SLA.

System context

The recovery boundary spans live DALP services, their state stores, the Kubernetes namespace, configuration history, observability data, and the external EVM networks used to reconcile restored state. Test recovery in an isolated environment before routing clients back to the restored stack.

Recovery boundary

DALP provides chart-level backup resources and deployment guidance, but the recovery promise belongs to the operated environment. Recovery time objective and recovery point objective values depend on the selected infrastructure, object storage, PostgreSQL setup, restore automation, and runbook staffing. Set those targets for the deployment, then prove them through restore tests.

Do not publish an RTO or RPO as an external commitment until the same target has passed a drill using the target backup location, database restore path, object storage configuration, and route-switch procedure.

What the DALP chart contributes

When backups are enabled, the DALP chart can create the Velero backup storage location and schedule for the release namespace. The default schedule is daily at 02:00, targets DALP-labelled resources, includes persistent volumes through filesystem backup, excludes Kubernetes event resources, and derives the Velero TTL from the configured retention period.

These chart resources give operators a repeatable backup mechanism. They do not prove disaster recovery on their own.

Production evidence still needs each restored surface to work: the database, Kubernetes resources, object storage, application health checks, and reconciliation against the relevant EVM networks.

What gets backed up

| Component | Backup method | Frequency | Retention | Recovery purpose |
|---|---|---|---|---|
| PostgreSQL data | Managed PITR or CNPG WAL shipping to object storage | Continuous | 30 days | Restore application state to a selected point in time. |
| Kubernetes resources | Velero backups, with snapshots when available | Hourly/Daily/Weekly | 48h/7d/30d | Recreate namespace resources after cluster loss or drift. |
| Object storage | Bucket versioning | Automatic | 90 days | Recover files, backups, and exported artifacts. |
| Observability data | Velero backups when self-hosted | Daily | 3 days | Preserve enough telemetry for incident review. |
| Configuration | Helm values in Git | Each committed values update | Indefinite | Rebuild the same deployment shape after an outage. |

Chart-backed backup resources

The DALP chart can create Velero backup resources when backup.enabled is set. The chart configures a BackupStorageLocation and, when scheduled backups are enabled, a Velero Schedule for the release namespace plus any configured additional namespaces.

| Chart setting | Default | Recovery meaning |
|---|---|---|
| backup.enabled | false | Backup resources are opt-in and require a Velero-compatible environment. |
| backup.storage.provider | s3 | Backup storage can use S3-compatible storage, AWS S3, Azure Blob, or GCS. |
| backup.schedule.cron | 0 2 * * * | The DALP chart schedule runs daily at 02:00 when scheduled backups are enabled. |
| backup.retention.days | 30 | Velero backup TTL is derived from this value. |
| backup.includeAllPVCs | true | Velero uses filesystem backup for volumes in the included namespace set. |
| backup.labelSelector.matchLabels.kots.io/app-slug | settlemint-dalp | Backups select DALP-labelled resources instead of every resource in the cluster. |
| backup.schedule.paused | false | Operators can pause the schedule without deleting the backup definition. |

The support chart also carries a Velero schedule for platform support backups. In that chart, the default schedule runs every 4 hours with a 7-day TTL and excludes Kubernetes event resources. Treat the application chart and support chart as separate backup surfaces when you test recovery.
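
As a worked example of the retention arithmetic, the sketch below converts a retention period in days into a Go-style duration string of the kind Velero accepts as a backup TTL. The helper name `velero_ttl` is hypothetical, and the exact string the charts render is an assumption; only the days-to-hours conversion is certain.

```python
def velero_ttl(retention_days: int) -> str:
    """Convert a retention period in days to a Go-style duration string.

    Hypothetical helper: the charts may render the value differently
    (for example "720h" instead of "720h0m0s").
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return f"{retention_days * 24}h0m0s"


# Application chart default: backup.retention.days = 30
print(velero_ttl(30))  # 720h0m0s
# Support chart default: 7-day TTL
print(velero_ttl(7))   # 168h0m0s
```

Comparing the computed value with the TTL on a backup that Velero actually created is a quick way to confirm that the retention setting took effect.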

PostgreSQL PITR

For CloudNativePG deployments:

  • WAL shipping to object storage is continuous.
  • Base backups run daily.
  • Point-in-time recovery can restore to a moment within the retention window.
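
Before starting a restore, it can help to confirm that the requested timestamp is still recoverable. The following is a minimal sketch, assuming the 30-day retention listed for PostgreSQL data; the function name `in_pitr_window` is hypothetical, and in practice the earliest recoverable point also depends on the oldest retained base backup and the WAL archived after it.

```python
from datetime import datetime, timedelta, timezone


def in_pitr_window(target: datetime, now: datetime, retention_days: int = 30) -> bool:
    """Return True if `target` lies inside the retention window ending at `now`."""
    oldest = now - timedelta(days=retention_days)
    return oldest <= target <= now


now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
print(in_pitr_window(datetime(2024, 6, 15, 9, 30, tzinfo=timezone.utc), now))  # True
print(in_pitr_window(datetime(2024, 5, 1, 0, 0, tzinfo=timezone.utc), now))    # False
```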

Namespace snapshots

Velero can use CSI snapshots when a compatible CSI driver and VolumeSnapshot CRDs are installed. Without that support, Velero performs file-level backups.

Recovery checks

Run restore tests against an isolated environment, not the production namespace. A usable drill confirms these facts:

  1. PostgreSQL restores to the selected timestamp inside the PITR window.
  2. Kubernetes resources restore with the expected secrets, config maps, services, ingress, and persistent volumes.
  3. Object storage data required by the restored environment is present at the expected version.
  4. DALP services start against the restored database and configuration.
  5. Operators record the achieved RTO and RPO, then compare them with the deployment target.

Treat the checks above as the minimum drill loop. A restore that stops at database recovery is incomplete until DALP services start, indexed state is reconciled with chain state, and the measured recovery time is recorded against the target.

If a restore test needs manual changes that are not in Git or the runbook, treat the drill as failed until the missing step is documented and repeated.
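
The RTO and RPO comparison in the last check can be recorded as simple timestamp arithmetic. The sketch below is illustrative only: the `DrillResult` class, its field names, and the 4-hour and 15-minute targets are assumptions, not DALP defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    incident_at: datetime          # simulated moment of failure
    restored_to: datetime          # point in time the database was restored to
    services_healthy_at: datetime  # when application health checks passed

    @property
    def achieved_rpo(self) -> timedelta:
        # Data written between the restore point and the incident is lost.
        return self.incident_at - self.restored_to

    @property
    def achieved_rto(self) -> timedelta:
        # Time from the incident until services were healthy again.
        return self.services_healthy_at - self.incident_at

    def meets(self, target_rto: timedelta, target_rpo: timedelta) -> bool:
        return self.achieved_rto <= target_rto and self.achieved_rpo <= target_rpo


drill = DrillResult(
    incident_at=datetime(2024, 6, 30, 12, 0),
    restored_to=datetime(2024, 6, 30, 11, 55),
    services_healthy_at=datetime(2024, 6, 30, 13, 30),
)
print(drill.achieved_rpo)  # 0:05:00
print(drill.achieved_rto)  # 1:30:00
print(drill.meets(timedelta(hours=4), timedelta(minutes=15)))  # True
```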

Restore evidence to keep

| Evidence | Why it matters |
|---|---|
| Backup name and creation time | Shows which recovery point was used. |
| Restore target timestamp | Lets operators compare the intended RPO with the achieved restore point. |
| Database checkpoint or WAL end | Confirms the database restored to the expected point before services started. |
| Restored namespace inventory | Confirms workloads, services, secrets, ingress, and persistent volumes exist. |
| Object version check | Confirms uploaded files and exported artifacts match the restored environment. |
| Application health checks | Confirms DALP services can read the restored state and serve traffic. |
| Reconciliation result | Confirms indexed, off-chain, and on-chain state are consistent enough to run. |
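
One way to keep this evidence consistent across drills is to check each drill record for gaps before it is filed. The sketch below is a hypothetical helper; the key names are made up for illustration and simply mirror the evidence items listed above.

```python
REQUIRED_EVIDENCE = (
    "backup_name_and_creation_time",
    "restore_target_timestamp",
    "database_checkpoint_or_wal_end",
    "restored_namespace_inventory",
    "object_version_check",
    "application_health_checks",
    "reconciliation_result",
)


def missing_evidence(record: dict) -> list:
    """Return the required evidence keys that are absent or empty in `record`."""
    return [key for key in REQUIRED_EVIDENCE if not record.get(key)]


# Illustrative, incomplete drill record.
record = {
    "backup_name_and_creation_time": "example-backup @ 2024-06-30T02:00Z",
    "restore_target_timestamp": "2024-06-30T11:55Z",
    "application_health_checks": "passed",
}
for key in missing_evidence(record):
    print("missing:", key)
```

A drill record with any missing entry would not count as complete evidence under this sketch.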

For Velero filesystem backups, include the pod resources in the restore. Restoring only persistent volume claims can recreate empty volumes, because the node agent writes the backed-up data into a volume only while a restored pod that mounts it is held in Velero's restore-wait init container.

Monitoring

Key metrics to monitor

  • Pod availability for application health.
  • Pod restart counts for service stability.
  • PostgreSQL replication lag for data consistency.
  • Backup success and failure counts for recoverability.
  • Certificate expiration for TLS continuity.

Alert thresholds

  • Treat any backup failure as critical.
  • Warn when replication lag exceeds 60 seconds.
  • Warn when pod restarts exceed 5 per hour.
  • Warn when a certificate expires in less than 14 days.
  • Warn when disk usage exceeds 80%.
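
The thresholds above can be expressed as a small evaluation function. This is a sketch only: the metric key names are hypothetical, and a real deployment would normally encode these rules in its alerting system instead of application code.

```python
def evaluate_alerts(metrics: dict) -> list:
    """Return (severity, message) tuples for every threshold that is breached."""
    alerts = []
    if metrics["backup_failures"] > 0:
        alerts.append(("critical", "backup failure"))
    if metrics["replication_lag_seconds"] > 60:
        alerts.append(("warning", "replication lag above 60 seconds"))
    if metrics["pod_restarts_per_hour"] > 5:
        alerts.append(("warning", "more than 5 pod restarts per hour"))
    if metrics["cert_days_to_expiry"] < 14:
        alerts.append(("warning", "certificate expires in less than 14 days"))
    if metrics["disk_usage_percent"] > 80:
        alerts.append(("warning", "disk usage above 80%"))
    return alerts


# Illustrative sample values, not real measurements.
sample = {
    "backup_failures": 0,
    "replication_lag_seconds": 75,
    "pod_restarts_per_hour": 2,
    "cert_days_to_expiry": 30,
    "disk_usage_percent": 82,
}
for severity, message in evaluate_alerts(sample):
    print(severity, message)
```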

DR testing requirements

Quarterly DR drills

  1. Restore from backup to a test environment.
  2. Verify data integrity.
  3. Test application functionality.
  4. Document recovery time.
  5. Update runbooks if needed.

Annual full DR test

  1. Simulate cluster failure.
  2. Execute the full recovery procedure.
  3. Measure actual RTO/RPO against the deployment targets.
  4. Report the result to the accountable operating team.
  5. Update the SLA or operating target if needed.
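
To keep both cadences visible, the date of the last completed drill can be checked against a maximum interval. The sketch below assumes 92 days for "quarterly" and 366 days for "annual"; the function name and both limits are illustrative choices, not values from the platform.

```python
from datetime import date

QUARTERLY_DAYS = 92  # assumed upper bound for one quarter
ANNUAL_DAYS = 366    # assumed upper bound for one year


def drill_overdue(last_drill: date, today: date, max_interval_days: int) -> bool:
    """Return True if more than `max_interval_days` have passed since `last_drill`."""
    return (today - last_drill).days > max_interval_days


today = date(2024, 7, 1)
print(drill_overdue(date(2024, 4, 15), today, QUARTERLY_DAYS))  # False
print(drill_overdue(date(2023, 5, 1), today, ANNUAL_DAYS))      # True
```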
