Backup and recovery
Backup scope, recovery dependencies, PostgreSQL point-in-time recovery, namespace snapshots, monitoring signals, and disaster recovery drills for self-hosted DALP deployments.
Self-hosted DALP recovery depends on five surfaces: the database, Kubernetes resources, object storage, observability data, and configuration history. Platform operators verify the full set before treating an environment as production ready.
This page is a recovery reference for self-hosted deployments, not an SLA.
System context
The recovery boundary spans live DALP services, their state stores, the Kubernetes namespace, configuration history, observability data, and the external EVM networks used to reconcile restored state. Test recovery in an isolated environment before routing clients back to the restored stack.
Recovery boundary
DALP provides chart-level backup resources and deployment guidance, but the recovery promise belongs to the operated environment. Recovery time objective and recovery point objective values depend on the selected infrastructure, object storage, PostgreSQL setup, restore automation, and runbook staffing. Set those targets for the deployment, then prove them through restore tests.
Do not publish an RTO or RPO as an external commitment until the same target has passed a drill using the target backup location, database restore path, object storage configuration, and route-switch procedure.
What the DALP chart contributes
When backups are enabled, the DALP chart can create the Velero backup storage location and schedule for the release namespace. The default schedule is daily at 02:00, targets DALP-labelled resources, includes persistent volumes through filesystem backup, excludes Kubernetes event resources, and derives the Velero TTL from the configured retention period.
These chart resources give operators a repeatable backup mechanism. They do not prove disaster recovery on their own.
Production evidence still requires proof that each restored surface works: the database, Kubernetes resources, object storage, application health checks, and reconciliation against the relevant EVM networks.
What gets backed up
| Component | Backup method | Frequency | Retention | Recovery purpose |
|---|---|---|---|---|
| PostgreSQL data | Managed PITR or CNPG WAL shipping to object storage | Continuous | 30 days | Restore application state to a selected point in time. |
| Kubernetes resources | Velero backups, with snapshots when available | Hourly, daily, and weekly | 48 hours, 7 days, and 30 days respectively | Recreate namespace resources after cluster loss or drift. |
| Object storage | Bucket versioning | Automatic | 90 days | Recover files, backups, and exported artifacts. |
| Observability data | Velero backups when self-hosted | Daily | 3 days | Preserve enough telemetry for incident review. |
| Configuration | Helm values in Git | Each committed values update | Indefinite | Rebuild the same deployment shape after an outage. |
Chart-backed backup resources
The DALP chart can create Velero backup resources when backup.enabled is set to true. The chart configures a BackupStorageLocation and, when scheduled backups are enabled, a Velero Schedule for the release namespace plus any configured additional namespaces.
| Chart setting | Default | Recovery meaning |
|---|---|---|
| backup.enabled | false | Backup resources are opt-in and require a Velero-compatible environment. |
| backup.storage.provider | s3 | Backup storage can use S3-compatible storage, AWS S3, Azure Blob, or GCS. |
| backup.schedule.cron | 0 2 * * * | The DALP chart schedule runs daily at 02:00 when scheduled backups are enabled. |
| backup.retention.days | 30 | Velero backup TTL is derived from this value. |
| backup.includeAllPVCs | true | Velero uses filesystem backup for volumes in the included namespace set. |
| backup.labelSelector.matchLabels.kots.io/app-slug | settlemint-dalp | Backups select DALP-labelled resources instead of every resource in the cluster. |
| backup.schedule.paused | false | Operators can pause the schedule without deleting the backup definition. |
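The table says the Velero TTL is derived from backup.retention.days but does not show the conversion. The sketch below illustrates one plausible derivation, assuming the chart multiplies days by 24 and emits a Go-style duration string, which is the format Velero uses for TTL values. The function name velero_ttl is hypothetical and is not part of the chart.

```python
def velero_ttl(retention_days: int) -> str:
    """Convert a retention period in days to a Go-style duration string.

    Assumption: one day is treated as exactly 24 hours.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return f"{retention_days * 24}h0m0s"


# The chart default of 30 days corresponds to a 720-hour TTL.
print(velero_ttl(30))  # 720h0m0s
```

Compare the rendered Schedule in the deployed release against this expectation rather than assuming the conversion holds for every chart version.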
The support chart also carries a Velero schedule for platform support backups. In that chart, the default schedule runs every 4 hours with a 7-day TTL and excludes Kubernetes event resources. Treat the application chart and support chart as separate backup surfaces when you test recovery.
PostgreSQL PITR
For CloudNativePG deployments:
- WAL shipping to object storage is continuous.
- Base backups run daily.
- Point-in-time recovery can restore to a moment within the retention window.
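Before starting a restore, it can help to check that the requested timestamp is actually recoverable. The sketch below is illustrative only: it assumes a target is restorable when it falls after the end of the oldest retained base backup, inside the retention window, and not in the future. The function and parameter names are hypothetical, and the 30-day default mirrors the retention listed in the backup table.

```python
from datetime import datetime, timedelta, timezone


def pitr_target_is_restorable(
    target: datetime,
    now: datetime,
    oldest_base_backup_end: datetime,
    retention_days: int = 30,
) -> bool:
    """Return True when the target lies inside the assumed PITR window."""
    # The window opens at whichever is later: the oldest base backup
    # or the start of the retention period.
    window_start = max(oldest_base_backup_end, now - timedelta(days=retention_days))
    return window_start <= target <= now


now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
oldest = datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc)
target = datetime(2024, 6, 15, 9, 30, tzinfo=timezone.utc)
print(pitr_target_is_restorable(target, now, oldest))  # True
```

A passing check only says the timestamp is inside the window. The restore drill is still what proves that the WAL archive is complete.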
For volume backups outside the PostgreSQL PITR path, Velero can use CSI snapshots when a compatible CSI driver and the VolumeSnapshot CRDs are installed. Without that support, Velero performs file-level backups.
Recovery checks
Run restore tests against an isolated environment, not the production namespace. A usable drill confirms these facts:
- PostgreSQL restores to the selected timestamp inside the PITR window.
- Kubernetes resources restore with the expected secrets, config maps, services, ingress, and persistent volumes.
- Object storage data required by the restored environment is present at the expected version.
- DALP services start against the restored database and configuration.
- Operators record the achieved RTO and RPO, then compare them with the deployment target.
Treat the checks above as the minimum drill loop. A restore that stops at database recovery is incomplete until DALP services start, indexed state is reconciled with chain state, and the measured recovery time is recorded against the target.
If a restore test needs manual changes that are not in Git or the runbook, treat the drill as failed until the missing step is documented and repeated.
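The comparison between achieved and target values can be made mechanical. The sketch below is one way to score a drill, assuming RPO is measured from the restore point to the incident and RTO from the incident to the moment services are ready. All names (DrillResult, score_drill, undocumented_steps) are hypothetical and serve only to illustrate the pass and fail rules described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    achieved_rpo: timedelta
    achieved_rto: timedelta
    passed: bool


def score_drill(
    incident_at: datetime,
    restore_point: datetime,
    service_ready_at: datetime,
    target_rpo: timedelta,
    target_rto: timedelta,
    undocumented_steps: int = 0,
) -> DrillResult:
    """Compare achieved RPO and RTO with the deployment targets."""
    achieved_rpo = incident_at - restore_point      # data lost before the incident
    achieved_rto = service_ready_at - incident_at   # time until services were usable
    passed = (
        achieved_rpo <= target_rpo
        and achieved_rto <= target_rto
        and undocumented_steps == 0  # manual steps outside Git or the runbook fail the drill
    )
    return DrillResult(achieved_rpo, achieved_rto, passed)


result = score_drill(
    incident_at=datetime(2024, 6, 30, 10, 0),
    restore_point=datetime(2024, 6, 30, 9, 55),
    service_ready_at=datetime(2024, 6, 30, 11, 30),
    target_rpo=timedelta(minutes=15),
    target_rto=timedelta(hours=2),
)
print(result.passed)  # True
```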
Restore evidence to keep
| Evidence | Why it matters |
|---|---|
| Backup name and creation time | Shows which recovery point was used. |
| Restore target timestamp | Lets operators compare the intended RPO with the achieved restore point. |
| Database checkpoint or WAL end | Confirms the database restored to the expected point before services started. |
| Restored namespace inventory | Confirms workloads, services, secrets, ingress, and persistent volumes exist. |
| Object version check | Confirms uploaded files and exported artifacts match the restored environment. |
| Application health checks | Confirms DALP services can read the restored state and serve traffic. |
| Reconciliation result | Confirms indexed, off-chain, and on-chain state are consistent enough to run. |
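A drill record can be checked for completeness before it is filed. The sketch below is a minimal example, assuming the evidence is stored as a dictionary keyed by one entry per row of the table. The key names and the missing_evidence helper are hypothetical, not a DALP format.

```python
# One key per evidence row in the table above (names are illustrative).
REQUIRED_EVIDENCE = (
    "backup_name_and_creation_time",
    "restore_target_timestamp",
    "database_checkpoint_or_wal_end",
    "restored_namespace_inventory",
    "object_version_check",
    "application_health_checks",
    "reconciliation_result",
)


def missing_evidence(record: dict) -> list[str]:
    """Return the required evidence keys that are absent or empty."""
    return [key for key in REQUIRED_EVIDENCE if not record.get(key)]


record = {key: "recorded" for key in REQUIRED_EVIDENCE}
record["reconciliation_result"] = ""  # left blank by the operator
print(missing_evidence(record))  # ['reconciliation_result']
```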
For Velero filesystem backups, include the pod resources in the restore. Restoring only persistent volume claims can recreate empty volumes, because the node agent writes filesystem data back into a volume only when the restored pod starts and runs the restore wait flow.
Monitoring
Key metrics to monitor
- Pod availability for application health.
- Pod restart counts for service stability.
- PostgreSQL replication lag for data consistency.
- Backup success and failure counts for recoverability.
- Certificate expiration for TLS continuity.
Recommended alerts
- Treat any backup failure as critical.
- Warn when replication lag exceeds 60 seconds.
- Warn when pod restarts exceed 5 per hour.
- Warn when a certificate expires in less than 14 days.
- Warn when disk usage exceeds 80%.
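The thresholds above can be expressed as a small rule set. The sketch below evaluates a metrics snapshot against them. It is an illustration, not a monitoring integration: the snapshot keys and the evaluate_alerts function are hypothetical, and a real deployment would encode the same thresholds in its alerting system.

```python
def evaluate_alerts(snapshot: dict) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for each threshold the snapshot breaches."""
    alerts = []
    if snapshot.get("backup_failures", 0) > 0:
        alerts.append(("critical", "backup failure"))
    if snapshot.get("replication_lag_seconds", 0) > 60:
        alerts.append(("warning", "replication lag above 60 seconds"))
    if snapshot.get("pod_restarts_last_hour", 0) > 5:
        alerts.append(("warning", "pod restarts above 5 per hour"))
    # A missing certificate value is treated as "not expiring soon".
    if snapshot.get("certificate_days_remaining", float("inf")) < 14:
        alerts.append(("warning", "certificate expires in less than 14 days"))
    if snapshot.get("disk_usage_percent", 0) > 80:
        alerts.append(("warning", "disk usage above 80%"))
    return alerts


snapshot = {"backup_failures": 1, "replication_lag_seconds": 12, "disk_usage_percent": 85}
for severity, message in evaluate_alerts(snapshot):
    print(f"{severity}: {message}")
```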
DR testing requirements
Quarterly DR drills
- Restore from backup to a test environment.
- Verify data integrity.
- Test application functionality.
- Document recovery time.
- Update runbooks if needed.
Annual full DR test
- Simulate cluster failure.
- Execute the full recovery procedure.
- Measure actual RTO/RPO against the deployment targets.
- Report the result to the accountable operating team.
- Update the SLA or operating target if needed.
Related pages
Hot-hot active-active HA
Compare DALP hot-hot and hybrid multi-region deployment patterns for consortium and public EVM networks, including provider patterns, outage behaviour, recovery targets, and when to choose this model.
User guides
Console-led operating guides for administrators and operations teams running regulated digital asset workflows on DALP.