# Backup and recovery

Source: https://docs.settlemint.com/docs/architecture/self-hosting/high-availability/backup-recovery
Backup scope, recovery dependencies, PostgreSQL point-in-time recovery, namespace snapshots, monitoring signals, and disaster recovery drills for self-hosted DALP deployments.



Self-hosted DALP recovery depends on five surfaces: the database, Kubernetes resources, object storage, observability data, and configuration history. Platform operators should verify that every surface can be restored before treating an environment as production-ready.

This page is a recovery reference for self-hosted deployments, not an SLA.

## System context [#system-context]

<Mermaid
  chart="`flowchart TB
  operators[Platform operators]
  clients[API clients and dapp users]
  edge[Ingress and routing]
  dalp[DALP services\nAPI, dapp, workers, indexer]
  postgres[(PostgreSQL\napplication state and PITR)]
  object[(Object storage\nfiles, backups, exports)]
  kube[(Kubernetes resources\nsecrets, config maps, volumes)]
  observability[(Observability data\nmetrics, logs, traces)]
  config[Configuration history\nHelm values and runbooks]
  chain[EVM networks and RPC endpoints]
  restore[Isolated recovery environment]
  evidence[Recovery evidence\nRTO, RPO, reconciliation]

  clients --> edge
  edge --> dalp
  dalp --> postgres
  dalp --> object
  dalp --> kube
  dalp --> observability
  dalp --> chain
  config --> dalp
  operators --> config
  postgres --> restore
  object --> restore
  kube --> restore
  observability --> restore
  config --> restore
  chain --> restore
  restore --> evidence
  operators --> restore

`"
/>

The recovery boundary spans live DALP services, their state stores, the Kubernetes namespace, configuration history, observability data, and the external EVM networks used to reconcile restored state. Test recovery in an isolated environment before routing clients back to the restored stack.

## Recovery boundary [#recovery-boundary]

DALP provides chart-level backup resources and deployment guidance, but the recovery promise belongs to the operated environment. Recovery time objective and recovery point objective values depend on the selected infrastructure, object storage, PostgreSQL setup, restore automation, and runbook staffing. Set those targets for the deployment, then prove them through restore tests.

Do not publish an RTO or RPO as an external commitment until the same target has passed a drill using the target backup location, database restore path, object storage configuration, and route-switch procedure.

## What the DALP chart contributes [#what-the-dalp-chart-contributes]

When backups are enabled, the DALP chart can create the Velero backup storage location and schedule for the release namespace. The default schedule is daily at 02:00, targets DALP-labelled resources, includes persistent volumes through filesystem backup, excludes Kubernetes event resources, and derives the Velero TTL from the configured retention period.

These chart resources give operators a repeatable backup mechanism. They do not prove disaster recovery on their own.

Production evidence still requires proof that each restored surface works: the database, Kubernetes resources, object storage, application health checks, and reconciliation against the relevant EVM networks.

## What gets backed up [#what-gets-backed-up]

| Component            | Backup method                                       | Frequency                    | Retention  | Recovery purpose                                          |
| -------------------- | --------------------------------------------------- | ---------------------------- | ---------- | --------------------------------------------------------- |
| PostgreSQL data      | Managed PITR or CNPG WAL shipping to object storage | Continuous                   | 30 days    | Restore application state to a selected point in time.    |
| Kubernetes resources | Velero backups, with snapshots when available       | Hourly/Daily/Weekly          | 48h/7d/30d | Recreate namespace resources after cluster loss or drift. |
| Object storage       | Bucket versioning                                   | Automatic                    | 90 days    | Recover files, backups, and exported artifacts.           |
| Observability data   | Velero backups when self-hosted                     | Daily                        | 3 days     | Preserve enough telemetry for incident review.            |
| Configuration        | Helm values in Git                                  | Each committed values update | Indefinite | Rebuild the same deployment shape after an outage.        |

## Chart-backed backup resources [#chart-backed-backup-resources]

The DALP chart can create Velero backup resources when `backup.enabled` is set. The chart configures a `BackupStorageLocation` and, when scheduled backups are enabled, a Velero `Schedule` for the release namespace plus any configured additional namespaces.

| Chart setting                                       | Default           | Recovery meaning                                                                 |
| --------------------------------------------------- | ----------------- | -------------------------------------------------------------------------------- |
| `backup.enabled`                                    | `false`           | Backup resources are opt-in and require a Velero-compatible environment.         |
| `backup.storage.provider`                           | `s3`              | Backup storage can use S3-compatible storage, AWS S3, Azure Blob, or GCS.        |
| `backup.schedule.cron`                              | `0 2 * * *`       | The DALP chart schedule runs daily at 02:00 when scheduled backups are enabled.  |
| `backup.retention.days`                             | `30`              | Velero backup TTL is derived from this value.                                    |
| `backup.includeAllPVCs`                             | `true`            | Velero uses filesystem backup for volumes in the included namespace set.         |
| `backup.labelSelector.matchLabels.kots.io/app-slug` | `settlemint-dalp` | Backups select DALP-labelled resources instead of every resource in the cluster. |
| `backup.schedule.paused`                            | `false`           | Operators can pause the schedule without deleting the backup definition.         |
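
As a rough illustration of how these settings fit together, the following Python sketch assembles the defaults from the table into a nested values structure and converts the retention period into a duration string. The `build_values` and `velero_ttl` helpers are hypothetical, and the days-to-hours conversion is an assumption about how the TTL could be derived, not the chart's actual template logic.

```python
# Chart defaults copied from the table above. Key paths are tuples because
# the label key "kots.io/app-slug" itself contains a dot.
CHART_DEFAULTS = {
    ("backup", "enabled"): False,
    ("backup", "storage", "provider"): "s3",
    ("backup", "schedule", "cron"): "0 2 * * *",
    ("backup", "schedule", "paused"): False,
    ("backup", "retention", "days"): 30,
    ("backup", "includeAllPVCs"): True,
    ("backup", "labelSelector", "matchLabels", "kots.io/app-slug"): "settlemint-dalp",
}


def build_values(overrides=None):
    """Merge overrides onto the defaults and return a nested dict (hypothetical helper)."""
    flat = {**CHART_DEFAULTS, **(overrides or {})}
    values = {}
    for path, value in flat.items():
        node = values
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return values


def velero_ttl(retention_days):
    """Assumed conversion: whole days to a Go-style duration string such as '720h0m0s'."""
    return f"{retention_days * 24}h0m0s"


# Backups are opt-in, so an operator would override `backup.enabled`.
values = build_values({("backup", "enabled"): True})
ttl = velero_ttl(values["backup"]["retention"]["days"])
print(values["backup"]["schedule"]["cron"], ttl)
```

Keeping the overrides separate from the defaults makes it easier to review in Git which settings an environment actually changed.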

The support chart also carries a Velero schedule for platform support backups. In that chart, the default schedule runs every 4 hours with a 7-day TTL and excludes Kubernetes event resources. Treat the application chart and support chart as separate backup surfaces when you test recovery.

## PostgreSQL PITR [#postgresql-pitr]

For CloudNativePG deployments:

* WAL shipping to object storage is continuous.
* Base backups run daily.
* Point-in-time recovery can restore to a moment within the retention window.
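
Before starting a restore, it can help to confirm that the requested timestamp is inside the recoverable window. The sketch below assumes the 30-day PostgreSQL retention from the backup table and uses hypothetical helper names. In a real deployment the window is also bounded by the oldest available base backup and the last archived WAL segment, which this example does not model.

```python
from datetime import datetime, timedelta, timezone


def pitr_window(now, retention_days=30):
    """Return the (oldest, newest) timestamps a restore may target (assumed model)."""
    return now - timedelta(days=retention_days), now


def is_restorable(target, now, retention_days=30):
    """True when the target timestamp lies inside the retention window."""
    oldest, newest = pitr_window(now, retention_days)
    return oldest <= target <= newest


now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
inside = datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc)
too_old = datetime(2024, 12, 1, 0, 0, tzinfo=timezone.utc)

print(is_restorable(inside, now))   # True
print(is_restorable(too_old, now))  # False
```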

Velero can use CSI snapshots when a compatible CSI driver and VolumeSnapshot CRDs are installed. Without that support, Velero performs file-level backups.

## Recovery checks [#recovery-checks]

Run restore tests against an isolated environment, not the production namespace. A usable drill confirms these facts:

1. PostgreSQL restores to the selected timestamp inside the PITR window.
2. Kubernetes resources restore with the expected secrets, config maps, services, ingress, and persistent volumes.
3. Object storage data required by the restored environment is present at the expected version.
4. DALP services start against the restored database and configuration.
5. Operators record the achieved RTO and RPO, then compare them with the deployment target.
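
The last check can be made concrete with a small calculation. This sketch assumes the achieved RPO is the gap between the incident and the restored database point, and the achieved RTO is the gap between the incident and the moment health checks pass. The `DrillTimeline` type and the target values in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DrillTimeline:
    incident_at: datetime       # moment the simulated failure began
    restored_point: datetime    # database timestamp that was actually restored
    service_ready_at: datetime  # moment DALP health checks passed

    @property
    def achieved_rpo(self) -> timedelta:
        """Data lost: time between the restored point and the incident."""
        return self.incident_at - self.restored_point

    @property
    def achieved_rto(self) -> timedelta:
        """Downtime: time between the incident and restored service."""
        return self.service_ready_at - self.incident_at


def targets_met(timeline, rto_target, rpo_target):
    """Compare the measured values with the deployment targets."""
    return timeline.achieved_rto <= rto_target and timeline.achieved_rpo <= rpo_target


drill = DrillTimeline(
    incident_at=datetime(2025, 3, 1, 10, 0, tzinfo=timezone.utc),
    restored_point=datetime(2025, 3, 1, 9, 55, tzinfo=timezone.utc),
    service_ready_at=datetime(2025, 3, 1, 11, 30, tzinfo=timezone.utc),
)

# Illustrative targets only; each deployment sets its own.
print(drill.achieved_rpo, drill.achieved_rto)
print(targets_met(drill, rto_target=timedelta(hours=2), rpo_target=timedelta(minutes=15)))
```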

<Mermaid
  chart="`flowchart LR
  target[Recovery target\nRTO and RPO] --> backup[Backup source\nPITR, Velero, object versions]
  backup --> restore[Isolated restore\nDatabase, namespace, storage]
  restore --> app[DALP health checks\nAPI, dapp, workers, indexer]
  app --> reconcile[State reconciliation\nDatabase, indexed state, chain state]
  reconcile --> evidence[Drill evidence\nRecovery time, restored point, owner sign-off]
  evidence --> decision{Targets met?}
  decision -->|Yes| ready[Keep evidence for production review]
  decision -->|No| fix[Update runbook, configuration, or target]
  fix --> backup
`"
/>

Treat the diagram as the minimum drill loop. A restore that stops at database recovery is incomplete until DALP services start, indexed state is reconciled with chain state, and the measured recovery time is recorded against the target.

If a restore test needs manual changes that are not in Git or the runbook, treat the drill as failed until the missing step is documented and repeated.

### Restore evidence to keep [#restore-evidence-to-keep]

| Evidence                       | Why it matters                                                                 |
| ------------------------------ | ------------------------------------------------------------------------------ |
| Backup name and creation time  | Shows which recovery point was used.                                           |
| Restore target timestamp       | Lets operators compare the intended RPO with the achieved restore point.       |
| Database checkpoint or WAL end | Confirms the database restored to the expected point before services started.  |
| Restored namespace inventory   | Confirms workloads, services, secrets, ingress, and persistent volumes exist.  |
| Object version check           | Confirms uploaded files and exported artifacts match the restored environment. |
| Application health checks      | Confirms DALP services can read the restored state and serve traffic.          |
| Reconciliation result          | Confirms indexed, off-chain, and on-chain state are consistent enough to run.  |
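
One way to keep drill records consistent is to check each record against the evidence rows in the table. The sketch below is a minimal example. The field names are hypothetical identifiers derived from the table, not a DALP schema.

```python
# One key per row of the evidence table (hypothetical identifiers).
REQUIRED_EVIDENCE = (
    "backup_name_and_creation_time",
    "restore_target_timestamp",
    "database_checkpoint_or_wal_end",
    "restored_namespace_inventory",
    "object_version_check",
    "application_health_checks",
    "reconciliation_result",
)


def missing_evidence(record):
    """Return the evidence items that are absent or empty in a drill record."""
    return [key for key in REQUIRED_EVIDENCE if not record.get(key)]


# Example record from an imagined drill with two items still outstanding.
record = {
    "backup_name_and_creation_time": "dalp-daily / 2025-03-01T02:00:00Z",
    "restore_target_timestamp": "2025-03-01T09:55:00Z",
    "database_checkpoint_or_wal_end": "recorded",
    "restored_namespace_inventory": "recorded",
    "object_version_check": "",
    "application_health_checks": "passed",
}

print(missing_evidence(record))
```

A drill record with a non-empty result would be treated as incomplete until the listed items are collected.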

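For Velero filesystem backups, include the pod resources in the restore. Restoring only persistent volume claims can recreate empty volumes, because Velero restores filesystem data through an init container that it injects into the restored pods. The node agent writes the data back into the volume only while that init container holds the pod in its restore wait step.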
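
The pod requirement above can be checked before a restore is submitted. This sketch builds a Velero `Restore` manifest as a plain Python dictionary and flags a resource filter that would leave pods out. The `spec` fields used (`backupName`, `includedNamespaces`, `includedResources`, `namespaceMapping`) are standard Velero `Restore` fields. The helper functions, the resource names, and the `velero` install namespace are assumptions for illustration.

```python
def build_restore(name, backup_name, source_ns, target_ns, included_resources=None):
    """Build a Velero Restore manifest that maps the source namespace to an isolated one."""
    spec = {
        "backupName": backup_name,
        "includedNamespaces": [source_ns],
        "namespaceMapping": {source_ns: target_ns},
    }
    if included_resources is not None:
        spec["includedResources"] = list(included_resources)
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        # Assumes Velero is installed in its conventional "velero" namespace.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": spec,
    }


def restores_pods(restore):
    """True when the resource filter does not exclude pods from the restore."""
    included = restore["spec"].get("includedResources")
    if not included:
        return True  # no filter: every backed-up resource type is restored
    return "*" in included or "pods" in included


pvc_only = build_restore(
    "drill-pvc-only", "dalp-daily", "dalp", "dalp-drill",
    included_resources=["persistentvolumeclaims", "persistentvolumes"],
)
full = build_restore("drill-full", "dalp-daily", "dalp", "dalp-drill")

print(restores_pods(pvc_only))  # False: volumes may come back empty
print(restores_pods(full))      # True
```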

## Monitoring [#monitoring]

### Key metrics to monitor [#key-metrics-to-monitor]

* Pod availability for application health.
* Pod restart counts for service stability.
* PostgreSQL replication lag for data consistency.
* Backup success and failure counts for recoverability.
* Certificate expiration for TLS continuity.

### Recommended alerts [#recommended-alerts]

* Treat any backup failure as critical.
* Warn when replication lag exceeds 60 seconds.
* Warn when pod restarts exceed 5 per hour.
* Warn when a certificate expires in less than 14 days.
* Warn when disk usage exceeds 80%.
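
The thresholds above can be expressed as a simple evaluation function. This is a sketch only: the metric names are hypothetical, and a real deployment would normally encode these rules in its alerting system rather than in application code.

```python
def evaluate_alerts(metrics):
    """Return (severity, message) pairs for the recommended alert thresholds."""
    alerts = []
    if metrics.get("backup_failures", 0) > 0:
        alerts.append(("critical", "backup failure detected"))
    if metrics.get("replication_lag_seconds", 0) > 60:
        alerts.append(("warning", "replication lag above 60 seconds"))
    if metrics.get("pod_restarts_last_hour", 0) > 5:
        alerts.append(("warning", "more than 5 pod restarts in the last hour"))
    if metrics.get("cert_days_to_expiry", float("inf")) < 14:
        alerts.append(("warning", "certificate expires in less than 14 days"))
    if metrics.get("disk_usage_percent", 0) > 80:
        alerts.append(("warning", "disk usage above 80%"))
    return alerts


# Hypothetical snapshot of the metrics listed in this section.
snapshot = {
    "backup_failures": 1,
    "replication_lag_seconds": 75,
    "pod_restarts_last_hour": 2,
    "cert_days_to_expiry": 10,
    "disk_usage_percent": 80,
}

for severity, message in evaluate_alerts(snapshot):
    print(f"{severity}: {message}")
```

The comparisons are strict, so a value sitting exactly on a threshold, such as 80% disk usage, does not raise an alert.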

## DR testing requirements [#dr-testing-requirements]

### Quarterly DR drills [#quarterly-dr-drills]

1. Restore from backup to an isolated test environment.
2. Verify data integrity against the recovery checks.
3. Test application functionality.
4. Document the measured recovery time and the restored point.
5. Update runbooks when the drill exposes a gap.

### Annual full DR test [#annual-full-dr-test]

1. Simulate cluster failure.
2. Execute the full recovery procedure.
3. Measure actual RTO/RPO against the deployment targets.
4. Report the result to the accountable operating team.
5. Update the SLA or operating target if needed.

## Related pages [#related-pages]

* [High availability overview](/docs/architecture/self-hosting/high-availability)
* [Cloud-native deployment](/docs/architecture/self-hosting/high-availability/cloud-native)
* [Self-hosting prerequisites](/docs/architecture/self-hosting/prerequisites)
* [Installation process](/docs/architecture/self-hosting/installation-process)
* [Execution Engine component architecture](/docs/architecture/components/infrastructure/execution-engine)
