
Backup & Recovery

Backup strategy, tiered schedules, PostgreSQL PITR, DR testing requirements, and cloud provider-specific HA configurations for self-hosted DALP deployments.

Purpose: Document the backup strategy, tiered schedules, PITR configuration, and DR testing requirements.


What gets backed up

| Component | Backup method | Frequency | Retention |
|---|---|---|---|
| PostgreSQL data | Managed PITR or CNPG WAL shipping to object storage | Continuous | 30 days |
| Kubernetes resources | Velero backups (file-level, snapshots optional) | Hourly/Daily/Weekly | 48h/7d/30d |
| Object storage | Bucket versioning | Automatic | 90 days |
| Observability data | Velero backups when self-hosted | Daily | 3 days |
| Configuration | Helm values in Git | Every change | Indefinite |

Tiered backup schedule

Default Velero schedules:

| Schedule | Timing | Retention | Contents |
|---|---|---|---|
| Hourly | Every hour | 48 hours | ConfigMaps, Secrets |
| Daily | 3 AM daily | 7 days | Full namespace snapshot |
| Weekly | 4 AM Sunday | 30 days | Full namespace + volumes |
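
The three tiers can be modelled in a few lines to check which schedules still hold a backup of a given age. This is a minimal sketch, not the shipped configuration: the cron expressions are one plausible encoding of the timings in the table, and the names `Tier`, `TIERS`, `go_duration`, and `tiers_covering` are hypothetical. The `go_duration` helper renders a retention period in the Go-style duration form (for example `48h0m0s`) that Velero uses for backup TTLs.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    cron: str             # 5-field cron expression (assumed encoding of the timing)
    retention: timedelta  # how long a backup from this tier is kept
    contents: str


# Values taken from the schedule table; the cron strings are assumptions.
TIERS = (
    Tier("hourly", "0 * * * *", timedelta(hours=48), "ConfigMaps, Secrets"),
    Tier("daily", "0 3 * * *", timedelta(days=7), "Full namespace snapshot"),
    Tier("weekly", "0 4 * * 0", timedelta(days=30), "Full namespace + volumes"),
)


def go_duration(td: timedelta) -> str:
    """Render a timedelta as a Go-style duration string such as '48h0m0s'."""
    total = int(td.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h{minutes}m{seconds}s"


def tiers_covering(age: timedelta) -> list[str]:
    """Return the tiers whose retention still covers a backup of this age."""
    return [t.name for t in TIERS if age <= t.retention]
```

For example, `tiers_covering(timedelta(days=5))` returns `["daily", "weekly"]`: the hourly tier has already expired at that age, so only the namespace-level backups remain.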

PostgreSQL PITR

For CloudNativePG deployments:

  • WAL shipping to object storage (continuous)
  • Base backups daily
  • Point-in-time recovery to any moment within retention window
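
To make the recovery model concrete: a point-in-time restore starts from the most recent base backup that completed before the target time, then replays WAL up to the target. The sketch below only illustrates that selection logic. It assumes the 30-day retention from the table above, and `RETENTION` and `pick_base_backup` are hypothetical names, not part of CloudNativePG.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # PostgreSQL retention from the backup table


def pick_base_backup(
    base_backups: list[datetime], target: datetime, now: datetime
) -> datetime:
    """Pick the base backup from which WAL replay can reach `target`.

    `base_backups` holds the completion times of the available base backups.
    """
    if target > now:
        raise ValueError("recovery target is in the future")
    if now - target > RETENTION:
        raise ValueError("recovery target is outside the retention window")
    candidates = [b for b in base_backups if b <= target]
    if not candidates:
        raise ValueError("no base backup completed before the recovery target")
    # The newest eligible base backup minimises the amount of WAL to replay.
    return max(candidates)
```

With daily base backups, a target of 09:30 on a given day resolves to that day's base backup if it completed earlier that morning, and to the previous day's otherwise.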

Velero can optionally use CSI snapshots when a compatible CSI driver and the VolumeSnapshot CRDs are installed. If either is unavailable, Velero performs file-level backups instead.

Monitoring

Key metrics to monitor

| Metric | Purpose |
|---|---|
| Pod availability | Application health |
| Pod restart counts | Stability indicator |
| PostgreSQL replication lag | Data consistency |
| Backup success/failure | Recovery capability |
| Certificate expiration | TLS continuity |

| Alert | Threshold | Severity |
|---|---|---|
| Backup failed | Any failure | Critical |
| Replication lag | Over 60 seconds | Warning |
| Pod restart | Over 5 per hour | Warning |
| Certificate expiring | Under 14 days | Warning |
| Disk usage | Over 80% | Warning |
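
The alert thresholds can be expressed as a small rule check, which is useful for unit-testing alert logic before it is encoded in a monitoring system. This is an illustrative sketch only: the `Snapshot` fields and the `evaluate` function are hypothetical names, and how the values are collected is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One reading of the monitored values (field names are hypothetical)."""
    backup_failures: int
    replication_lag_seconds: float
    pod_restarts_last_hour: int
    cert_days_remaining: float
    disk_usage_percent: float


def evaluate(s: Snapshot) -> list[tuple[str, str]]:
    """Return (alert, severity) pairs for every threshold that is crossed."""
    alerts = []
    if s.backup_failures > 0:            # any failure
        alerts.append(("Backup failed", "Critical"))
    if s.replication_lag_seconds > 60:   # over 60 seconds
        alerts.append(("Replication lag", "Warning"))
    if s.pod_restarts_last_hour > 5:     # over 5 per hour
        alerts.append(("Pod restart", "Warning"))
    if s.cert_days_remaining < 14:       # under 14 days
        alerts.append(("Certificate expiring", "Warning"))
    if s.disk_usage_percent > 80:        # over 80%
        alerts.append(("Disk usage", "Warning"))
    return alerts
```

The comparisons are strict, matching the "over" and "under" wording of the thresholds, so a replication lag of exactly 60 seconds does not raise an alert.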

DR testing requirements

Quarterly DR drills

  1. Restore from backup to test environment
  2. Verify data integrity
  3. Test application functionality
  4. Document recovery time
  5. Update runbooks if needed
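
One way to approach the data-integrity step is to compare per-table digests between a reference copy and the restored copy. The sketch below assumes rows have already been fetched as tuples from both environments and that the reference reflects the same point in time as the restore. `table_digest` and `compare_tables` are hypothetical helpers, not part of any DALP tooling.

```python
import hashlib
from collections.abc import Iterable, Mapping


def table_digest(rows: Iterable[tuple]) -> str:
    """Order-independent SHA-256 digest over an iterable of row tuples."""
    h = hashlib.sha256()
    # Sorting the serialised rows makes the digest independent of fetch order.
    for line in sorted(repr(tuple(r)) for r in rows):
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def compare_tables(
    reference: Mapping[str, Iterable[tuple]],
    restored: Mapping[str, Iterable[tuple]],
) -> list[str]:
    """Return the names of tables that are missing on one side or differ."""
    mismatched = []
    for name in sorted(set(reference) | set(restored)):
        if name not in reference or name not in restored:
            mismatched.append(name)
        elif table_digest(reference[name]) != table_digest(restored[name]):
            mismatched.append(name)
    return mismatched
```

An empty result means every table matched. Any returned name is a candidate for closer inspection before the drill is signed off.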

Annual full DR test

  1. Simulate cluster failure
  2. Execute full recovery procedure
  3. Measure actual RTO/RPO vs targets
  4. Report to stakeholders
  5. Update SLA if needed
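
Measuring actual RTO and RPO comes down to three timestamps recorded during the test: when the failure started, the newest data point present after recovery, and when the service was confirmed healthy again. A minimal sketch follows. The `DrillResult` and `compare_to_targets` names are hypothetical, and the target values are supplied by the caller because this page does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    failure_at: datetime           # when the simulated failure started
    last_recoverable_at: datetime  # newest data point present after recovery
    service_restored_at: datetime  # when the application passed its checks

    @property
    def rto(self) -> timedelta:
        """Measured recovery time: failure until service is restored."""
        return self.service_restored_at - self.failure_at

    @property
    def rpo(self) -> timedelta:
        """Measured data loss window: last recoverable point until failure."""
        return self.failure_at - self.last_recoverable_at


def compare_to_targets(
    result: DrillResult, rto_target: timedelta, rpo_target: timedelta
) -> dict:
    """Summarise measured values against the caller-supplied targets."""
    return {
        "rto_actual": result.rto,
        "rto_met": result.rto <= rto_target,
        "rpo_actual": result.rpo,
        "rpo_met": result.rpo <= rpo_target,
    }
```

The resulting dictionary can feed directly into the stakeholder report, and a failed `rto_met` or `rpo_met` flag is the signal to revisit either the recovery procedure or the SLA.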
