Backup & Recovery
Backup strategy, tiered schedules, PostgreSQL PITR, DR testing requirements, and cloud provider-specific HA configurations for self-hosted DALP deployments.
- Purpose: Document the backup strategy, tiered schedules, PITR configuration, and DR testing requirements.
- Doc type: Reference
- Related: HA Overview, Cloud-native, Observability
What gets backed up
| Component | Backup method | Frequency | Retention |
|---|---|---|---|
| PostgreSQL data | Managed-service PITR, or CloudNativePG (CNPG) WAL shipping to object storage | Continuous | 30 days |
| Kubernetes resources | Velero backups (file-level, snapshots optional) | Hourly/Daily/Weekly | 48h/7d/30d |
| Object storage | Bucket versioning | Automatic | 90 days |
| Observability data | Velero backups when self-hosted | Daily | 3 days |
| Configuration | Helm values in Git | Every change | Indefinite |
Tiered backup schedule
Default Velero schedules:
| Schedule | Timing | Retention | Contents |
|---|---|---|---|
| Hourly | Every hour | 48 hours | ConfigMaps, Secrets |
| Daily | 3 AM daily | 7 days | Full namespace snapshot |
| Weekly | 4 AM Sunday | 30 days | Full namespace + volumes |
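The schedule table above can be encoded as data to make the retention arithmetic explicit. The following is a minimal sketch, not the shipped configuration: the `Schedule` class, the cron strings, and the helper names are illustrative assumptions, and the source does not state which time zone "3 AM" and "4 AM" refer to. The hours-based duration format mirrors the Go-style durations Velero uses for backup TTLs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Schedule:
    """Hypothetical container for one row of the schedule table."""
    name: str
    cron: str             # standard 5-field cron expression (assumed encoding)
    retention: timedelta  # how long a backup from this schedule is kept
    contents: str


# Values mirror the table; the cron strings are one possible encoding of it.
SCHEDULES = [
    Schedule("hourly", "0 * * * *", timedelta(hours=48), "ConfigMaps, Secrets"),
    Schedule("daily", "0 3 * * *", timedelta(days=7), "Full namespace snapshot"),
    Schedule("weekly", "0 4 * * 0", timedelta(days=30), "Full namespace + volumes"),
]


def is_expired(schedule, created_at, now):
    """Return True once a backup has outlived its schedule's retention."""
    return now - created_at > schedule.retention


def ttl_string(schedule):
    """Format the retention as an hours-based duration such as '48h0m0s'."""
    hours = int(schedule.retention.total_seconds() // 3600)
    return f"{hours}h0m0s"
```

For example, `ttl_string(SCHEDULES[2])` converts the weekly schedule's 30 days into hours, which is a quick way to check that a configured TTL matches the retention column.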
PostgreSQL PITR
For CloudNativePG deployments:
- Continuous WAL shipping to object storage
- Daily base backups
- Point-in-time recovery to any moment within the 30-day retention window
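Point-in-time recovery restores the newest base backup taken at or before the target time, then replays the shipped WAL up to that time. The sketch below illustrates only that selection logic under the 30-day retention stated earlier. `plan_pitr` and its arguments are hypothetical names, not part of CloudNativePG, and a real recovery is driven by the operator rather than by code like this.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # PostgreSQL retention from the backup table


def plan_pitr(target, base_backups, now):
    """Pick the base backup to restore before replaying WAL up to `target`.

    Returns the newest base backup taken at or before `target`. Raises
    ValueError if the target is in the future, older than the retention
    window, or not preceded by any base backup.
    """
    if target > now:
        raise ValueError("target is in the future")
    if now - target > RETENTION:
        raise ValueError("target is older than the retention window")
    candidates = [b for b in base_backups if b <= target]
    if not candidates:
        raise ValueError("no base backup precedes the target")
    return max(candidates)
```

With daily base backups, the WAL replay after a restore covers at most roughly one day of changes, which is why the base backup frequency affects recovery time as well as storage use.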
Velero can optionally use CSI snapshots for volume backups when a compatible CSI driver and the VolumeSnapshot CRDs are installed. If either is missing, Velero falls back to file-level backups.
Monitoring
Key metrics to monitor
| Metric | Purpose |
|---|---|
| Pod availability | Application health |
| Pod restart counts | Stability indicator |
| PostgreSQL replication lag | Data consistency |
| Backup success/failure | Recovery capability |
| Certificate expiration | TLS continuity |
Recommended alerts
| Alert | Threshold | Severity |
|---|---|---|
| Backup failed | Any failure | Critical |
| Replication lag | Over 60 seconds | Warning |
| Pod restart | Over 5 per hour | Warning |
| Certificate expiring | Under 14 days | Warning |
| Disk usage | Over 80% | Warning |
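The alert table above translates directly into threshold rules. The following sketch shows one way to evaluate them against a single metrics sample. The metric key names (for example `replication_lag_seconds`) are assumptions made for illustration and do not correspond to any particular exporter; in practice these rules would normally live in the alerting system itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """Hypothetical representation of one row of the alert table."""
    alert: str
    metric: str                     # assumed key in the metrics sample
    fires: Callable[[float], bool]  # threshold check from the table
    severity: str


RULES = [
    Rule("Backup failed", "backup_failures", lambda v: v > 0, "critical"),
    Rule("Replication lag", "replication_lag_seconds", lambda v: v > 60, "warning"),
    Rule("Pod restart", "pod_restarts_per_hour", lambda v: v > 5, "warning"),
    Rule("Certificate expiring", "cert_days_remaining", lambda v: v < 14, "warning"),
    Rule("Disk usage", "disk_usage_percent", lambda v: v > 80, "warning"),
]


def evaluate(sample):
    """Return (alert, severity) pairs for every rule whose threshold is crossed.

    Metrics missing from the sample are skipped rather than treated as healthy
    or failing.
    """
    return [
        (rule.alert, rule.severity)
        for rule in RULES
        if rule.metric in sample and rule.fires(sample[rule.metric])
    ]
```

Note that the thresholds are strict ("over" and "under"), so a value exactly at the boundary, such as 80% disk usage, does not fire in this sketch.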
DR testing requirements
Quarterly DR drills
- Restore from backup to test environment
- Verify data integrity
- Test application functionality
- Document recovery time
- Update runbooks if needed
Annual full DR test
- Simulate cluster failure
- Execute full recovery procedure
- Measure actual RTO and RPO against targets
- Report to stakeholders
- Update SLA if needed
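Measuring RTO and RPO during a drill comes down to three timestamps: when the failure occurred, the latest point the data could be restored to, and when service was restored. The sketch below shows that arithmetic. The function name and the target values in the usage note are hypothetical, since this page does not state the RTO or RPO targets.

```python
from datetime import datetime, timedelta


def measure_drill(failure_at, last_recoverable_at, service_restored_at,
                  rto_target, rpo_target):
    """Compute achieved RTO and RPO for one drill and compare with targets."""
    rto = service_restored_at - failure_at   # how long the service was down
    rpo = failure_at - last_recoverable_at   # window of data that was lost
    return {
        "rto": rto,
        "rpo": rpo,
        "rto_met": rto <= rto_target,
        "rpo_met": rpo <= rpo_target,
    }
```

As a usage example, a drill with a failure at 10:00, a last recoverable point at 09:58, and service restored at 10:45 yields an RTO of 45 minutes and an RPO of 2 minutes, which can then be recorded in the drill report alongside the targets.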
See also
- Prerequisites for infrastructure requirements
- Installation process for deployment steps
- DALP Execution Engine for component architecture