Cloud-native high availability
Use managed Kubernetes, managed data services, multi-zone placement, health probes, and backup tooling as the default high availability pattern for self-hosted DALP deployments.
Cloud-native high availability is the default pattern for self-hosted DALP when you can use managed Kubernetes, PostgreSQL, cache, object storage, and provider backup services in one region. The platform runs the application tier across availability zones, while managed services carry the strongest data durability and failover guarantees.
Start with cloud-native HA before considering hot-warm, hot-cold, or hot-hot designs. Alternative HA designs add cost and coordination work. The cloud-native pattern is usually the cleanest starting point for one-region production environments.
Architecture
Smallest production overlay
Start with the default cloud-native posture: keep the application replicas enabled and use the built-in probes. When managed services replace bundled stateful services, disable the bundled services in the same installation-specific values file. Then provide the external PostgreSQL, cache, object storage, route, resource, and observability values for that environment.
```yaml
# values-production-ha.yaml for wrapper charts such as dalp-local or dalp-staging
dalp:
  dapp:
    replicaCount: 2
    podAntiAffinityPreset: soft
support:
  postgresql:
    enabled: false
  redis:
    enabled: false
  rustfs:
    enabled: false
```

If you install the dalp chart directly, place the dApp values under dapp instead of dalp.dapp. If you install the support chart separately, place the support values at the root of that chart's values file.
Keep the overlay intentionally small. The example shows the availability boundary without exposing credentials. A production values file also needs the target environment's PostgreSQL, Redis or Valkey, object storage, route, resource, and observability values.
What this pattern covers
| Layer | Cloud-native responsibility | DALP configuration surface |
|---|---|---|
| Application pods | Run multiple replicas and let Kubernetes replace unhealthy pods. | DALP chart values define replica counts, rolling update strategy, readiness probes, liveness probes, affinity, node selectors, and tolerations for services. |
| Zone placement | Spread application pods across failure domains when the cluster exposes zones. | Configure node pools and scheduling rules in the target environment. Use chart affinity and topology spread controls where the specific component exposes them. |
| PostgreSQL | Use provider-managed HA, point-in-time recovery, and backup retention where available. | Configure external PostgreSQL through global.datastores.*.postgresql connection settings or an existing secret. |
| Cache | Use a managed Redis or Valkey service when available. | Provide external Redis connection settings or an existing secret instead of relying on an in-cluster cache for production HA. |
| Object storage | Use cloud-provider or S3-compatible object storage for application files, document uploads, and backup targets. | Configure the storage endpoint and buckets before deployment. |
| Backup tooling | Use provider backups for managed services and Velero only for Kubernetes resources when needed. | Enable Velero only when the environment requires Kubernetes resource backup and restore. Keep database and object storage backup ownership explicit in the runbook. |
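
The PostgreSQL row mentions supplying connection settings through an existing secret. As a rough sketch only, a Kubernetes Secret for that purpose might look like the following. The secret name, the namespace, and every key name are hypothetical placeholders. Take the keys the DALP chart actually reads from the chart's own values reference.

```yaml
# Hypothetical example: the name, namespace, and key names are placeholders,
# not keys confirmed by the DALP chart.
apiVersion: v1
kind: Secret
metadata:
  name: dalp-external-postgresql
  namespace: dalp
type: Opaque
stringData:
  host: postgres.example.internal
  port: "5432"
  database: dalp
  username: dalp_app
  password: replace-with-a-generated-password
```

In a real environment the secret would normally be created by a secret manager or a sealed-secret workflow rather than committed to a values repository.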
Application health and placement
DALP chart deployments expose Kubernetes health and placement controls rather than hiding availability behind a single switch.
For the dApp service, the chart sets two replicas by default and uses TCP liveness and readiness probes on the HTTP port. The chart also exposes pod affinity, pod anti-affinity, node affinity, node selectors, and tolerations. The default pod anti-affinity preset is soft. That setting lets the scheduler prefer separating replicas without blocking scheduling when the cluster is small.
For production, verify these controls before go-live:
- Keep at least two dApp replicas unless a documented maintenance window requires a temporary scale-down.
- Place worker nodes across the availability zones that the cloud region supports.
- Use anti-affinity or topology spread rules for components that expose them so a single node or zone does not hold every replica.
- Keep liveness and readiness probes enabled and alert on repeated restarts, readiness failures, and unavailable replicas.
- Set resource requests and limits so replacement pods can be scheduled during a node or zone incident.
Other DALP components can expose different probe endpoints and placement controls. Review the component chart before applying one component's probe path or scheduling setting to another.
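
To make the placement and probe guidance above concrete, the following is a generic Kubernetes Deployment that combines a soft zone spread, a soft node-level anti-affinity rule, TCP probes, and resource requests. It is an illustration of the underlying Kubernetes fields, not the manifest the DALP chart renders. The name, labels, image, port number, and resource figures are assumptions chosen for the example.

```yaml
# Illustrative only: a generic Deployment, not rendered DALP chart output.
# Name, labels, image, port, and resource values are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-dapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: example-dapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: example-dapp
    spec:
      # Prefer an even spread across zones, but still schedule if only one zone has room.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: example-dapp
      affinity:
        # Soft anti-affinity: prefer different nodes without blocking small clusters.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: example-dapp
      containers:
        - name: dapp
          image: registry.example.com/dapp:1.0.0
          ports:
            - name: http
              containerPort: 3000
          # TCP probes against the HTTP port.
          readinessProbe:
            tcpSocket:
              port: http
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
          # Requests let the scheduler reserve room for replacement pods.
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
```

Using `ScheduleAnyway` and a preferred anti-affinity rule mirrors the soft preset described above: replicas are separated when capacity allows, and scheduling still succeeds in a small cluster.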
Data services and backups
The recommended cloud-native model keeps state in managed services where the cloud provider supplies the failover and durability controls. DALP then connects to those services through chart values and secrets.
Use this split for production planning:
- Put PostgreSQL outside the application cluster when managed HA and PITR are available.
- Put Redis or Valkey and object storage in managed services where the platform allows it.
- Use Velero for Kubernetes resource recovery only when the environment requires cluster-level backups.
- Test restore procedures before production traffic goes live.
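
Where cluster-level backups are required, the Velero point above could be expressed as a scheduled backup that captures Kubernetes resources only and leaves volume data to the managed services. This is a sketch under stated assumptions: the schedule name, the dalp namespace, the cron expression, and the retention period are placeholders, and Velero is assumed to be installed in the velero namespace.

```yaml
# Hypothetical example: names, namespace, schedule, and retention are placeholders.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: dalp-resources-daily
  namespace: velero
spec:
  # Run once a day at 02:00 (cron syntax).
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - dalp
    # Back up Kubernetes resources only; data lives in managed services.
    snapshotVolumes: false
    # Keep each backup for 30 days.
    ttl: 720h0m0s
```

A schedule like this only has value once a restore from one of its backups has been rehearsed, which is what the final item in the list asks for.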
When managed services are not available, use the self-hosting prerequisites and backup pages to plan in-cluster PostgreSQL, object storage, and backup ownership.
Running PostgreSQL and object storage in-cluster is a separate operating model that needs its own recovery plan. It is not a small toggle on the cloud-native pattern.
Production readiness checks
Before production traffic, confirm each control has an owner, an alert, and a recovery test.
| Check | Minimum evidence |
|---|---|
| Replica posture | Application services that should survive pod or node loss have more than one replica, and singleton services are documented as singleton dependencies. |
| Scheduling posture | Node pools span failure domains, and placement rules do not pin all replicas to one node group or zone. |
| Managed PostgreSQL | HA mode, PITR, backup retention, credentials, failover behaviour, and application reconnection are tested. |
| Managed cache | The cache service has HA or failover enabled where the provider supports it. DALP credentials and TLS settings are tested. |
| Object storage | Buckets, retention, versioning or replication posture, and restore access match the selected RPO. |
| Kubernetes resources | Velero, GitOps, or another approved recovery method can restore required namespace resources. |
| Observability | Alerts cover API availability, pod restarts, readiness failures, database failover, cache availability, object storage access, and backup status. |
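
As one possible starting point for the observability row, the following Prometheus rule file covers repeated restarts, readiness failures, and unavailable replicas. It assumes Prometheus scrapes kube-state-metrics and that DALP runs in a namespace called dalp. The group name, thresholds, durations, and severity labels are illustrative and should be tuned to the environment. Database, cache, object storage, and backup alerts depend on provider metrics and are not shown.

```yaml
# Hypothetical example: namespace, thresholds, and severities are placeholders.
# Requires kube-state-metrics to be scraped by Prometheus.
groups:
  - name: dalp-availability-example
    rules:
      - alert: PodRestartingRepeatedly
        expr: increase(kube_pod_container_status_restarts_total{namespace="dalp"}[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting repeatedly"
      - alert: PodNotReady
        expr: kube_pod_status_ready{namespace="dalp", condition="false"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} has not been ready for 10 minutes"
      - alert: DeploymentReplicasUnavailable
        expr: kube_deployment_status_replicas_unavailable{namespace="dalp"} > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Deployment {{ $labels.deployment }} has unavailable replicas"
```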
Recovery expectations
Cloud-native HA reduces downtime for ordinary pod, node, and zone failures, but the exact RTO and RPO come from the selected cloud services and the operator's runbooks.
| Failure type | Expected handling | What to verify |
|---|---|---|
| Pod failure | Kubernetes removes the pod from service and starts a replacement. | Readiness and liveness probes are enabled and alerting detects repeated restarts. |
| Node failure | The scheduler places replacement pods on healthy nodes. | Node pools span failure domains and placement rules do not pin all replicas to one node group. |
| Zone failure | Surviving zones continue serving if capacity and dependencies remain available. | Application replicas, managed PostgreSQL, cache, ingress, and object storage all have a tested zone-failure posture. |
| Database failure | Managed PostgreSQL handles failover according to the provider's HA model. | PITR, backup retention, failover behaviour, credentials, and application reconnection are tested. |
| Cluster resource loss | Backups or declarative deployment state rebuild Kubernetes resources. | Velero or the platform's GitOps process can restore the required resources. |
Provider patterns
| Provider family | Typical building blocks |
|---|---|
| AWS | EKS across availability zones, RDS Multi-AZ, ElastiCache Multi-AZ, S3, and provider backup policies. |
| Azure | AKS with zone-aware node pools, Azure Database zone-redundant HA, Azure Cache zone redundancy, Blob Storage ZRS or GRS, and provider backup policies. |
| GCP | Regional GKE, multi-zone node pools, Cloud SQL Regional HA, Memorystore Standard tier, Cloud Storage, and provider backup policies. |
| OpenShift | OpenShift with a multi-node (highly available) control plane, worker nodes across failure domains, OpenShift routing, and approved persistent storage or managed data services. |
When not to use this pattern
Choose another HA pattern when the deployment needs a different recovery model:
- Use hot-warm when a standby environment must be ready but not fully active.
- Use hot-cold when cost matters more than recovery speed.
- Use hot-hot only when the operating model, data consistency controls, and network design can support active-active service.
- For any pattern, use the backup and recovery guidance to define restore tests, evidence, and RTO/RPO validation.
Related pages
- High availability: How self-hosted DALP operators choose a high availability and disaster recovery pattern, with recovery metrics, ownership boundaries, and links to the supported deployment scenarios.
- Hot-warm active-standby: Run one active DALP cluster with a warm standby cluster, replicated PostgreSQL data, staged validator operations, and a manual regional failover runbook.