# Cloud-native high availability

Source: https://docs.settlemint.com/docs/architecture/self-hosting/high-availability/cloud-native
Use managed Kubernetes, managed data services, multi-zone placement, health probes, and backup tooling as the default high availability pattern for self-hosted DALP deployments.


Cloud-native high availability is the default pattern for self-hosted DALP when you can use managed Kubernetes, PostgreSQL, cache, object storage, and provider backup services in one region. The platform runs the application tier across availability zones, while managed services carry the strongest data durability and failover guarantees.

Start with cloud-native HA before considering hot-warm, hot-cold, or hot-hot designs. Those alternatives add cost and coordination work, so the cloud-native pattern is usually the cleanest starting point for one-region production environments.

## Architecture [#architecture]

<Mermaid
  chart="`flowchart TB
  user[&#x22;Users and operators&#x22;] --> edge[&#x22;DNS, ingress, or load balancer&#x22;]

  subgraph region[&#x22;One cloud region&#x22;]
    edge --> cluster[&#x22;Managed Kubernetes or OpenShift&#x22;]

    subgraph zones[&#x22;Worker pools across availability zones&#x22;]
      za[&#x22;Zone A application pods&#x22;]
      zb[&#x22;Zone B application pods&#x22;]
      zc[&#x22;Zone C application pods&#x22;]
    end

    cluster --> za
    cluster --> zb
    cluster --> zc

    subgraph managed[&#x22;Managed data and storage services&#x22;]
      pg[&#x22;PostgreSQL with HA and PITR&#x22;]
      cache[&#x22;Managed Redis or Valkey&#x22;]
      obj[&#x22;Object storage&#x22;]
      backup[&#x22;Backup and restore tooling&#x22;]
    end

    za --> managed
    zb --> managed
    zc --> managed
  end

`"
/>

## Smallest production overlay [#smallest-production-overlay]

Start with the default cloud-native posture: keep the application replicas enabled and use the built-in probes. When managed services replace bundled stateful services, disable the bundled services in the same installation-specific values file. Then provide the external PostgreSQL, cache, object storage, route, resource, and observability values for that environment.

```yaml
# values-production-ha.yaml for wrapper charts such as dalp-local or dalp-staging
dalp:
  dapp:
    replicaCount: 2
    podAntiAffinityPreset: soft

support:
  postgresql:
    enabled: false
  redis:
    enabled: false
  rustfs:
    enabled: false
```

If you install the `dalp` chart directly, place the dApp values under `dapp` instead of `dalp.dapp`. If you install the support chart separately, place the support values at the root of that chart's values file.
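
As a sketch of that placement rule, the same overlay split across two directly installed charts could look like the following. The file names are illustrative assumptions. The keys are the ones from the wrapper example, moved to the level each chart expects.

```yaml
# values-dalp.yaml (hypothetical file name), for a direct install of the dalp chart.
# The dApp values sit under `dapp`, not `dalp.dapp`.
dapp:
  replicaCount: 2
  podAntiAffinityPreset: soft
```

```yaml
# values-support.yaml (hypothetical file name), for a separate install of the support chart.
# The support values sit at the root of the file.
postgresql:
  enabled: false
redis:
  enabled: false
rustfs:
  enabled: false
```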

Keep the overlay intentionally small. It shows the availability boundary without exposing credentials. The PostgreSQL, Redis or Valkey, object storage, route, resource, and observability values are environment-specific and belong in the production values file for the target environment.

## What this pattern covers [#what-this-pattern-covers]

| Layer            | Cloud-native responsibility                                                                                     | DALP configuration surface                                                                                                                                          |
| ---------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Application pods | Run multiple replicas and let Kubernetes replace unhealthy pods.                                                | DALP chart values define replica counts, rolling update strategy, readiness probes, liveness probes, affinity, node selectors, and tolerations for services.        |
| Zone placement   | Spread application pods across failure domains when the cluster exposes zones.                                  | Configure node pools and scheduling rules in the target environment. Use chart affinity and topology spread controls where the specific component exposes them.     |
| PostgreSQL       | Use provider-managed HA, point-in-time recovery, and backup retention where available.                          | Configure external PostgreSQL through `global.datastores.*.postgresql` connection settings or an existing secret.                                                   |
| Cache            | Use a managed Redis or Valkey service when available.                                                           | Provide external Redis connection settings or an existing secret instead of relying on an in-cluster cache for production HA.                                       |
| Object storage   | Use cloud-provider or S3-compatible object storage for application files, document uploads, and backup targets. | Configure the storage endpoint and buckets before deployment.                                                                                                       |
| Backup tooling   | Use provider backups for managed services and Velero only for Kubernetes resources when needed.                 | Enable Velero only when the environment requires Kubernetes resource backup and restore. Keep database and object storage backup ownership explicit in the runbook. |
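
Where the table mentions an existing secret, the general shape is a standard Kubernetes `Secret` created before the chart is installed. The sketch below is illustrative only: the Secret name, namespace, and key names are hypothetical, and the keys the DALP chart actually reads should be taken from the chart's values reference.

```yaml
# Illustrative only. The name, namespace, and every key under stringData are
# hypothetical placeholders, not confirmed DALP chart conventions.
apiVersion: v1
kind: Secret
metadata:
  name: dalp-external-postgresql   # hypothetical name
  namespace: dalp                  # hypothetical namespace
type: Opaque
stringData:
  host: "<managed-postgresql-endpoint>"
  port: "5432"
  database: "<database-name>"
  username: "<username>"
  password: "<password>"
```

Creating the secret out of band keeps credentials out of the values file, which matches the intent of the small overlay shown earlier.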

## Application health and placement [#application-health-and-placement]

DALP chart deployments expose Kubernetes health and placement controls rather than hiding availability behind a single switch.

For the dApp service, the chart sets two replicas by default and uses TCP liveness and readiness probes on the HTTP port. The chart also exposes pod affinity, pod anti-affinity, node affinity, node selectors, and tolerations. The default pod anti-affinity preset is `soft`. That setting lets the scheduler prefer separating replicas without blocking scheduling when the cluster is small.

For production, verify these controls before go-live:

1. Keep at least two dApp replicas unless a documented maintenance window requires a temporary scale-down.
2. Place worker nodes across the availability zones that the cloud region supports.
3. Use anti-affinity or topology spread rules for components that expose them so a single node or zone does not hold every replica.
4. Keep liveness and readiness probes enabled and alert on repeated restarts, readiness failures, and unavailable replicas.
5. Set resource requests and limits so replacement pods can be scheduled during a node or zone incident.

Other DALP components can expose different probe endpoints and placement controls. Review the component chart before applying one component's probe path or scheduling setting to another.
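
For orientation, the controls described in this section map to standard Kubernetes fields such as the ones below. This is a generic Deployment sketch, not output rendered by the DALP chart. The `app.kubernetes.io/name: dapp` label, the image reference, the port number, and the resource figures are all assumptions chosen for illustration.

```yaml
# Generic Kubernetes example, not DALP chart output.
# Labels, image, port, and resource figures are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: dapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dapp
    spec:
      # Prefer an even spread across zones without blocking scheduling,
      # comparable in spirit to a soft anti-affinity preset.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: dapp
      containers:
        - name: dapp
          image: example.invalid/dapp:placeholder
          ports:
            - name: http
              containerPort: 3000
          # TCP probes on the HTTP port, as described for the dApp service.
          readinessProbe:
            tcpSocket:
              port: http
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: http
            periodSeconds: 10
          # Requests let the scheduler reserve room for replacement pods.
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
```

`ScheduleAnyway` keeps the constraint advisory. Switching to `DoNotSchedule` makes it strict, which can leave pods pending when a zone has no spare capacity.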

## Data services and backups [#data-services-and-backups]

The recommended cloud-native model keeps state in managed services where the cloud provider supplies the failover and durability controls. DALP then connects to those services through chart values and secrets.

Use this split for production planning:

1. Put PostgreSQL outside the application cluster when managed HA and PITR are available.
2. Put Redis or Valkey and object storage in managed services where the platform allows it.
3. Use Velero for Kubernetes resource recovery only when the environment requires cluster-level backups.
4. Test restore procedures before production traffic goes live.
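
If the environment does need cluster-level backups, a Velero `Schedule` limited to Kubernetes resources is one way to express item 3. The sketch below assumes Velero is installed in the `velero` namespace and that DALP runs in a namespace called `dalp`. The resource name, cron expression, and retention period are illustrative assumptions.

```yaml
# Sketch of a resource-only Velero schedule. Names, timing, and retention
# are assumptions to adapt to the environment's runbook.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: dalp-namespace-resources   # hypothetical name
  namespace: velero                # assumed Velero install namespace
spec:
  schedule: "0 2 * * *"            # daily at 02:00
  template:
    includedNamespaces:
      - dalp                       # hypothetical application namespace
    snapshotVolumes: false         # back up Kubernetes resources only
    ttl: 720h0m0s                  # keep each backup for 30 days
```

Disabling volume snapshots keeps database and object storage backups with the managed services, so ownership stays explicit.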

When managed services are not available, use the self-hosting prerequisites and backup pages to plan in-cluster PostgreSQL, object storage, and backup ownership. Running stateful services in-cluster is a separate operating model that needs its own recovery plan. It is not a small toggle on the cloud-native pattern.

## Production readiness checks [#production-readiness-checks]

Before accepting production traffic, confirm that each control has an owner, an alert, and a recovery test.

| Check                | Minimum evidence                                                                                                                                       |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Replica posture      | Application services that should survive pod or node loss have more than one replica, and singleton services are documented as singleton dependencies. |
| Scheduling posture   | Node pools span failure domains, and placement rules do not pin all replicas to one node group or zone.                                                |
| Managed PostgreSQL   | HA mode, PITR, backup retention, credentials, failover behaviour, and application reconnection are tested.                                             |
| Managed cache        | The cache service has HA or failover enabled where the provider supports it. DALP credentials and TLS settings are tested.                             |
| Object storage       | Buckets, retention, versioning or replication posture, and restore access match the selected RPO.                                                      |
| Kubernetes resources | Velero, GitOps, or another approved recovery method can restore required namespace resources.                                                          |
| Observability        | Alerts cover API availability, pod restarts, readiness failures, database failover, cache availability, object storage access, and backup status.      |
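
Part of the observability row can be sketched as alert rules. The example below assumes the Prometheus Operator and kube-state-metrics are installed, and that DALP runs in a namespace called `dalp`. The rule names, thresholds, and durations are illustrative assumptions, and the example covers only the Kubernetes-level signals, not database, cache, object storage, or backup status.

```yaml
# Sketch of pod-level availability alerts. Requires the Prometheus Operator
# CRDs and kube-state-metrics; namespaces and thresholds are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dalp-availability          # hypothetical name
  namespace: monitoring            # assumed monitoring namespace
spec:
  groups:
    - name: dalp-availability
      rules:
        # Fires when any Deployment in the namespace has unavailable replicas.
        - alert: DalpDeploymentReplicasUnavailable
          expr: kube_deployment_status_replicas_unavailable{namespace="dalp"} > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Deployment {{ $labels.deployment }} has unavailable replicas"
        # Fires when a container restarts more than three times in 30 minutes.
        - alert: DalpPodRestartingRepeatedly
          expr: increase(kube_pod_container_status_restarts_total{namespace="dalp"}[30m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting repeatedly"
```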

## Recovery expectations [#recovery-expectations]

Cloud-native HA reduces downtime for ordinary pod, node, and zone failures, but the exact RTO and RPO come from the selected cloud services and the operator's runbooks.

| Failure type          | Expected handling                                                               | What to verify                                                                                                       |
| --------------------- | ------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| Pod failure           | Kubernetes removes the pod from service and starts a replacement.               | Readiness and liveness probes are enabled and alerting detects repeated restarts.                                    |
| Node failure          | The scheduler places replacement pods on healthy nodes.                         | Node pools span failure domains and placement rules do not pin all replicas to one node group.                       |
| Zone failure          | Surviving zones continue serving if capacity and dependencies remain available. | Application replicas, managed PostgreSQL, cache, ingress, and object storage all have a tested zone-failure posture. |
| Database failure      | Managed PostgreSQL handles failover according to the provider's HA model.       | PITR, backup retention, failover behaviour, credentials, and application reconnection are tested.                    |
| Cluster resource loss | Backups or declarative deployment state rebuild Kubernetes resources.           | Velero or the platform's GitOps process can restore the required resources.                                          |

## Provider patterns [#provider-patterns]

| Provider family | Typical building blocks                                                                                                                               |
| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| AWS             | EKS across availability zones, RDS Multi-AZ, ElastiCache Multi-AZ, S3, and provider backup policies.                                                  |
| Azure           | AKS with zone-aware node pools, Azure Database zone-redundant HA, Azure Cache zone redundancy, Blob Storage ZRS or GRS, and provider backup policies. |
| GCP             | Regional GKE, multi-zone node pools, Cloud SQL Regional HA, Memorystore Standard tier, Cloud Storage, and provider backup policies.                   |
| OpenShift       | Multi-master OpenShift, worker nodes across failure domains, OpenShift routing, and approved persistent storage or managed data services.             |

## When not to use this pattern [#when-not-to-use-this-pattern]

Choose another HA pattern when the deployment needs a different recovery model:

* Use [hot-warm](/docs/architecture/self-hosting/high-availability/hot-warm) when a standby environment must be ready but not fully active.
* Use [hot-cold](/docs/architecture/self-hosting/high-availability/hot-cold) when cost matters more than recovery speed.
* Use [hot-hot](/docs/architecture/self-hosting/high-availability/hot-hot) only when the operating model, data consistency controls, and network design can support active-active service.
* Use [backup and recovery](/docs/architecture/self-hosting/high-availability/backup-recovery) to define restore tests, evidence, and RTO/RPO validation for any pattern.

## Related pages [#related-pages]

* [High availability overview](/docs/architecture/self-hosting/high-availability)
* [Self-hosting prerequisites](/docs/architecture/self-hosting/prerequisites)
* [Installation process](/docs/architecture/self-hosting/installation-process)
* [Backup and recovery](/docs/architecture/self-hosting/high-availability/backup-recovery)