SettleMint

Cloud-native high availability

Use managed Kubernetes, managed data services, multi-zone placement, health probes, and backup tooling as the default high availability pattern for self-hosted DALP deployments.

Cloud-native high availability is the default pattern for self-hosted DALP when you can use managed Kubernetes, PostgreSQL, cache, object storage, and provider backup services in one region. The platform runs the application tier across availability zones, while managed services carry the strongest data durability and failover guarantees.

Start with cloud-native HA before considering hot-warm, hot-cold, or hot-hot designs. Alternative HA designs add cost and coordination work. The cloud-native pattern is usually the cleanest starting point for one-region production environments.

Architecture

[Architecture diagram]

Smallest production overlay

Start with the default cloud-native posture: keep the application replicas enabled and use the built-in probes. When managed services replace bundled stateful services, disable the bundled services in the same installation-specific values file. Then provide the external PostgreSQL, cache, object storage, route, resource, and observability values for that environment.

```yaml
# values-production-ha.yaml for wrapper charts such as dalp-local or dalp-staging
dalp:
  dapp:
    replicaCount: 2
    podAntiAffinityPreset: soft

support:
  postgresql:
    enabled: false
  redis:
    enabled: false
  rustfs:
    enabled: false
```

If you install the dalp chart directly, place the dApp values under dapp instead of dalp.dapp. If you install the support chart separately, place the support values at the root of that chart's values file.
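
As a sketch of the direct-install case, the same overlay could be split into one values file per chart. The file names below are illustrative assumptions; the keys are the ones from the wrapper example, moved to the level each chart expects.

```yaml
# values-dalp.yaml (hypothetical file name), used when installing the dalp chart directly.
# The dApp values sit under dapp instead of dalp.dapp.
dapp:
  replicaCount: 2
  podAntiAffinityPreset: soft
```

```yaml
# values-support.yaml (hypothetical file name), used when installing the support chart separately.
# The support values sit at the root of the file.
postgresql:
  enabled: false
redis:
  enabled: false
rustfs:
  enabled: false
```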

Keep the overlay intentionally small. The example shows the availability boundary without exposing credentials. A production values file also needs the target environment's PostgreSQL, Redis or Valkey, object storage, route, resource, and observability values.

What this pattern covers

| Layer | Cloud-native responsibility | DALP configuration surface |
| --- | --- | --- |
| Application pods | Run multiple replicas and let Kubernetes replace unhealthy pods. | DALP chart values define replica counts, rolling update strategy, readiness probes, liveness probes, affinity, node selectors, and tolerations for services. |
| Zone placement | Spread application pods across failure domains when the cluster exposes zones. | Configure node pools and scheduling rules in the target environment. Use chart affinity and topology spread controls where the specific component exposes them. |
| PostgreSQL | Use provider-managed HA, point-in-time recovery, and backup retention where available. | Configure external PostgreSQL through global.datastores.*.postgresql connection settings or an existing secret. |
| Cache | Use a managed Redis or Valkey service when available. | Provide external Redis connection settings or an existing secret instead of relying on an in-cluster cache for production HA. |
| Object storage | Use cloud-provider or S3-compatible object storage for application files, document uploads, and backup targets. | Configure the storage endpoint and buckets before deployment. |
| Backup tooling | Use provider backups for managed services and Velero only for Kubernetes resources when needed. | Enable Velero only when the environment requires Kubernetes resource backup and restore. Keep database and object storage backup ownership explicit in the runbook. |

Application health and placement

DALP chart deployments expose Kubernetes health and placement controls rather than hiding availability behind a single switch.

For the dApp service, the chart sets two replicas by default and uses TCP liveness and readiness probes on the HTTP port. The chart also exposes pod affinity, pod anti-affinity, node affinity, node selectors, and tolerations. The default pod anti-affinity preset is soft. That setting lets the scheduler prefer separating replicas without blocking scheduling when the cluster is small.
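
To make those two defaults concrete, here is roughly what TCP probes and a soft anti-affinity preference look like in plain Kubernetes terms. This is a generic Deployment sketch, not the manifest the DALP chart renders; the name, labels, image, port number, and topology key are all hypothetical.

```yaml
# Illustrative only: a generic Deployment, not the DALP chart's rendered output.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-dapp                      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: example-dapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: example-dapp
    spec:
      affinity:
        podAntiAffinity:
          # "Soft" anti-affinity: a scheduling preference, not a hard requirement,
          # so both replicas can still land on one node in a small cluster.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: example-dapp
      containers:
        - name: dapp
          image: registry.example.com/dapp:1.0.0   # placeholder image
          ports:
            - name: http
              containerPort: 3000                  # placeholder port
          # TCP probes only confirm that the port accepts connections.
          livenessProbe:
            tcpSocket:
              port: http
          readinessProbe:
            tcpSocket:
              port: http
```

A TCP probe passes as soon as the socket is open, so it cannot detect an application that accepts connections but returns errors. That is one reason to pair the probes with the alerting described in the checklist.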

For production, verify these controls before go-live:

  1. Keep at least two dApp replicas unless a documented maintenance window requires a temporary scale-down.
  2. Place worker nodes across the availability zones that the cloud region supports.
  3. Use anti-affinity or topology spread rules for components that expose them so a single node or zone does not hold every replica.
  4. Keep liveness and readiness probes enabled and alert on repeated restarts, readiness failures, and unavailable replicas.
  5. Set resource requests and limits so replacement pods can be scheduled during a node or zone incident.
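
Items 3 and 5 can be sketched as a pod template fragment using standard Kubernetes fields. Whether and where a given DALP component exposes these fields through chart values is not stated here, so treat the label, image, and sizes as placeholders to adapt.

```yaml
# Fragment of a Deployment's spec.template.spec; all values are placeholders.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    # ScheduleAnyway keeps zone spreading a preference.
    # DoNotSchedule would make it a hard rule and can leave pods Pending.
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: example-dapp   # hypothetical label
containers:
  - name: dapp
    image: registry.example.com/dapp:1.0.0     # placeholder image
    resources:
      # Requests are what the scheduler reserves, so they decide whether a
      # replacement pod fits on the surviving nodes during an incident.
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        memory: 1Gi
```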

Other DALP components can expose different probe endpoints and placement controls. Review the component chart before applying one component's probe path or scheduling setting to another.

Data services and backups

The recommended cloud-native model keeps state in managed services where the cloud provider supplies the failover and durability controls. DALP then connects to those services through chart values and secrets.

Use this split for production planning:

  1. Put PostgreSQL outside the application cluster when managed HA and PITR are available.
  2. Put Redis or Valkey and object storage in managed services where the platform allows it.
  3. Use Velero for Kubernetes resource recovery only when the environment requires cluster-level backups.
  4. Test restore procedures before production traffic goes live.
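
For item 3, a Velero schedule limited to Kubernetes resources might look like the following sketch. It assumes Velero is already installed in a namespace called velero and that the application runs in a namespace called dalp; both names, the schedule name, and the retention period are assumptions.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: dalp-resources-daily   # hypothetical name
  namespace: velero            # assumed Velero installation namespace
spec:
  schedule: "0 2 * * *"        # cron expression: daily at 02:00
  template:
    includedNamespaces:
      - dalp                   # assumed application namespace
    # Back up Kubernetes resources only. Database and object storage
    # backups stay with the managed services.
    snapshotVolumes: false
    ttl: 720h0m0s              # keep each backup for 30 days
```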

When managed services are not available, use the self-hosting prerequisites and backup pages to plan in-cluster PostgreSQL, object storage, and backup ownership.

Running state in-cluster is a separate operating model that needs its own recovery plan. It is not a small toggle on the cloud-native pattern.

Production readiness checks

Before production traffic, confirm each control has an owner, an alert, and a recovery test.

| Check | Minimum evidence |
| --- | --- |
| Replica posture | Application services that should survive pod or node loss have more than one replica, and singleton services are documented as singleton dependencies. |
| Scheduling posture | Node pools span failure domains, and placement rules do not pin all replicas to one node group or zone. |
| Managed PostgreSQL | HA mode, PITR, backup retention, credentials, failover behaviour, and application reconnection are tested. |
| Managed cache | The cache service has HA or failover enabled where the provider supports it. DALP credentials and TLS settings are tested. |
| Object storage | Buckets, retention, versioning or replication posture, and restore access match the selected RPO. |
| Kubernetes resources | Velero, GitOps, or another approved recovery method can restore required namespace resources. |
| Observability | Alerts cover API availability, pod restarts, readiness failures, database failover, cache availability, object storage access, and backup status. |
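
The pod-level part of the observability check could be expressed as alert rules like the ones below. This sketch assumes the Prometheus Operator and kube-state-metrics are installed, which the page does not state. The dalp namespace, rule names, and thresholds are placeholders. Database, cache, object storage, and backup alerts depend on the provider and are not shown.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dalp-availability        # hypothetical name
  namespace: dalp                # assumed application namespace
spec:
  groups:
    - name: dalp-availability
      rules:
        # Repeated container restarts within a short window.
        - alert: DalpPodRestartingRepeatedly
          expr: increase(kube_pod_container_status_restarts_total{namespace="dalp"}[15m]) > 2
          for: 5m
          labels:
            severity: warning
        # A pod that stays not-ready, for example because its readiness probe fails.
        - alert: DalpPodNotReady
          expr: kube_pod_status_ready{namespace="dalp", condition="false"} == 1
          for: 10m
          labels:
            severity: warning
        # A Deployment running below its desired replica count.
        - alert: DalpDeploymentReplicasUnavailable
          expr: kube_deployment_status_replicas_unavailable{namespace="dalp"} > 0
          for: 10m
          labels:
            severity: critical
```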

Recovery expectations

Cloud-native HA reduces downtime for ordinary pod, node, and zone failures, but the exact RTO and RPO come from the selected cloud services and the operator's runbooks.

| Failure type | Expected handling | What to verify |
| --- | --- | --- |
| Pod failure | Kubernetes removes the pod from service and starts a replacement. | Readiness and liveness probes are enabled and alerting detects repeated restarts. |
| Node failure | The scheduler places replacement pods on healthy nodes. | Node pools span failure domains and placement rules do not pin all replicas to one node group. |
| Zone failure | Surviving zones continue serving if capacity and dependencies remain available. | Application replicas, managed PostgreSQL, cache, ingress, and object storage all have a tested zone-failure posture. |
| Database failure | Managed PostgreSQL handles failover according to the provider's HA model. | PITR, backup retention, failover behaviour, credentials, and application reconnection are tested. |
| Cluster resource loss | Backups or declarative deployment state rebuild Kubernetes resources. | Velero or the platform's GitOps process can restore the required resources. |
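
For the cluster resource loss row, one way to rehearse recovery with Velero is to restore a backup into a scratch namespace and inspect the result. The sketch below assumes Velero is in use. The backup name, namespaces, and restore name are hypothetical and must match what exists in the target environment.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: dalp-restore-test            # hypothetical name
  namespace: velero                  # assumed Velero installation namespace
spec:
  backupName: dalp-resources-backup  # hypothetical: name of an existing Velero backup
  includedNamespaces:
    - dalp                           # assumed application namespace
  # Restore into a scratch namespace so the test does not touch the live one.
  namespaceMapping:
    dalp: dalp-restore-test
  restorePVs: false                  # resources only; data comes from managed services
```

A restore into a mapped namespace checks that the manifests come back. It does not prove that the application reconnects to PostgreSQL, cache, and object storage, so those checks still belong in the recovery test.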

Provider patterns

| Provider family | Typical building blocks |
| --- | --- |
| AWS | EKS across availability zones, RDS Multi-AZ, ElastiCache Multi-AZ, S3, and provider backup policies. |
| Azure | AKS with zone-aware node pools, Azure Database zone-redundant HA, Azure Cache zone redundancy, Blob Storage ZRS or GRS, and provider backup policies. |
| GCP | Regional GKE, multi-zone node pools, Cloud SQL Regional HA, Memorystore Standard tier, Cloud Storage, and provider backup policies. |
| OpenShift | Multi-master OpenShift, worker nodes across failure domains, OpenShift routing, and approved persistent storage or managed data services. |

When not to use this pattern

Choose another HA pattern when the deployment needs a different recovery model:

  • Use hot-warm when a standby environment must be ready but not fully active.
  • Use hot-cold when cost matters more than recovery speed.
  • Use hot-hot only when the operating model, data consistency controls, and network design can support active-active service.
  • Use the backup and recovery guidance to define restore tests, evidence, and RTO/RPO validation for any pattern.
