Durable Execution Engine recovery

Recover Durable Execution Engine-backed DALP workflows with operator-only DAPI routes for health checks, redeployment, stale deployment cleanup, and stuck workflow recovery.

The Durable Execution Engine recovery routes sit behind the DALP operator API. They are for platform operators and trusted automation that already have system operate permission. They are not tenant-facing product routes.

Always run doctor first. It reads Durable Execution Engine ingress, admin API, deployment, service, and invocation health without changing workflow state. Use a write route only after the doctor response identifies the unhealthy component. Run doctor again after the write route to confirm the result.

DALP records each operator route attempt in the operations audit log. If a request includes a URL with username or password information, the audit row redacts that user information before storing the request arguments.

The recovery path is inspect, act once, then verify. The force-redeploy route handles service registration, cleanup-stale-deployments handles old deployment records, and recover-stuck-workflow handles one failed workflow key after DALP verifies that retry is safe.
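
The inspect, act once, verify sequence can be sketched as a small Python helper. This is an illustration only: the `call` parameter is a hypothetical transport function (route name and JSON body in, parsed JSON out) that is not part of DALP, and the helper does not decide which write route is appropriate.

```python
from typing import Callable

# Hypothetical transport: takes a route name and a JSON body, returns parsed JSON.
Call = Callable[[str, dict], dict]

# The three write routes named in this guide.
WRITE_ROUTES = {"force-redeploy", "cleanup-stale-deployments", "recover-stuck-workflow"}


def inspect_act_verify(call: Call, write_route: str, body: dict) -> dict:
    """Run doctor, exactly one write route, then doctor again."""
    if write_route not in WRITE_ROUTES:
        raise ValueError(f"not a recovery write route: {write_route}")
    before = call("doctor", {})       # inspect: read-only
    result = call(write_route, body)  # act once
    after = call("doctor", {})        # verify: read-only
    return {"before": before, "result": result, "after": after}
```

Keeping both doctor readouts next to the write result gives the incident record a before and after view of the one component that was changed.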

Choose the recovery action

Start from the doctor readout and change only the component that is unhealthy. This keeps the incident narrow and gives the next operator a clean audit trail.

| Doctor signal | Use this route | Why |
| --- | --- | --- |
| The service URL is missing or the service needs to be registered again | force-redeploy | Registers the live durable workflow service URL and returns the new deployment id. |
| More than one deployment record exists and you know which service URL should stay active | cleanup-stale-deployments | Keeps the matching deployment and drains/deletes other deployment records. |
| A specific workflow key is terminal-failed and safe to replay | recover-stuck-workflow | Kills and purges terminal-failed invocations for that key, then clears keyed workflow state. |
| The admin API, DNS, or network path is unreachable | Fix connectivity first, then rerun doctor | The write routes depend on the same admin API path. |
| The workflow has an active invocation or already completed successfully | Do not clear state with this route | DALP returns RESTATE_WORKFLOW_RETRY_BLOCKED because replay would be unsafe or unnecessary. |
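
As a rough triage aid, the table above can be approximated in Python. The mapping from doctor fields to each signal is an assumption made for this sketch, and the return labels `fix-connectivity` and `none` are hypothetical. A `recover-stuck-workflow` suggestion only flags a candidate: the operator still has to judge whether replay is safe.

```python
def suggest_action(doctor: dict) -> str:
    """Map a doctor readout to the table row it most likely matches."""
    # Connectivity comes first: the write routes share the admin API path.
    if doctor["admin"]["status"] != "ok":
        return "fix-connectivity"
    items = doctor["deployments"].get("items") or []
    if not items:
        # Assumption: no deployment rows means the service URL is not registered.
        return "force-redeploy"
    if len(items) > 1:
        return "cleanup-stale-deployments"
    failures = doctor["invocations"].get("recentFailures") or []
    # Only failures that carry a workflow key can be recovered by key.
    if any(f.get("serviceKey") for f in failures):
        return "recover-stuck-workflow"
    return "none"
```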

Prerequisites

Confirm these facts before you call a recovery route:

| Requirement | Value |
| --- | --- |
| API access | DALP admin API for the environment you operate. |
| Permission | System operate permission on the operator account or API key. |
| Authentication header | X-Api-Key: <operator api key>. |
| Service URL | Required for redeploy and cleanup. Use the exact deployments.items[].serviceUrl value from doctor when cleaning stale deployments. |
| Workflow identifiers | Required for workflow recovery. Use serviceName and serviceKey from a failed invocation record or workflow metadata. |
| Audit expectation | Every operator route writes an operations audit row with the actor, role, route, sanitized arguments, start time, finish time, outcome, and error when the call fails. |

Set the example environment variables:

export DALP_API_URL="https://platform.example.com"
export DALP_API_KEY="sm_dalp_operator_1234567890"
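
For scripted use, the same two variables can feed a small request builder. The sketch below uses only the Python standard library; the `build_request` helper and `BASE_PATH` constant are hypothetical names, and the function builds the request without sending it.

```python
import json
import os
import urllib.request

# Common prefix of the four operator routes shown in this guide.
BASE_PATH = "/api/v2/admin/operator/restate"


def build_request(route: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for one operator route."""
    base = os.environ["DALP_API_URL"].rstrip("/")
    return urllib.request.Request(
        url=f"{base}{BASE_PATH}/{route}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Api-Key": os.environ["DALP_API_KEY"],
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. That call raises `urllib.error.HTTPError` on 4xx and 5xx responses, so a caller would need to catch it to read an error body.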

Quickstart: inspect and re-register a workflow service

This quickstart uses one coherent recovery case: doctor reports the live durable workflow service URL, you re-register that URL, then you check doctor again.

1. Run doctor

curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

Example response:

{
  "ingress": { "status": "ok", "latencyMs": 12 },
  "admin": { "status": "ok", "latencyMs": 9, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:00:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}

2. Re-register the service URL

Use the serviceUrl value from deployments.items. Keep the string exact, including scheme, host, port, path, and trailing slash if present.

curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/force-redeploy" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "serviceUrl": "http://ddwf:9080/v1",
    "force": true
  }'

Example response:

{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}

3. Verify with doctor

curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

Example response:

{
  "ingress": { "status": "ok", "latencyMs": 10 },
  "admin": { "status": "ok", "latencyMs": 8, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:10:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}

Redeploy changes service registration. It does not clear invocation failures. If recentFailures still names a blocked workflow key, inspect that workflow before closing the incident.

If doctor still shows old deployment records beside the deployment you want to keep, continue with stale deployment cleanup.
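
Comparing the two doctor readouts by hand is error-prone when ids are long. A minimal sketch of that comparison follows; `diff_readouts` is a hypothetical helper, and it assumes both readouts contain the `deployments.items` and `invocations.recentFailures` arrays shown in the examples above.

```python
def diff_readouts(before: dict, after: dict) -> dict:
    """Compare two doctor readouts taken around a write route."""
    old = {d["id"] for d in before["deployments"]["items"]}
    new = {d["id"] for d in after["deployments"]["items"]}
    failed_before = {f["id"] for f in before["invocations"]["recentFailures"]}
    failed_after = {f["id"] for f in after["invocations"]["recentFailures"]}
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        # Failures present in both readouts were not cleared by the write route.
        "remaining_failures": sorted(failed_before & failed_after),
    }
```

For the quickstart readouts, this would list the new deployment id under `added` and keep the failed invocation under `remaining_failures`, which matches the note that redeploy does not clear invocation failures.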

Recovery order

| Step | Route | Purpose | State change |
| --- | --- | --- | --- |
| 1 | POST /api/v2/admin/operator/restate/doctor | Inspect current health. | None. |
| 2 | One write route | Fix the one unhealthy component doctor identified. | Depends on the route. |
| 3 | POST /api/v2/admin/operator/restate/doctor | Confirm the affected component moved to ok or that the remaining failure is understood. | None. |

Do not skip the first doctor call. The write routes share the admin API dependency, and a broken admin path makes the write routes fail or block.

Doctor

doctor is read-only. It probes the Durable Execution Engine ingress URL, admin API, deployment list, service list, and invocation table. A failed probe marks only its own component as degraded or unreachable. The route itself fails only when DALP cannot resolve the admin URL before probing.

POST /api/v2/admin/operator/restate/doctor

Request body

Send an empty JSON object.

{}

Response fields

| Field | Type | Meaning |
| --- | --- | --- |
| ingress.status | ok, unreachable, or degraded | Ingress probe result. |
| ingress.latencyMs | Number in milliseconds or null | Probe latency when measured. |
| admin.status | ok, unreachable, or degraded | Admin API probe result. |
| admin.latencyMs | Number in milliseconds or null | Admin API probe latency when measured. |
| admin.version | String or null | Admin API version when returned. |
| deployments.status | ok, unreachable, or degraded | Deployment list probe result. |
| deployments.items[].id | String | Durable Execution Engine deployment id. |
| deployments.items[].serviceUrl | URL string | Registered service URL. Preserve the exact value for cleanup. |
| deployments.items[].createdAt | ISO 8601 UTC timestamp | Deployment creation time returned by the admin API. |
| services.status | ok, unreachable, or degraded | Service list probe result. |
| services.items[].name | String | Registered service name. |
| services.items[].revision | Number or omitted | Service revision when returned. |
| invocations.status | ok, unreachable, or degraded | Invocation table probe result. |
| invocations.byStatus | Object keyed by invocation status | Invocation counts by status. |
| invocations.recentFailures | Array, maximum 20 items | Recent failed invocations ordered by most recently modified first. |
| invocations.recentFailures[].serviceKey | String or null | Workflow key for recovery. Use only when it is present. |
| error on readout objects | String or null | Human-readable probe failure reason for that component. |
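
Because each component reports its own status and error, a readout can be reduced to the components that need attention. The sketch below assumes the five top-level keys listed in the table; `unhealthy_components` is a hypothetical helper name.

```python
# Top-level readout objects in the doctor response.
COMPONENTS = ("ingress", "admin", "deployments", "services", "invocations")


def unhealthy_components(doctor: dict) -> list:
    """Return (component, status, error) for every component that is not ok."""
    findings = []
    for name in COMPONENTS:
        readout = doctor.get(name) or {}
        status = readout.get("status")  # None when the readout is missing
        if status != "ok":
            findings.append((name, status, readout.get("error")))
    return findings
```

An empty list means every probe returned ok. Any other result names the single component to address before choosing a write route.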

Force redeploy

force-redeploy registers the durable workflow service URL with the Durable Execution Engine admin API. It returns the deployment id for the registration. It does not drain or delete old deployments.

POST /api/v2/admin/operator/restate/force-redeploy

Request body

| Field | Type | Required | Constraints and defaults |
| --- | --- | --- | --- |
| serviceUrl | URL string | Yes | Must be the live durable workflow service URL that should be registered. Use configured service state or deployment metadata. Do not reuse an old doctor URL when you are replacing a stale deployment. Credentials in URLs are redacted in the operations audit log. |
| force | Boolean | No | Defaults to true. When true, DALP sends a forced registration request to the admin API. |

Example request body:

{
  "serviceUrl": "http://ddwf:9080/v1",
  "force": true
}

Success response

{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}

Operational notes

  • Run doctor before this route so you can compare deployment state before and after registration.
  • Run cleanup-stale-deployments after redeploy when doctor still shows old deployment rows.
  • Do not treat acknowledged: true as stale deployment cleanup. It means registration completed.

Cleanup stale deployments

cleanup-stale-deployments keeps the deployment matching serviceUrl. DALP resolves that deployment id server-side, then drains and deletes every other registered deployment. A failure to list deployments returns an admin-unreachable error. Per-deployment drain or delete failures are logged by the cleanup helper and do not change the success response shape.

POST /api/v2/admin/operator/restate/cleanup-stale-deployments

Request body

| Field | Type | Required | Constraints and defaults |
| --- | --- | --- | --- |
| serviceUrl | URL string | Yes | Must exactly match the registered service URL of the deployment to keep. Copy it from deployments.items[].serviceUrl in doctor. |
| forceDrain | Boolean | No | Defaults to false. Use true only when stale deployments point to dead services and cannot drain normally. |

Example request body:

{
  "serviceUrl": "http://ddwf:9080/v1",
  "forceDrain": false
}

Success response

{
  "acknowledged": true
}

Operational notes

  • forceDrain: false is the normal production value. DALP drains stale deployments through the admin API, but it can still kill pinned invocations when a stale deployment is unreachable and cannot drain normally.
  • forceDrain: true is for a known dead stale deployment where you accept forced drain behavior even when pending invocations remain.
  • Run doctor after cleanup and inspect deployments.items before you close the incident.
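
Before calling cleanup, it can help to preview which deployment rows would be kept and which would be treated as stale. The sketch below runs client-side against doctor's `deployments.items` and uses exact string comparison. It is an approximation of the documented behavior, not DALP's server-side logic, and `plan_cleanup` is a hypothetical name.

```python
def plan_cleanup(items: list, service_url: str) -> dict:
    """Split doctor's deployment items into the rows to keep and the stale rest."""
    # Exact string match: a trailing slash or different port is a different URL.
    keep = [d["id"] for d in items if d["serviceUrl"] == service_url]
    if not keep:
        # Roughly corresponds to the case DALP reports as RESTATE_DEPLOYMENT_NOT_FOUND.
        raise LookupError(f"no registered deployment has serviceUrl {service_url!r}")
    stale = [d["id"] for d in items if d["serviceUrl"] != service_url]
    return {"keep": keep, "stale": stale}
```

If the preview raises, rerun doctor and copy the URL again rather than editing it by hand.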

Recover stuck workflow

recover-stuck-workflow prepares one workflow key for retry. DALP queries prior invocations for the supplied (serviceName, serviceKey) pair, refuses unsafe recovery states, kills and purges terminal-failed invocations, then clears keyed workflow state. The next workflow submission starts from a blank state.

POST /api/v2/admin/operator/restate/recover-stuck-workflow

Request body

| Field | Type | Required | Constraints and defaults |
| --- | --- | --- | --- |
| serviceName | String | Yes | Must match ^[a-zA-Z0-9_-]+$. Use the service name from doctor, failed invocation metadata, or workflow metadata. No default. |
| serviceKey | String | Yes | Must match ^[a-zA-Z0-9_-]+$. Use the exact workflow key for the failed invocation. No default. |

Example request body:

{
  "serviceName": "IdentityRecoveryWorkflow",
  "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1"
}
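
Both identifiers can be checked locally before the request is sent. This sketch applies the character class from the table with Python's `re.fullmatch`; the helper name is hypothetical, and the server remains the authority on what it accepts.

```python
import re

# Same character class as the documented pattern ^[a-zA-Z0-9_-]+$.
IDENTIFIER = re.compile(r"[a-zA-Z0-9_-]+")


def is_valid_identifier(value: str) -> bool:
    """True when value consists only of letters, digits, underscores, or hyphens."""
    return IDENTIFIER.fullmatch(value) is not None
```

`fullmatch` is used instead of `^...$` with `re.match` because Python's `$` also matches just before a trailing newline, which would let a value with a stray newline pass.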

Success response

{
  "acknowledged": true
}

Blocked recovery response

DALP refuses recovery when the workflow has an active invocation or already succeeded. The error includes a reason and the relevant invocation ids.

{
  "code": "RESTATE_WORKFLOW_RETRY_BLOCKED",
  "message": "Workflow retry is blocked by an active invocation",
  "data": {
    "reason": "active-invocation",
    "invocationIds": ["inv_01j8m7p2q3r4s5t6u7v8w9x0z2"]
  }
}

Retry blocked reasons

| Reason | What DALP observed | State change | Operator response | Retry semantics |
| --- | --- | --- | --- | --- |
| active-invocation | A matching run invocation is not terminal. DALP treats every status except completed, killed, and cancelled as active, including pending, scheduled, ready, running, suspended, backing-off, paused, and unknown future statuses. | DALP does not kill, purge, or clear state. | Inspect the returned invocation ids. Wait for completion or handle the active invocation through the Durable Execution Engine admin tooling before retrying. | Retry only after the active invocation is no longer active. |
| already-succeeded | A prior matching invocation already completed successfully. | DALP does not kill, purge, or clear state. | Do not replay the workflow blindly. Investigate why the caller or UI still reports a stuck state. | Treat as terminal for recovery unless new evidence shows a different workflow key is stuck. |
| purge-failed | DALP could not purge a terminal-failed invocation for a non-transport reason. | DALP may have completed earlier recovery steps before the purge failed. | Inspect admin logs and retry after the purge failure is resolved. | Retry after the purge condition is fixed. Transport failures surface as RESTATE_ADMIN_UNREACHABLE. |
| query-failed | DALP could not query invocation state for a non-transport reason. | DALP does not clear workflow state. | Escalate the malformed or unexpected admin response. | Retry only after the query path is healthy. Transport failures surface as RESTATE_ADMIN_UNREACHABLE. |
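
The active-status rule and the per-reason guidance can be condensed into a small lookup. This is a sketch for operator tooling, not DALP's implementation: the labels `wait`, `stop`, `fix-then-retry`, and `escalate` are hypothetical, and the error shape is taken from the blocked recovery response example.

```python
# Statuses the table treats as terminal; everything else counts as active.
TERMINAL_STATUSES = {"completed", "killed", "cancelled"}


def is_active(status: str) -> bool:
    """Mirror the table's rule, including unknown future statuses."""
    return status not in TERMINAL_STATUSES


# Coarse next step per blocked reason (labels are illustrative).
RETRY_GUIDANCE = {
    "active-invocation": "wait",        # retry once the invocation is no longer active
    "already-succeeded": "stop",        # treat as terminal for recovery
    "purge-failed": "fix-then-retry",   # resolve the purge failure first
    "query-failed": "fix-then-retry",   # retry only after the query path is healthy
}


def next_step(error: dict) -> str:
    """Pick a next step from a RESTATE_WORKFLOW_RETRY_BLOCKED error body."""
    reason = error.get("data", {}).get("reason")
    return RETRY_GUIDANCE.get(reason, "escalate")
```

Treating an unrecognized reason as `escalate` keeps the tooling conservative if new reasons are added later.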

Error reference

| Error | HTTP class | What DALP observed | State change and audit behavior | Operator response | Retry semantics |
| --- | --- | --- | --- | --- | --- |
| Missing system operate permission | Authorization failure | The caller is authenticated but not authorized for operator routes, or the API key lacks the required permission. | The recovery handler does not run. Authorization failure behavior follows the operator route authorization layer. | Use an operator account or API key with system operate permission. | Retry only with corrected credentials. |
| RESTATE_ADMIN_UNREACHABLE | 5xx class | DALP could not resolve the admin URL or could not reach the Durable Execution Engine admin API. | The target recovery action is not acknowledged. Operator middleware records the failed attempt when the route reaches the operator handler. | Check admin URL configuration, DNS, network policy, and admin API health. Run doctor after the admin path recovers. | Retry after connectivity or configuration is fixed. |
| RESTATE_DEPLOYMENT_NOT_FOUND | 404 class | The supplied serviceUrl does not map to a registered deployment when DALP needs that mapping. | DALP does not clean stale deployments because it cannot identify the deployment to keep. Operator middleware records the failed attempt. | Run doctor and retry with the exact registered serviceUrl. | Retry with the exact service URL from doctor. |
| RESTATE_WORKFLOW_RETRY_BLOCKED | 409 class for active or succeeded workflow state | Workflow recovery found a condition that makes clearing state unsafe. | DALP does not clear state for active or already-succeeded workflows. The error returns reason and invocationIds. Operator middleware records the failed attempt. | Inspect the returned reason and invocation ids before taking further action. | Depends on reason; use the retry blocked reasons table. |

DAPI error responses include the normal request identifier for support correlation. Keep that identifier with the incident record and the operations audit row.

Audit, security, and production boundaries

| Topic | Behavior |
| --- | --- |
| Permission boundary | All four routes require system operate permission and run through operator route middleware. |
| Audit log | Operator route attempts create an operations audit row with actor user id, role at time, route, sanitized arguments, reason when supplied, start time, finish time, outcome, and error on failure. |
| Credentialed URL redaction | If an argument contains a URL with username or password information, the stored audit argument replaces that user information with REDACTED and preserves the host, path, query, and fragment. |
| Tenant scope | The operations audit log is platform-global and keyed by operator actor, not tenant row ownership. |
| PCI scope | These recovery routes do not accept cardholder data. Do not put card data, credentials, secrets, or private keys in request bodies. |
| KYC scope | These routes do not perform KYC checks. They recover workflow infrastructure state and do not change identity verification status. |
| Idempotency | The routes do not document an idempotency key contract. Use doctor before and after each write route instead of blind retries. |
| Timestamps | Doctor timestamps use ISO 8601 UTC strings. |
| Availability | A healthy doctor response is operational evidence for this recovery surface. It is not an SLA, failover drill, or disaster recovery proof. |
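
To illustrate the credentialed URL redaction described above, the following sketch replaces the user information part of a URL and leaves the rest untouched. The exact stored format is an assumption: DALP may render the redacted portion differently, and `redact_url` is a hypothetical helper, not the platform's code.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Replace any user information in a URL with the literal REDACTED."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url  # no credentials to hide
    # Everything after the last "@" in the authority is host and optional port.
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=f"REDACTED@{host}"))
```

The same function could be applied to logs or tickets written by operator tooling so that credentials do not leak outside the audit log either.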
