Durable Execution Engine recovery operations in DALP

Recover Durable Execution Engine-backed DALP workflows with operator-only DAPI routes for health checks, redeployment, stale deployment cleanup, and stuck workflow recovery.

The Durable Execution Engine recovery routes sit behind the DALP operator API. They are for platform operators and trusted automation that already have system operate permission. They are not tenant-facing product routes.

Always run doctor first. It reads Durable Execution Engine ingress, admin API, deployment, service, and invocation health without changing workflow state. Use a write route only after the doctor response identifies the unhealthy component. Run doctor again after the write route to confirm the result.

DALP records each operator route attempt in the operations audit log. If a request includes a URL with username or password information, the audit row redacts that user information before storing the request arguments.

The recovery path is inspect, act once, then verify: run doctor, call exactly one write route, then run doctor again. force-redeploy handles service registration. cleanup-stale-deployments handles old deployment records. recover-stuck-workflow handles one failed workflow key after DALP verifies that retry is safe.

Choose the recovery action

Start from the doctor readout and change only the component that is unhealthy. This keeps the incident narrow and gives the next operator a clean audit trail.

Doctor signal	Use this route	Why
The service URL is missing or the service needs to be registered again	`force-redeploy`	Registers the live durable workflow service URL and returns the new deployment id.
More than one deployment record exists and you know which service URL should stay active	`cleanup-stale-deployments`	Keeps the matching deployment and drains/deletes other deployment records.
A specific workflow key is terminal-failed and safe to replay	`recover-stuck-workflow`	Kills and purges terminal-failed invocations for that key, then clears keyed workflow state.
The admin API, DNS, or network path is unreachable	Fix connectivity first, then rerun `doctor`	The write routes depend on the same admin API path.
The workflow has an active invocation or already completed successfully	Do not clear state with this route	DALP returns `RESTATE_WORKFLOW_RETRY_BLOCKED` because replay would be unsafe or unnecessary.
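
The decision table above can be sketched as a small Python function. This is only an illustration, not DALP code: the function name, the `failed_key_is_safe` flag, and the return labels are hypothetical, and treating an empty `deployments.items` list as "the service URL is missing" is an assumption about how that signal appears in the readout.

```python
from typing import Any


def choose_action(doctor: dict[str, Any], failed_key_is_safe: bool = False) -> str:
    """Map a doctor readout to one next step, following the decision table."""
    # Connectivity first: the write routes depend on the same admin API path.
    if doctor["admin"]["status"] != "ok":
        return "fix-connectivity"

    items = doctor["deployments"].get("items") or []
    if not items:
        # Assumption: no deployment records means the service must be registered.
        return "force-redeploy"
    if len(items) > 1:
        # The operator still has to know which service URL should stay active.
        return "cleanup-stale-deployments"

    failures = doctor["invocations"].get("recentFailures") or []
    if failures and failed_key_is_safe:
        # The flag is a human judgment; DALP still runs its own safety checks.
        return "recover-stuck-workflow"
    return "no-write-needed"
```

The function returns one label at a time on purpose, which matches the guidance to change only the component that is unhealthy.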

Prerequisites

Confirm these facts before you call a recovery route:

Requirement	Value
API access	DALP admin API for the environment you operate.
Permission	System operate permission on the operator account or API key.
Authentication header	`X-Api-Key: <operator api key>`.
Service URL	Required for redeploy and cleanup. Use the exact `deployments.items[].serviceUrl` value from doctor when cleaning stale deployments.
Workflow identifiers	Required for workflow recovery. Use `serviceName` and `serviceKey` from a failed invocation record or workflow metadata.
Audit expectation	Every operator route writes an operations audit row with the actor, role, route, sanitized arguments, start time, finish time, outcome, and error when the call fails.

Set the example environment variables:

export DALP_API_URL="https://platform.example.com"
export DALP_API_KEY="sm_dalp_operator_1234567890"

Quickstart: inspect and re-register a workflow service

This quickstart uses one coherent recovery case: doctor reports the live durable workflow service URL, you re-register that URL, then you check doctor again.

1. Run doctor

curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

{
  "ingress": { "status": "ok", "latencyMs": 12 },
  "admin": { "status": "ok", "latencyMs": 9, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:00:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}

2. Re-register the service URL

Use the serviceUrl value from deployments.items. Keep the string exact, including scheme, host, port, path, and trailing slash if present.

curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/force-redeploy" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "serviceUrl": "http://ddwf:9080/v1",
    "force": true
  }'

{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}

3. Verify with doctor

curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

{
  "ingress": { "status": "ok", "latencyMs": 10 },
  "admin": { "status": "ok", "latencyMs": 8, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:10:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}

Redeploy changes service registration. It does not clear invocation failures. If recentFailures still lists a failed workflow key, as it does in the response above, inspect that workflow before closing the incident.

If doctor still shows old deployment records beside the deployment you want to keep, continue with stale deployment cleanup.
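
For automation that prefers Python over curl, the quickstart requests could be built roughly as follows. This is a minimal sketch using only the standard library; the helper name `build_operator_request` is hypothetical, and the route prefix and headers are copied from the curl examples above.

```python
import json
import urllib.request


def build_operator_request(
    base_url: str, api_key: str, route: str, body: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST request for one operator route."""
    url = f"{base_url.rstrip('/')}/api/v2/admin/operator/restate/{route}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Sending is left to the caller, for example:
#   import os
#   req = build_operator_request(
#       os.environ["DALP_API_URL"], os.environ["DALP_API_KEY"], "doctor", {}
#   )
#   with urllib.request.urlopen(req) as resp:
#       readout = json.load(resp)
```

Keeping request construction separate from sending makes it easy to log or review the exact request before a write route is called.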

Recovery order

Step	Route	Purpose	State change
1	`POST /api/v2/admin/operator/restate/doctor`	Inspect current health.	None.
2	One write route	Fix the one unhealthy component doctor identified.	Depends on the route.
3	`POST /api/v2/admin/operator/restate/doctor`	Confirm the affected component moved to `ok` or that the remaining failure is understood.	None.

Do not skip the first doctor call. The write routes share the admin API dependency, and a broken admin path makes the write routes fail or block.
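
The three-step order can be enforced in automation with a thin wrapper. The sketch below is hypothetical: `post` stands for any callable that sends a route name and body to the operator API and returns the parsed JSON response, and the wrapper refuses the write route when the first doctor call shows the admin API is not `ok`.

```python
from typing import Any, Callable

# (route, body) -> parsed JSON response; supplied by the caller.
Post = Callable[[str, dict], dict]


def inspect_act_verify(
    post: Post, write_route: str, body: dict[str, Any]
) -> dict[str, Any]:
    """Run doctor, one write route, then doctor again, in that order."""
    before = post("doctor", {})
    if before["admin"]["status"] != "ok":
        # The write routes share the admin API dependency, so stop here.
        raise RuntimeError("admin API is not ok; fix connectivity first")

    result = post(write_route, body)
    after = post("doctor", {})
    # Return all three payloads so the operator can compare before and after.
    return {"before": before, "result": result, "after": after}
```

Passing the transport in as a parameter keeps the ordering logic testable without a live environment.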

Doctor

doctor is read-only. It probes the Durable Execution Engine ingress URL, admin API, deployment list, service list, and invocation table. A failed probe marks only its own component as degraded or unreachable; the other components still report their own status. The route itself fails only when DALP cannot resolve the admin URL before probing.

POST /api/v2/admin/operator/restate/doctor

Request body

Send an empty JSON object.

{}

Response fields

Field	Type	Meaning
`ingress.status`	`ok`, `unreachable`, or `degraded`	Ingress probe result.
`ingress.latencyMs`	Number in milliseconds or `null`	Probe latency when measured.
`admin.status`	`ok`, `unreachable`, or `degraded`	Admin API probe result.
`admin.latencyMs`	Number in milliseconds or `null`	Admin API probe latency when measured.
`admin.version`	String or `null`	Admin API version when returned.
`deployments.status`	`ok`, `unreachable`, or `degraded`	Deployment list probe result.
`deployments.items[].id`	String	Durable Execution Engine deployment id.
`deployments.items[].serviceUrl`	URL string	Registered service URL. Preserve the exact value for cleanup.
`deployments.items[].createdAt`	ISO 8601 UTC timestamp	Deployment creation time returned by the admin API.
`services.status`	`ok`, `unreachable`, or `degraded`	Service list probe result.
`services.items[].name`	String	Registered service name.
`services.items[].revision`	Number or omitted	Service revision when returned.
`invocations.status`	`ok`, `unreachable`, or `degraded`	Invocation table probe result.
`invocations.byStatus`	Object keyed by invocation status	Invocation counts by status.
`invocations.recentFailures`	Array, maximum 20 items	Recent failed invocations ordered by most recently modified first.
`invocations.recentFailures[].serviceKey`	String or `null`	Workflow key for recovery. Use only when it is present.
`error` on readout objects	String or `null`	Human-readable probe failure reason for that component.
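
A readout with the fields above can be summarized with two small helpers. This is a sketch under the assumption that the response has exactly the documented shape; the function names are hypothetical.

```python
from typing import Any

COMPONENTS = ("ingress", "admin", "deployments", "services", "invocations")


def unhealthy_components(doctor: dict[str, Any]) -> dict[str, Any]:
    """Return {component: error} for every component whose status is not ok."""
    return {
        name: doctor[name].get("error")
        for name in COMPONENTS
        if doctor[name]["status"] != "ok"
    }


def recoverable_failures(doctor: dict[str, Any]) -> list[tuple[str, str]]:
    """List (serviceName, serviceKey) pairs, skipping failures without a key."""
    failures = doctor["invocations"].get("recentFailures") or []
    return [
        (f["serviceName"], f["serviceKey"])
        for f in failures
        if f.get("serviceKey")  # serviceKey may be null; use it only when present
    ]
```

Applied to the quickstart readout, `unhealthy_components` would return an empty dictionary and `recoverable_failures` would return the single `IdentityRecoveryWorkflow` pair.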

Force redeploy

force-redeploy registers the durable workflow service URL with the Durable Execution Engine admin API. It returns the deployment id for the registration. It does not drain or delete old deployments.

POST /api/v2/admin/operator/restate/force-redeploy

Request body

Field	Type	Required	Constraints and defaults
`serviceUrl`	URL string	Yes	Must be the live durable workflow service URL that should be registered. Use configured service state or deployment metadata. Do not reuse an old doctor URL when you are replacing a stale deployment. Credentials in URLs are redacted in the operations audit log.
`force`	Boolean	No	Defaults to `true`. When `true`, DALP sends a forced registration request to the admin API.

{
  "serviceUrl": "http://ddwf:9080/v1",
  "force": true
}

Success response

{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}

Operational notes

Run doctor before this route so you can compare deployment state before and after registration.
Run cleanup-stale-deployments after redeploy when doctor still shows old deployment rows.
Do not treat acknowledged: true as stale deployment cleanup. It means registration completed.

Cleanup stale deployments

cleanup-stale-deployments keeps the deployment matching serviceUrl. DALP resolves that deployment id server-side, then drains and deletes every other registered deployment. A failure to list deployments returns an admin-unreachable error. Per-deployment drain or delete failures are logged by the cleanup helper and do not change the success response shape.

POST /api/v2/admin/operator/restate/cleanup-stale-deployments

Request body

Field	Type	Required	Constraints and defaults
`serviceUrl`	URL string	Yes	Must exactly match the registered service URL of the deployment to keep. Copy it from `deployments.items[].serviceUrl` in doctor.
`forceDrain`	Boolean	No	Defaults to `false`. Use `true` only when stale deployments point to dead services and cannot drain normally.

{
  "serviceUrl": "http://ddwf:9080/v1",
  "forceDrain": false
}

Success response

{
  "acknowledged": true
}

Operational notes

forceDrain: false is the normal production value. DALP drains stale deployments through the admin API. Even with this value, invocations pinned to a stale deployment can still be killed when that deployment is unreachable and cannot drain normally.
forceDrain: true is for a known dead stale deployment where you accept forced drain behavior even when pending invocations remain.
Run doctor after cleanup and inspect deployments.items before you close the incident.
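
Before calling cleanup, it can help to preview which records would be kept and which would be removed. The sketch below is a client-side preview only and is not how DALP resolves the deployment; DALP does that server-side. The function name is hypothetical, and the exact string comparison mirrors the requirement that `serviceUrl` match the registered value exactly.

```python
from typing import Any


def plan_cleanup(
    doctor: dict[str, Any], keep_service_url: str
) -> tuple[list[str], list[str]]:
    """Preview (kept ids, stale ids) for a cleanup call, from a doctor readout."""
    items = doctor["deployments"]["items"]
    # Exact comparison: scheme, host, port, path, and trailing slash all matter.
    keep = [d["id"] for d in items if d["serviceUrl"] == keep_service_url]
    if not keep:
        raise LookupError(
            f"no registered deployment matches {keep_service_url!r}; "
            "copy serviceUrl from doctor"
        )
    stale = [d["id"] for d in items if d["serviceUrl"] != keep_service_url]
    return keep, stale
```

A `LookupError` here corresponds to the situation where the route would be expected to answer with `RESTATE_DEPLOYMENT_NOT_FOUND`.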

Recover stuck workflow

recover-stuck-workflow prepares one workflow key for retry. DALP queries prior invocations for the supplied (serviceName, serviceKey) pair, refuses unsafe recovery states, kills and purges terminal-failed invocations, then clears keyed workflow state. The next workflow submission starts from a blank state.

POST /api/v2/admin/operator/restate/recover-stuck-workflow

Request body

Field	Type	Required	Constraints and defaults
`serviceName`	String	Yes	Must match `^[a-zA-Z0-9_-]+$`. Use the service name from doctor, failed invocation metadata, or workflow metadata. No default.
`serviceKey`	String	Yes	Must match `^[a-zA-Z0-9_-]+$`. Use the exact workflow key for the failed invocation. No default.

{
  "serviceName": "IdentityRecoveryWorkflow",
  "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1"
}
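
The request body constraints can be checked client-side before the call. This is a minimal sketch; the helper name is hypothetical, and the pattern is the one from the table above, applied with `re.fullmatch` so the whole string must match.

```python
import re

# Same character class as ^[a-zA-Z0-9_-]+$; fullmatch anchors both ends.
_IDENTIFIER = re.compile(r"[a-zA-Z0-9_-]+")


def build_recovery_body(service_name: str, service_key: str) -> dict[str, str]:
    """Validate both identifiers and return the recover-stuck-workflow body."""
    fields = (("serviceName", service_name), ("serviceKey", service_key))
    for field, value in fields:
        if not isinstance(value, str) or _IDENTIFIER.fullmatch(value) is None:
            raise ValueError(f"{field} must match ^[a-zA-Z0-9_-]+$: {value!r}")
    return {"serviceName": service_name, "serviceKey": service_key}
```

Rejecting a malformed key locally avoids writing a failed attempt to the operations audit log for a simple typo.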

Success response

{
  "acknowledged": true
}

Blocked recovery response

DALP refuses recovery when the workflow has an active invocation or has already succeeded. The error includes a reason and the relevant invocation ids.

{
  "code": "RESTATE_WORKFLOW_RETRY_BLOCKED",
  "message": "Workflow retry is blocked by an active invocation",
  "data": {
    "reason": "active-invocation",
    "invocationIds": ["inv_01j8m7p2q3r4s5t6u7v8w9x0z2"]
  }
}

Retry blocked reasons

Reason	What DALP observed	State change	Operator response	Retry semantics
`active-invocation`	A matching `run` invocation is not terminal. DALP treats every status except `completed`, `killed`, and `cancelled` as active, including `pending`, `scheduled`, `ready`, `running`, `suspended`, `backing-off`, `paused`, and unknown future statuses.	DALP does not kill, purge, or clear state.	Inspect the returned invocation ids. Wait for completion or handle the active invocation through the Durable Execution Engine admin tooling before retrying.	Retry only after the active invocation is no longer active.
`already-succeeded`	A prior matching invocation already completed successfully.	DALP does not kill, purge, or clear state.	Do not replay the workflow blindly. Investigate why the caller or UI still reports a stuck state.	Treat as terminal for recovery unless new evidence shows a different workflow key is stuck.
`purge-failed`	DALP could not purge a terminal-failed invocation for a non-transport reason.	DALP may have completed earlier recovery steps before the purge failed.	Inspect admin logs and retry after the purge failure is resolved.	Retry after the purge condition is fixed. Transport failures surface as `RESTATE_ADMIN_UNREACHABLE`.
`query-failed`	DALP could not query invocation state for a non-transport reason.	DALP does not clear workflow state.	Escalate the malformed or unexpected admin response.	Retry only after the query path is healthy. Transport failures surface as `RESTATE_ADMIN_UNREACHABLE`.
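
The `active-invocation` rule is a deny-list in reverse: anything that is not one of three terminal statuses counts as active. A short sketch of that rule, with hypothetical names:

```python
# The only statuses the active-invocation rule treats as not active.
TERMINAL_STATUSES = frozenset({"completed", "killed", "cancelled"})


def is_active(status: str) -> bool:
    """True for every status outside the terminal set, including unknown ones."""
    return status not in TERMINAL_STATUSES
```

Because the check is written against the terminal set, a status added in a future engine version is treated as active by default, which is the safe direction for recovery.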

Error reference

Error	HTTP class	What DALP observed	State change and audit behavior	Operator response	Retry semantics
Missing system operate permission	Authorization failure	The caller is authenticated but not authorized for operator routes, or the API key lacks the required permission.	The recovery handler does not run. Authorization failure behavior follows the operator route authorization layer.	Use an operator account or API key with system operate permission.	Retry only with corrected credentials.
`RESTATE_ADMIN_UNREACHABLE`	5xx class	DALP could not resolve the admin URL or could not reach the Durable Execution Engine admin API.	The target recovery action is not acknowledged. Operator middleware records the failed attempt when the route reaches the operator handler.	Check admin URL configuration, DNS, network policy, and admin API health. Run doctor after the admin path recovers.	Retry after connectivity or configuration is fixed.
`RESTATE_DEPLOYMENT_NOT_FOUND`	404 class	The supplied `serviceUrl` does not map to a registered deployment when DALP needs that mapping.	DALP does not clean stale deployments because it cannot identify the deployment to keep. Operator middleware records the failed attempt.	Run doctor and retry with the exact registered `serviceUrl`.	Retry with the exact service URL from doctor.
`RESTATE_WORKFLOW_RETRY_BLOCKED`	409 class for active or succeeded workflow state	Workflow recovery found a condition that makes clearing state unsafe.	DALP does not clear state for active or already-succeeded workflows. The error returns `reason` and `invocationIds`. Operator middleware records the failed attempt.	Inspect the returned reason and invocation ids before taking further action.	Depends on `reason`; use the retry blocked reasons table.

DAPI error responses include the normal request identifier for support correlation. Keep that identifier with the incident record and the operations audit row.
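
Automation can turn the two tables above into a lookup so that a failed call produces the documented retry guidance. This is a sketch only: the function name is hypothetical, and it assumes every error body has the `code` and `data.reason` shape shown in the blocked recovery response.

```python
from typing import Any

BLOCKED_REASON_GUIDANCE = {
    "active-invocation": "retry only after the active invocation is no longer active",
    "already-succeeded": "do not retry; treat as terminal for recovery",
    "purge-failed": "retry after the purge condition is fixed",
    "query-failed": "retry only after the query path is healthy",
}


def retry_guidance(error: dict[str, Any]) -> str:
    """Return the documented retry semantics for an operator route error body."""
    code = error.get("code")
    if code == "RESTATE_ADMIN_UNREACHABLE":
        return "retry after connectivity or configuration is fixed"
    if code == "RESTATE_DEPLOYMENT_NOT_FOUND":
        return "retry with the exact service URL from doctor"
    if code == "RESTATE_WORKFLOW_RETRY_BLOCKED":
        reason = (error.get("data") or {}).get("reason")
        return BLOCKED_REASON_GUIDANCE.get(
            reason, "inspect the reason and invocation ids before retrying"
        )
    # Anything else is outside the documented table.
    return "unrecognized code; keep the request identifier with the incident record"
```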

Audit, security, and production boundaries

Topic	Behavior
Permission boundary	All four routes require system operate permission and run through operator route middleware.
Audit log	Each operator route attempt creates an operations audit row with the actor user id, the actor's role at the time of the call, route, sanitized arguments, reason when supplied, start time, finish time, outcome, and error on failure.
Credentialed URL redaction	If an argument contains a URL with username or password information, the stored audit argument replaces that user information with `REDACTED` and preserves the host, path, query, and fragment.
Tenant scope	The operations audit log is platform-global and keyed by operator actor, not tenant row ownership.
PCI scope	These recovery routes do not accept cardholder data. Do not put card data, credentials, secrets, or private keys in request bodies.
KYC scope	These routes do not perform KYC checks. They recover workflow infrastructure state and do not change identity verification status.
Idempotency	These routes have no documented idempotency key contract. Use doctor before and after each write route instead of blind retries.
Timestamps	Doctor timestamps use ISO 8601 UTC strings.
Availability	A healthy doctor response is operational evidence for this recovery surface. It is not an SLA, failover drill, or disaster recovery proof.
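
The credentialed URL redaction described in the table can be illustrated with the standard library. This sketch is not DALP's implementation: the function name is hypothetical, and the exact output format (`REDACTED@host`) is an assumption about how the replacement is rendered.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_userinfo(url: str) -> str:
    """Replace any username or password in a URL with REDACTED."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url  # no user information, nothing to redact

    # Keep host and port; drop everything before the last "@" in the netloc.
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, f"REDACTED@{host}", parts.path, parts.query, parts.fragment)
    )
```

Under that assumption, `http://user:pw@ddwf:9080/v1?x=1#frag` becomes `http://REDACTED@ddwf:9080/v1?x=1#frag`, with host, path, query, and fragment preserved.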
