Durable Execution Engine recovery
Recover Durable Execution Engine-backed DALP workflows with operator-only DAPI routes for health checks, redeployment, stale deployment cleanup, and stuck workflow recovery.
The Durable Execution Engine recovery routes sit behind the DALP operator API. They are for platform operators and trusted automation that already have system operate permission. They are not tenant-facing product routes.
Always run doctor first. It reads Durable Execution Engine ingress, admin API, deployment, service, and invocation health without changing workflow state. Use a write route only after the doctor response identifies the unhealthy component. Run doctor again after the write route to confirm the result.
DALP records each operator route attempt in the operations audit log. If a request includes a URL with username or password information, the audit row redacts that user information before storing the request arguments.
The recovery path is inspect, act once, then verify. force-redeploy handles service registration. cleanup-stale-deployments handles old deployment records. recover-stuck-workflow handles one failed workflow key after DALP verifies that retry is safe.
Choose the recovery action
Start from the doctor readout and change only the component that is unhealthy. This keeps the incident narrow and gives the next operator a clean audit trail.
| Doctor signal | Use this route | Why |
|---|---|---|
| The service URL is missing or the service needs to be registered again | force-redeploy | Registers the live durable workflow service URL and returns the new deployment id. |
| More than one deployment record exists and you know which service URL should stay active | cleanup-stale-deployments | Keeps the matching deployment and drains/deletes other deployment records. |
| A specific workflow key is terminal-failed and safe to replay | recover-stuck-workflow | Kills and purges terminal-failed invocations for that key, then clears keyed workflow state. |
| The admin API, DNS, or network path is unreachable | Fix connectivity first, then rerun doctor | The write routes depend on the same admin API path. |
| The workflow has an active invocation or already completed successfully | Do not clear state with this route | DALP returns RESTATE_WORKFLOW_RETRY_BLOCKED because replay would be unsafe or unnecessary. |
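The table above can be read as a small decision function. The following is a minimal sketch, not DALP code: the function name `suggest_route` and the return labels `fix-connectivity` and `none` are hypothetical, and treating an empty `deployments.items` list as "the service URL is missing" is an assumption. It only suggests a route; whether a failed workflow key is safe to replay is still decided by DALP when the route is called.

```python
def suggest_route(doctor: dict) -> str:
    """Suggest one recovery route from a doctor readout (illustrative heuristic)."""
    # The write routes depend on the admin API, so connectivity comes first.
    if doctor["admin"]["status"] != "ok":
        return "fix-connectivity"

    items = doctor["deployments"].get("items") or []
    # Assumption: no registered deployment means the service must be registered.
    if not items:
        return "force-redeploy"
    # More than one deployment record suggests stale deployments to clean up.
    if len(items) > 1:
        return "cleanup-stale-deployments"

    failures = doctor["invocations"].get("recentFailures") or []
    # Only failures that carry a workflow key can be passed to workflow recovery.
    if any(failure.get("serviceKey") for failure in failures):
        return "recover-stuck-workflow"
    return "none"
```

The order of the checks mirrors the guidance to change only one component per pass: the function returns the first matching route rather than a list.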
Prerequisites
Confirm these facts before you call a recovery route:
| Requirement | Value |
|---|---|
| API access | DALP admin API for the environment you operate. |
| Permission | System operate permission on the operator account or API key. |
| Authentication header | X-Api-Key: <operator api key>. |
| Service URL | Required for redeploy and cleanup. Use the exact deployments.items[].serviceUrl value from doctor when cleaning stale deployments. |
| Workflow identifiers | Required for workflow recovery. Use serviceName and serviceKey from a failed invocation record or workflow metadata. |
| Audit expectation | Every operator route writes an operations audit row with the actor, role, route, sanitized arguments, start time, finish time, outcome, and error when the call fails. |
Set the example environment variables:
export DALP_API_URL="https://platform.example.com"
export DALP_API_KEY="sm_dalp_operator_1234567890"
Quickstart: inspect and re-register a workflow service
This quickstart uses one coherent recovery case: doctor reports the live durable workflow service URL, you re-register that URL, then you check doctor again.
1. Run doctor
curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

{
  "ingress": { "status": "ok", "latencyMs": 12 },
  "admin": { "status": "ok", "latencyMs": 9, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:00:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}
2. Re-register the service URL
Use the serviceUrl value from deployments.items. Keep the string exact, including scheme, host, port, path, and trailing slash if present.
curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/force-redeploy" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "serviceUrl": "http://ddwf:9080/v1",
    "force": true
  }'

{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}
3. Verify with doctor
curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

{
  "ingress": { "status": "ok", "latencyMs": 10 },
  "admin": { "status": "ok", "latencyMs": 8, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:10:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}
Redeploy changes service registration. It does not clear invocation failures. If recentFailures still names a blocked workflow key, inspect that workflow before closing the incident.
If doctor still shows old deployment records beside the deployment you want to keep, continue with stale deployment cleanup.
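For automation that prefers Python over curl, the quickstart requests can be built with the standard library. This is a minimal sketch under stated assumptions: the helper name `build_operator_request` is hypothetical, and it reads the same `DALP_API_URL` and `DALP_API_KEY` environment variables as the curl examples when no explicit values are passed. It only constructs the request; nothing is sent until the caller passes it to `urllib.request.urlopen`.

```python
import json
import os
import urllib.request

# Path prefix shared by the four operator routes in this document.
ROUTE_PREFIX = "/api/v2/admin/operator/restate/"


def build_operator_request(route, body, base_url=None, api_key=None):
    """Build (but do not send) a POST request for one operator route."""
    base_url = base_url or os.environ["DALP_API_URL"]
    api_key = api_key or os.environ["DALP_API_KEY"]
    return urllib.request.Request(
        url=base_url.rstrip("/") + ROUTE_PREFIX + route,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

A doctor call would then be `urllib.request.urlopen(build_operator_request("doctor", {}))`, with the response body parsed through `json.load`.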
Recovery order
| Step | Route | Purpose | State change |
|---|---|---|---|
| 1 | POST /api/v2/admin/operator/restate/doctor | Inspect current health. | None. |
| 2 | One write route | Fix the one unhealthy component doctor identified. | Depends on the route. |
| 3 | POST /api/v2/admin/operator/restate/doctor | Confirm the affected component moved to ok or that the remaining failure is understood. | None. |
Do not skip the first doctor call. The write routes share the admin API dependency, and a broken admin path makes the write routes fail or block.
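The three-step order above can be sketched as a small wrapper. This is an illustration, not a DALP client: `inspect_act_verify`, `unhealthy`, and the injected `call(route, body)` function are hypothetical names, and `call` is assumed to return the parsed JSON response for a route. Injecting the transport keeps the sketch independent of any HTTP library.

```python
from typing import Callable

# Component names as they appear in the doctor response.
COMPONENTS = ("ingress", "admin", "deployments", "services", "invocations")


def unhealthy(doctor: dict) -> list:
    """Return the doctor components whose status is not ok."""
    return [name for name in COMPONENTS if doctor.get(name, {}).get("status") != "ok"]


def inspect_act_verify(call: Callable[[str, dict], dict], write_route: str, body: dict) -> dict:
    """Run doctor, one write route, then doctor again."""
    before = call("doctor", {})
    # The write routes share the admin API dependency, so stop early if it is down.
    if "admin" in unhealthy(before):
        raise RuntimeError("admin API is not ok; fix connectivity before any write route")
    result = call(write_route, body)
    after = call("doctor", {})
    return {"before": unhealthy(before), "result": result, "after": unhealthy(after)}
```

The wrapper runs exactly one write route per pass, which matches the act-once guidance and keeps the audit trail to three rows per recovery attempt.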
Doctor
doctor is read-only. It probes the Durable Execution Engine ingress URL, admin API, deployment list, service list, and invocation table. A failed component marks only that component as degraded or unreachable. The route fails only when DALP cannot resolve the admin URL before probing.
POST /api/v2/admin/operator/restate/doctor
Request body
Send an empty JSON object.
{}
Response fields
| Field | Type | Meaning |
|---|---|---|
| ingress.status | ok, unreachable, or degraded | Ingress probe result. |
| ingress.latencyMs | Number in milliseconds or null | Probe latency when measured. |
| admin.status | ok, unreachable, or degraded | Admin API probe result. |
| admin.latencyMs | Number in milliseconds or null | Admin API probe latency when measured. |
| admin.version | String or null | Admin API version when returned. |
| deployments.status | ok, unreachable, or degraded | Deployment list probe result. |
| deployments.items[].id | String | Durable Execution Engine deployment id. |
| deployments.items[].serviceUrl | URL string | Registered service URL. Preserve the exact value for cleanup. |
| deployments.items[].createdAt | ISO 8601 UTC timestamp | Deployment creation time returned by the admin API. |
| services.status | ok, unreachable, or degraded | Service list probe result. |
| services.items[].name | String | Registered service name. |
| services.items[].revision | Number or omitted | Service revision when returned. |
| invocations.status | ok, unreachable, or degraded | Invocation table probe result. |
| invocations.byStatus | Object keyed by invocation status | Invocation counts by status. |
| invocations.recentFailures | Array, maximum 20 items | Recent failed invocations ordered by most recently modified first. |
| invocations.recentFailures[].serviceKey | String or null | Workflow key for recovery. Use only when it is present. |
| error on readout objects | String or null | Human-readable probe failure reason for that component. |
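Because `invocations.recentFailures[].serviceKey` can be null, a script that feeds doctor output into workflow recovery needs to skip entries without a key. The sketch below shows one way to do that; the function name `recovery_candidates` is hypothetical, and deduplicating repeated pairs is a convenience choice rather than documented DALP behavior.

```python
def recovery_candidates(doctor: dict) -> list:
    """Return unique (serviceName, serviceKey) pairs from recent failures.

    Entries with a missing or null serviceKey are skipped, because the key
    is only usable for recovery when it is present.
    """
    failures = doctor.get("invocations", {}).get("recentFailures") or []
    pairs = []
    for failure in failures:
        name = failure.get("serviceName")
        key = failure.get("serviceKey")
        if name and key and (name, key) not in pairs:
            pairs.append((name, key))
    return pairs
```

Each returned pair is only a candidate. DALP still refuses recovery for keys with an active or already-succeeded invocation.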
Force redeploy
force-redeploy registers the durable workflow service URL with the Durable Execution Engine admin API. It returns the deployment id for the registration. It does not drain or delete old deployments.
POST /api/v2/admin/operator/restate/force-redeploy
Request body
| Field | Type | Required | Constraints and defaults |
|---|---|---|---|
| serviceUrl | URL string | Yes | Must be the live durable workflow service URL that should be registered. Use configured service state or deployment metadata. Do not reuse an old doctor URL when you are replacing a stale deployment. Credentials in URLs are redacted in the operations audit log. |
| force | Boolean | No | Defaults to true. When true, DALP sends a forced registration request to the admin API. |
{
  "serviceUrl": "http://ddwf:9080/v1",
  "force": true
}
Success response
{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}
Operational notes
- Run doctor before this route so you can compare deployment state before and after registration.
- Run cleanup-stale-deployments after redeploy when doctor still shows old deployment rows.
- Do not treat acknowledged: true as stale deployment cleanup. It means registration completed.
Cleanup stale deployments
cleanup-stale-deployments keeps the deployment matching serviceUrl. DALP resolves that deployment id server-side, then drains and deletes every other registered deployment. A failure to list deployments returns an admin-unreachable error. Per-deployment drain or delete failures are logged by the cleanup helper and do not change the success response shape.
POST /api/v2/admin/operator/restate/cleanup-stale-deployments
Request body
| Field | Type | Required | Constraints and defaults |
|---|---|---|---|
| serviceUrl | URL string | Yes | Must exactly match the registered service URL of the deployment to keep. Copy it from deployments.items[].serviceUrl in doctor. |
| forceDrain | Boolean | No | Defaults to false. Use true only when stale deployments point to dead services and cannot drain normally. |
{
  "serviceUrl": "http://ddwf:9080/v1",
  "forceDrain": false
}
Success response
{
  "acknowledged": true
}
Operational notes
- forceDrain: false is the normal production value. DALP drains stale deployments through the admin API, but it can still kill pinned invocations when a stale deployment is unreachable and cannot drain normally.
- forceDrain: true is for a known dead stale deployment where you accept forced drain behavior even when pending invocations remain.
- Run doctor after cleanup and inspect deployments.items before you close the incident.
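Since cleanup depends on an exact `serviceUrl` match, a client-side preview can catch a mismatched string before the write route is called. The following is a sketch only: `plan_cleanup` is a hypothetical helper, and DALP still resolves the deployment to keep server-side, so this preview is not authoritative.

```python
def plan_cleanup(doctor: dict, keep_url: str) -> tuple:
    """Preview which deployment ids a cleanup call would keep and remove.

    Comparison is an exact string match, including scheme, host, port,
    path, and any trailing slash.
    """
    items = doctor["deployments"]["items"]
    keep = [d["id"] for d in items if d["serviceUrl"] == keep_url]
    if not keep:
        # Mirrors the case where DALP cannot identify the deployment to keep.
        raise ValueError(f"no registered deployment matches {keep_url!r}")
    stale = [d["id"] for d in items if d["serviceUrl"] != keep_url]
    return keep, stale
```

An empty stale list means cleanup has nothing to remove, which is a useful signal that doctor output and the intended action disagree.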
Recover stuck workflow
recover-stuck-workflow prepares one workflow key for retry. DALP queries prior invocations for the supplied (serviceName, serviceKey) pair, refuses unsafe recovery states, kills and purges terminal-failed invocations, then clears keyed workflow state. The next workflow submission starts from a blank state.
POST /api/v2/admin/operator/restate/recover-stuck-workflow
Request body
| Field | Type | Required | Constraints and defaults |
|---|---|---|---|
| serviceName | String | Yes | Must match ^[a-zA-Z0-9_-]+$. Use the service name from doctor, failed invocation metadata, or workflow metadata. No default. |
| serviceKey | String | Yes | Must match ^[a-zA-Z0-9_-]+$. Use the exact workflow key for the failed invocation. No default. |
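The identifier constraint in the table can be checked before a request is sent. This is a minimal client-side sketch; the helper names are hypothetical, and the server remains the authority on validation. `re.fullmatch` is used so that a trailing newline cannot slip past the `$` anchor.

```python
import re

# Same character class as the documented pattern ^[a-zA-Z0-9_-]+$.
IDENTIFIER = re.compile(r"[a-zA-Z0-9_-]+")


def is_valid_identifier(value: str) -> bool:
    """Return True when value satisfies the serviceName/serviceKey pattern."""
    return IDENTIFIER.fullmatch(value) is not None


def recovery_body(service_name: str, service_key: str) -> dict:
    """Build the recover-stuck-workflow request body after validating both fields."""
    for field, value in (("serviceName", service_name), ("serviceKey", service_key)):
        if not is_valid_identifier(value):
            raise ValueError(f"{field} must match ^[a-zA-Z0-9_-]+$")
    return {"serviceName": service_name, "serviceKey": service_key}
```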
{
  "serviceName": "IdentityRecoveryWorkflow",
  "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1"
}
Success response
{
  "acknowledged": true
}
Blocked recovery response
DALP refuses recovery when the workflow has an active invocation or already succeeded. The error includes a reason and the relevant invocation ids.
{
  "code": "RESTATE_WORKFLOW_RETRY_BLOCKED",
  "message": "Workflow retry is blocked by an active invocation",
  "data": {
    "reason": "active-invocation",
    "invocationIds": ["inv_01j8m7p2q3r4s5t6u7v8w9x0z2"]
  }
}
Retry blocked reasons
| Reason | What DALP observed | State change | Operator response | Retry semantics |
|---|---|---|---|---|
| active-invocation | A matching run invocation is not terminal. DALP treats every status except completed, killed, and cancelled as active, including pending, scheduled, ready, running, suspended, backing-off, paused, and unknown future statuses. | DALP does not kill, purge, or clear state. | Inspect the returned invocation ids. Wait for completion or handle the active invocation through the Durable Execution Engine admin tooling before retrying. | Retry only after the active invocation is no longer active. |
| already-succeeded | A prior matching invocation already completed successfully. | DALP does not kill, purge, or clear state. | Do not replay the workflow blindly. Investigate why the caller or UI still reports a stuck state. | Treat as terminal for recovery unless new evidence shows a different workflow key is stuck. |
| purge-failed | DALP could not purge a terminal-failed invocation for a non-transport reason. | DALP may have completed earlier recovery steps before the purge failed. | Inspect admin logs and retry after the purge failure is resolved. | Retry after the purge condition is fixed. Transport failures surface as RESTATE_ADMIN_UNREACHABLE. |
| query-failed | DALP could not query invocation state for a non-transport reason. | DALP does not clear workflow state. | Escalate the malformed or unexpected admin response. | Retry only after the query path is healthy. Transport failures surface as RESTATE_ADMIN_UNREACHABLE. |
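The active-invocation rule is a deny-list in reverse: anything that is not a known terminal status counts as active, so unknown future statuses fail safe. The sketch below illustrates that rule and a lookup of the retry semantics from the table. The names `is_active` and `retry_guidance` are hypothetical, and this is not DALP's implementation.

```python
# Statuses the table names as terminal; every other status is treated as active.
TERMINAL_STATUSES = frozenset({"completed", "killed", "cancelled"})

# Retry semantics, condensed from the retry blocked reasons table.
RETRY_GUIDANCE = {
    "active-invocation": "retry only after the active invocation is no longer active",
    "already-succeeded": "treat as terminal for recovery",
    "purge-failed": "retry after the purge condition is fixed",
    "query-failed": "retry only after the query path is healthy",
}


def is_active(status: str) -> bool:
    """Return True for any status that is not explicitly terminal."""
    return status not in TERMINAL_STATUSES


def retry_guidance(error: dict) -> str:
    """Look up guidance for a RESTATE_WORKFLOW_RETRY_BLOCKED error payload."""
    reason = error.get("data", {}).get("reason")
    return RETRY_GUIDANCE.get(reason, "unknown reason; inspect the error before retrying")
```

Checking membership in the terminal set, rather than listing active statuses, is what makes a newly introduced status block recovery by default.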
Error reference
| Error | HTTP class | What DALP observed | State change and audit behavior | Operator response | Retry semantics |
|---|---|---|---|---|---|
| Missing system operate permission | Authorization failure | The caller is authenticated but not authorized for operator routes, or the API key lacks the required permission. | The recovery handler does not run. Authorization failure behavior follows the operator route authorization layer. | Use an operator account or API key with system operate permission. | Retry only with corrected credentials. |
| RESTATE_ADMIN_UNREACHABLE | 5xx class | DALP could not resolve the admin URL or could not reach the Durable Execution Engine admin API. | The target recovery action is not acknowledged. Operator middleware records the failed attempt when the route reaches the operator handler. | Check admin URL configuration, DNS, network policy, and admin API health. Run doctor after the admin path recovers. | Retry after connectivity or configuration is fixed. |
| RESTATE_DEPLOYMENT_NOT_FOUND | 404 class | The supplied serviceUrl does not map to a registered deployment when DALP needs that mapping. | DALP does not clean stale deployments because it cannot identify the deployment to keep. Operator middleware records the failed attempt. | Run doctor and retry with the exact registered serviceUrl. | Retry with the exact service URL from doctor. |
| RESTATE_WORKFLOW_RETRY_BLOCKED | 409 class for active or succeeded workflow state | Workflow recovery found a condition that makes clearing state unsafe. | DALP does not clear state for active or already-succeeded workflows. The error returns reason and invocationIds. Operator middleware records the failed attempt. | Inspect the returned reason and invocation ids before taking further action. | Depends on reason; use the retry blocked reasons table. |
DAPI error responses include the normal request identifier for support correlation. Keep that identifier with the incident record and the operations audit row.
Audit, security, and production boundaries
| Topic | Behavior |
|---|---|
| Permission boundary | All four routes require system operate permission and run through operator route middleware. |
| Audit log | Operator route attempts create an operations audit row with actor user id, role at time, route, sanitized arguments, reason when supplied, start time, finish time, outcome, and error on failure. |
| Credentialed URL redaction | If an argument contains a URL with username or password information, the stored audit argument replaces that user information with REDACTED and preserves the host, path, query, and fragment. |
| Tenant scope | The operations audit log is platform-global and keyed by operator actor, not tenant row ownership. |
| PCI scope | These recovery routes do not accept cardholder data. Do not put card data, credentials, secrets, or private keys in request bodies. |
| KYC scope | These routes do not perform KYC checks. They recover workflow infrastructure state and do not change identity verification status. |
| Idempotency | The routes do not document an idempotency key contract. Use doctor before and after each write route instead of blind retries. |
| Timestamps | Doctor timestamps use ISO 8601 UTC strings. |
| Availability | A healthy doctor response is operational evidence for this recovery surface. It is not an SLA, failover drill, or disaster recovery proof. |
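The credentialed URL redaction row can be illustrated with the standard library. This is a sketch of the described behavior, not DALP's implementation: the function name `redact_url_credentials` is hypothetical, and replacing the whole user information component with a single REDACTED token is an assumption about the stored format.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url_credentials(url: str) -> str:
    """Replace username/password in a URL with REDACTED, keeping the rest."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        # No user information present, so nothing to redact.
        return url
    # Keep the host and port exactly as written after the last "@".
    host_port = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, "REDACTED@" + host_port, parts.path, parts.query, parts.fragment)
    )
```

For example, `http://user:pass@ddwf:9080/v1?x=1#f` becomes `http://REDACTED@ddwf:9080/v1?x=1#f`, with host, path, query, and fragment preserved.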