# Durable Execution Engine recovery

Source: https://docs.settlemint.com/docs/developer-guides/operations/durable-execution-engine-recovery
Recover Durable Execution Engine-backed DALP workflows with operator-only DAPI routes for health checks, redeployment, stale deployment cleanup, and stuck workflow recovery.


The Durable Execution Engine recovery routes sit behind the DALP operator API. They are for platform operators and trusted automation that already have system operate permission. They are not tenant-facing product routes.

Always run `doctor` first. It reads Durable Execution Engine ingress, admin API, deployment, service, and invocation health without changing workflow state. Use a write route only after the doctor response identifies the unhealthy component. Run `doctor` again after the write route to confirm the result.

DALP records each operator route attempt in the operations audit log. If a request includes a URL with username or password information, the audit row redacts that user information before storing the request arguments.

<Mermaid
  chart="`
flowchart TD
Operator[&#x22;Operator with system operate permission&#x22;] --> DAPI[&#x22;DALP operator API&#x22;]
DAPI --> Doctor[&#x22;doctor&#x22;]
Doctor --> Health[&#x22;Ingress, admin API, deployments, services, invocations&#x22;]
Health --> Decision[&#x22;Choose one recovery action&#x22;]
Decision --> Redeploy[&#x22;force-redeploy&#x22;]
Decision --> Cleanup[&#x22;cleanup-stale-deployments&#x22;]
Decision --> Recover[&#x22;recover-stuck-workflow&#x22;]
Redeploy --> Verify[&#x22;Run doctor again&#x22;]
Cleanup --> Verify
Recover --> Verify
Verify --> Audit[&#x22;Audit row records success or failure&#x22;]
`"
/>

In prose, the recovery path is inspect, act once, then verify. `force-redeploy` handles service registration. `cleanup-stale-deployments` handles old deployment records. `recover-stuck-workflow` handles one failed workflow key after DALP verifies that retry is safe.

## Choose the recovery action [#choose-the-recovery-action]

Start from the doctor readout and change only the component that is unhealthy. This keeps the incident narrow and gives the next operator a clean audit trail.

| Doctor signal                                                                            | Use this route                              | Why                                                                                          |
| ---------------------------------------------------------------------------------------- | ------------------------------------------- | -------------------------------------------------------------------------------------------- |
| The service URL is missing or the service needs to be registered again                   | `force-redeploy`                            | Registers the live durable workflow service URL and returns the new deployment id.           |
| More than one deployment record exists and you know which service URL should stay active | `cleanup-stale-deployments`                 | Keeps the matching deployment, then drains and deletes every other deployment record.        |
| A specific workflow key is terminal-failed and safe to replay                            | `recover-stuck-workflow`                    | Kills and purges terminal-failed invocations for that key, then clears keyed workflow state. |
| The admin API, DNS, or network path is unreachable                                       | Fix connectivity first, then rerun `doctor` | The write routes depend on the same admin API path.                                          |
| The workflow has an active invocation or already completed successfully                  | Do not clear state with this route          | DALP returns `RESTATE_WORKFLOW_RETRY_BLOCKED` because replay would be unsafe or unnecessary. |
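
The table can also be read as a small decision function. The sketch below is one hypothetical way to encode it in shell. The function name, its three inputs, and the returned labels are illustrative assumptions and not part of the DALP API. It also simplifies the first row to "no deployment is registered".

```bash
# Hypothetical helper: map a doctor readout to one recovery action.
# $1 = admin.status from doctor
# $2 = number of entries in deployments.items
# $3 = workflow key of a terminal-failed invocation (empty if none)
choose_recovery_action() {
  admin_status=$1
  deployment_count=$2
  failed_key=$3

  if [ "$admin_status" != "ok" ]; then
    # Write routes depend on the admin API, so fix the path first.
    echo "fix-connectivity"
  elif [ "$deployment_count" -eq 0 ]; then
    # Simplification: treat "no deployment" as "service URL is missing".
    echo "force-redeploy"
  elif [ "$deployment_count" -gt 1 ]; then
    echo "cleanup-stale-deployments"
  elif [ -n "$failed_key" ]; then
    echo "recover-stuck-workflow"
  else
    echo "none"
  fi
}
```

The order of the checks follows the table: connectivity comes first because every write route shares the admin API path, and workflow recovery comes last because it targets a single key.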

## Prerequisites [#prerequisites]

Confirm these facts before you call a recovery route:

| Requirement           | Value                                                                                                                                                                  |
| --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| API access            | DALP admin API for the environment you operate.                                                                                                                        |
| Permission            | System operate permission on the operator account or API key.                                                                                                          |
| Authentication header | `X-Api-Key: <operator api key>`.                                                                                                                                       |
| Service URL           | Required for redeploy and cleanup. Use the exact `deployments.items[].serviceUrl` value from doctor when cleaning stale deployments.                                   |
| Workflow identifiers  | Required for workflow recovery. Use `serviceName` and `serviceKey` from a failed invocation record or workflow metadata.                                               |
| Audit expectation     | Every operator route writes an operations audit row with the actor, role, route, sanitized arguments, start time, finish time, outcome, and error when the call fails. |

Set the example environment variables:

```bash
export DALP_API_URL="https://platform.example.com"
export DALP_API_KEY="sm_dalp_operator_1234567890"
```
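
Every route on this page shares the same base path and headers, so a small wrapper can reduce copy and paste errors. The sketch below assumes the two environment variables above are set. `operator_route_url` and `operator_post` are hypothetical helper names, not DALP tooling.

```bash
# Hypothetical helper: build the full URL for an operator route.
# Strips one trailing slash from DALP_API_URL to avoid "//api".
operator_route_url() {
  printf '%s/api/v2/admin/operator/restate/%s\n' "${DALP_API_URL%/}" "$1"
}

# Hypothetical helper: POST a JSON body to an operator route.
# $1 = route name (for example doctor), $2 = JSON body (defaults to {}).
operator_post() {
  route=$1
  body=$2
  [ -n "$body" ] || body='{}'
  curl -sS -X POST "$(operator_route_url "$route")" \
    -H "X-Api-Key: $DALP_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$body"
}
```

With these defined, `operator_post doctor` sends the same request as the first quickstart step.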

## Quickstart: inspect and re-register a workflow service [#quickstart-inspect-and-re-register-a-workflow-service]

This quickstart uses one coherent recovery case: doctor reports the live durable workflow service URL, you re-register that URL, then you check doctor again.

### 1. Run doctor [#1-run-doctor]

```bash
curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'
```

```json
{
  "ingress": { "status": "ok", "latencyMs": 12 },
  "admin": { "status": "ok", "latencyMs": 9, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:00:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}
```

### 2. Re-register the service URL [#2-re-register-the-service-url]

Use the `serviceUrl` value from `deployments.items`. Keep the string exact, including scheme, host, port, path, and trailing slash if present.

```bash
curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/force-redeploy" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "serviceUrl": "http://ddwf:9080/v1",
    "force": true
  }'
```

```json
{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}
```

### 3. Verify with doctor [#3-verify-with-doctor]

```bash
curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
  -H "X-Api-Key: $DALP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'
```

```json
{
  "ingress": { "status": "ok", "latencyMs": 10 },
  "admin": { "status": "ok", "latencyMs": 8, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8",
        "serviceUrl": "http://ddwf:9080/v1",
        "createdAt": "2026-05-09T10:10:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [
      {
        "id": "inv_01j8m7p2q3r4s5t6u7v8w9x0z2",
        "serviceName": "IdentityRecoveryWorkflow",
        "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "failedAt": "2026-05-09T10:07:19.000Z",
        "errorMessage": "TerminalError: upstream provider returned a terminal failure"
      }
    ],
    "error": null
  }
}
```

Redeploy changes service registration. It does not clear invocation failures. If `recentFailures` still lists a failed workflow key, inspect that workflow before closing the incident.

If doctor still shows old deployment records beside the deployment you want to keep, continue with stale deployment cleanup.

## Recovery order [#recovery-order]

| Step | Route                                        | Purpose                                                                                   | State change          |
| ---- | -------------------------------------------- | ----------------------------------------------------------------------------------------- | --------------------- |
| 1    | `POST /api/v2/admin/operator/restate/doctor` | Inspect current health.                                                                   | None.                 |
| 2    | One write route                              | Fix the one unhealthy component doctor identified.                                        | Depends on the route. |
| 3    | `POST /api/v2/admin/operator/restate/doctor` | Confirm the affected component moved to `ok` or that the remaining failure is understood. | None.                 |

Do not skip the first doctor call. The write routes share the admin API dependency, and a broken admin path makes the write routes fail or block.
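
The three steps can be sketched as one wrapper that refuses to call a write route until the operator has read the first doctor output. This is a minimal illustration under stated assumptions: `operator_post` and `inspect_act_verify` are hypothetical names, and the confirmation prompt is an added safeguard that DALP does not require.

```bash
# Transport helper mirroring the curl calls in the quickstart.
# Note: curl -sS exits 0 on HTTP error responses, so the "|| return 1"
# checks below only catch transport failures, not DAPI error bodies.
operator_post() {
  curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/$1" \
    -H "X-Api-Key: $DALP_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$2"
}

# Hypothetical wrapper: doctor, one confirmed write route, doctor again.
# $1 = write route name, $2 = JSON body for that route.
inspect_act_verify() {
  write_route=$1
  write_body=$2

  echo "== doctor (before) =="
  operator_post doctor '{}' || return 1
  echo

  # Pause so the operator can read the doctor output before any write.
  printf 'Run %s now? [y/N] ' "$write_route"
  read -r answer || answer=""
  if [ "$answer" != "y" ]; then
    echo "aborted: no write route was called"
    return 1
  fi

  echo "== $write_route =="
  operator_post "$write_route" "$write_body" || return 1
  echo

  echo "== doctor (after) =="
  operator_post doctor '{}' || return 1
  echo
}
```

For example, `inspect_act_verify force-redeploy '{"serviceUrl":"http://ddwf:9080/v1","force":true}'` runs the quickstart sequence with a pause before the write.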

## Doctor [#doctor]

`doctor` is read-only. It probes the Durable Execution Engine ingress URL, admin API, deployment list, service list, and invocation table. A failed probe marks only its own component as `degraded` or `unreachable`. The route itself fails only when DALP cannot resolve the admin URL before probing.

```http
POST /api/v2/admin/operator/restate/doctor
```

### Request body [#request-body]

Send an empty JSON object.

```json
{}
```

### Response fields [#response-fields]

| Field                                     | Type                               | Meaning                                                            |
| ----------------------------------------- | ---------------------------------- | ------------------------------------------------------------------ |
| `ingress.status`                          | `ok`, `unreachable`, or `degraded` | Ingress probe result.                                              |
| `ingress.latencyMs`                       | Number in milliseconds or `null`   | Probe latency when measured.                                       |
| `admin.status`                            | `ok`, `unreachable`, or `degraded` | Admin API probe result.                                            |
| `admin.latencyMs`                         | Number in milliseconds or `null`   | Admin API probe latency when measured.                             |
| `admin.version`                           | String or `null`                   | Admin API version when returned.                                   |
| `deployments.status`                      | `ok`, `unreachable`, or `degraded` | Deployment list probe result.                                      |
| `deployments.items[].id`                  | String                             | Durable Execution Engine deployment id.                            |
| `deployments.items[].serviceUrl`          | URL string                         | Registered service URL. Preserve the exact value for cleanup.      |
| `deployments.items[].createdAt`           | ISO 8601 UTC timestamp             | Deployment creation time returned by the admin API.                |
| `services.status`                         | `ok`, `unreachable`, or `degraded` | Service list probe result.                                         |
| `services.items[].name`                   | String                             | Registered service name.                                           |
| `services.items[].revision`               | Number or omitted                  | Service revision when returned.                                    |
| `invocations.status`                      | `ok`, `unreachable`, or `degraded` | Invocation table probe result.                                     |
| `invocations.byStatus`                    | Object keyed by invocation status  | Invocation counts by status.                                       |
| `invocations.recentFailures`              | Array, maximum 20 items            | Recent failed invocations ordered by most recently modified first. |
| `invocations.recentFailures[].serviceKey` | String or `null`                   | Workflow key for recovery. Use only when it is present.            |
| `error` on readout objects                | String or `null`                   | Human-readable probe failure reason for that component.            |
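
Because each component reports its own status, a quick filter over the five `status` fields shows where to look first. The sketch below is a hypothetical helper that takes `name=status` pairs copied by hand from a doctor response. It does not parse JSON, and the output format is an assumption.

```bash
# Hypothetical helper: print the components whose status is not "ok".
# Arguments are name=status pairs copied from a doctor response.
unhealthy_components() {
  for pair in "$@"; do
    name=${pair%%=*}
    status=${pair#*=}
    case $status in
      ok) ;;
      degraded|unreachable) echo "$name $status" ;;
      # Anything outside the three documented values is flagged as unknown.
      *) echo "$name unknown-status:$status" ;;
    esac
  done
}
```

An empty result means every listed component reported `ok`.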

## Force redeploy [#force-redeploy]

`force-redeploy` registers the durable workflow service URL with the Durable Execution Engine admin API. It returns the deployment id for the registration. It does not drain or delete old deployments.

```http
POST /api/v2/admin/operator/restate/force-redeploy
```

### Request body [#request-body-1]

| Field        | Type       | Required | Constraints and defaults                                                                                                                                                                                                                                              |
| ------------ | ---------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `serviceUrl` | URL string | Yes      | Must be the live durable workflow service URL that should be registered. Use configured service state or deployment metadata. Do not reuse an old doctor URL when you are replacing a stale deployment. Credentials in URLs are redacted in the operations audit log. |
| `force`      | Boolean    | No       | Defaults to `true`. When `true`, DALP sends a forced registration request to the admin API.                                                                                                                                                                           |

```json
{
  "serviceUrl": "http://ddwf:9080/v1",
  "force": true
}
```

### Success response [#success-response]

```json
{
  "acknowledged": true,
  "deploymentId": "dp_01j8m8b9c0d1e2f3g4h5j6k7m8"
}
```

### Operational notes [#operational-notes]

* Run `doctor` before this route so you can compare deployment state before and after registration.
* Run `cleanup-stale-deployments` after redeploy when doctor still shows old deployment rows.
* `acknowledged: true` means registration completed. It does not mean stale deployments were cleaned up.

## Cleanup stale deployments [#cleanup-stale-deployments]

`cleanup-stale-deployments` keeps the deployment matching `serviceUrl`. DALP resolves that deployment id server-side, then drains and deletes every other registered deployment. A failure to list deployments returns an admin-unreachable error. Per-deployment drain or delete failures are logged by the cleanup helper and do not change the success response shape.

```http
POST /api/v2/admin/operator/restate/cleanup-stale-deployments
```

### Request body [#request-body-2]

| Field        | Type       | Required | Constraints and defaults                                                                                                          |
| ------------ | ---------- | -------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `serviceUrl` | URL string | Yes      | Must exactly match the registered service URL of the deployment to keep. Copy it from `deployments.items[].serviceUrl` in doctor. |
| `forceDrain` | Boolean    | No       | Defaults to `false`. Use `true` only when stale deployments point to dead services and cannot drain normally.                     |

```json
{
  "serviceUrl": "http://ddwf:9080/v1",
  "forceDrain": false
}
```
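
A call to this route might look like the sketch below, which reuses the headers from the quickstart. `cleanup_body` and `cleanup_stale_deployments` are hypothetical helper names. The body builder assumes the URL contains no double quotes or backslashes.

```bash
# Hypothetical helper: build the cleanup request body.
# $1 = exact serviceUrl of the deployment to keep.
# $2 = forceDrain, "true" or "false" (defaults to false).
cleanup_body() {
  force_drain=${2:-false}
  case $force_drain in
    true|false) ;;
    *) echo "forceDrain must be true or false" >&2; return 1 ;;
  esac
  printf '{"serviceUrl":"%s","forceDrain":%s}\n' "$1" "$force_drain"
}

# Send the cleanup request with the same headers as the quickstart calls.
cleanup_stale_deployments() {
  body=$(cleanup_body "$@") || return 1
  curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/cleanup-stale-deployments" \
    -H "X-Api-Key: $DALP_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$body"
}
```

Calling `cleanup_stale_deployments "http://ddwf:9080/v1"` sends the request body shown above with `forceDrain` left at `false`.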

### Success response [#success-response-1]

```json
{
  "acknowledged": true
}
```

### Operational notes [#operational-notes-1]

* `forceDrain: false` is the normal production value. DALP drains stale deployments through the admin API, but it can still kill pinned invocations when a stale deployment is unreachable and cannot drain normally.
* `forceDrain: true` is for a known dead stale deployment where you accept forced drain behavior even when pending invocations remain.
* Run doctor after cleanup and inspect `deployments.items` before you close the incident.
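
Since cleanup requires an exact `serviceUrl` match, a client-side check before the call can catch a stray trailing slash or a changed port. The sketch below is a hypothetical pre-flight helper. DALP performs its own lookup server-side, so this only saves a round trip.

```bash
# Hypothetical pre-flight check: succeed only when the URL to keep is
# byte-for-byte equal to one of the serviceUrl values listed by doctor.
# $1 = URL to keep
# remaining arguments = deployments.items[].serviceUrl values from doctor
service_url_is_registered() {
  keep=$1
  shift
  for registered in "$@"; do
    [ "$keep" = "$registered" ] && return 0
  done
  return 1
}
```

A plain string comparison is deliberate here: normalizing the URL would hide exactly the differences that cause `RESTATE_DEPLOYMENT_NOT_FOUND`.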

## Recover stuck workflow [#recover-stuck-workflow]

`recover-stuck-workflow` prepares one workflow key for retry. DALP queries prior invocations for the supplied `(serviceName, serviceKey)` pair, refuses unsafe recovery states, kills and purges terminal-failed invocations, then clears keyed workflow state. The next workflow submission starts from a blank state.

```http
POST /api/v2/admin/operator/restate/recover-stuck-workflow
```

### Request body [#request-body-3]

| Field         | Type   | Required | Constraints and defaults                                                                                                       |
| ------------- | ------ | -------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `serviceName` | String | Yes      | Must match `^[a-zA-Z0-9_-]+$`. Use the service name from doctor, failed invocation metadata, or workflow metadata. No default. |
| `serviceKey`  | String | Yes      | Must match `^[a-zA-Z0-9_-]+$`. Use the exact workflow key for the failed invocation. No default.                               |

```json
{
  "serviceName": "IdentityRecoveryWorkflow",
  "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1"
}
```
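
Both fields must match `^[a-zA-Z0-9_-]+$`, so checking them locally before the call avoids a rejected request. The sketch below is a minimal illustration. `is_valid_identifier` and `recover_stuck_workflow` are hypothetical helper names, and the server remains the authority on validation.

```bash
# Return success only for a non-empty string of letters, digits, "_" or "-",
# mirroring the documented pattern ^[a-zA-Z0-9_-]+$.
is_valid_identifier() {
  case $1 in
    ''|*[!a-zA-Z0-9_-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: validate both fields, then call the recovery route.
# $1 = serviceName, $2 = serviceKey.
recover_stuck_workflow() {
  service_name=$1
  service_key=$2
  if ! is_valid_identifier "$service_name" || ! is_valid_identifier "$service_key"; then
    echo 'serviceName and serviceKey must match ^[a-zA-Z0-9_-]+$' >&2
    return 1
  fi
  # Safe to interpolate: validated values contain no quotes or backslashes.
  curl -sS -X POST "$DALP_API_URL/api/v2/admin/operator/restate/recover-stuck-workflow" \
    -H "X-Api-Key: $DALP_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"serviceName\":\"$service_name\",\"serviceKey\":\"$service_key\"}"
}
```

Validating first also makes the JSON interpolation safe, because the allowed character set excludes every character that would need escaping.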

### Success response [#success-response-2]

```json
{
  "acknowledged": true
}
```

### Blocked recovery response [#blocked-recovery-response]

DALP refuses recovery when the workflow has an active invocation or has already succeeded. The error includes a `reason` and the relevant invocation ids.

```json
{
  "code": "RESTATE_WORKFLOW_RETRY_BLOCKED",
  "message": "Workflow retry is blocked by an active invocation",
  "data": {
    "reason": "active-invocation",
    "invocationIds": ["inv_01j8m7p2q3r4s5t6u7v8w9x0z2"]
  }
}
```

### Retry blocked reasons [#retry-blocked-reasons]

| Reason              | What DALP observed                                                                                                                                                                                                                                      | State change                                                            | Operator response                                                                                                                                            | Retry semantics                                                                                        |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------ |
| `active-invocation` | A matching `run` invocation is not terminal. DALP treats every status except `completed`, `killed`, and `cancelled` as active, including `pending`, `scheduled`, `ready`, `running`, `suspended`, `backing-off`, `paused`, and unknown future statuses. | DALP does not kill, purge, or clear state.                              | Inspect the returned invocation ids. Wait for completion or handle the active invocation through the Durable Execution Engine admin tooling before retrying. | Retry only after the active invocation is no longer active.                                            |
| `already-succeeded` | A prior matching invocation already completed successfully.                                                                                                                                                                                             | DALP does not kill, purge, or clear state.                              | Do not replay the workflow blindly. Investigate why the caller or UI still reports a stuck state.                                                            | Treat as terminal for recovery unless new evidence shows a different workflow key is stuck.            |
| `purge-failed`      | DALP could not purge a terminal-failed invocation for a non-transport reason.                                                                                                                                                                           | DALP may have completed earlier recovery steps before the purge failed. | Inspect admin logs and retry after the purge failure is resolved.                                                                                            | Retry after the purge condition is fixed. Transport failures surface as `RESTATE_ADMIN_UNREACHABLE`.   |
| `query-failed`      | DALP could not query invocation state for a non-transport reason.                                                                                                                                                                                       | DALP does not clear workflow state.                                     | Escalate the malformed or unexpected admin response.                                                                                                         | Retry only after the query path is healthy. Transport failures surface as `RESTATE_ADMIN_UNREACHABLE`. |

## Error reference [#error-reference]

| Error                             | HTTP class                                       | What DALP observed                                                                                                | State change and audit behavior                                                                                                                                      | Operator response                                                                                                   | Retry semantics                                           |
| --------------------------------- | ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- |
| Missing system operate permission | Authorization failure                            | The caller is authenticated but not authorized for operator routes, or the API key lacks the required permission. | The recovery handler does not run. Authorization failure behavior follows the operator route authorization layer.                                                    | Use an operator account or API key with system operate permission.                                                  | Retry only with corrected credentials.                    |
| `RESTATE_ADMIN_UNREACHABLE`       | 5xx class                                        | DALP could not resolve the admin URL or could not reach the Durable Execution Engine admin API.                   | The target recovery action is not acknowledged. Operator middleware records the failed attempt when the route reaches the operator handler.                          | Check admin URL configuration, DNS, network policy, and admin API health. Run doctor after the admin path recovers. | Retry after connectivity or configuration is fixed.       |
| `RESTATE_DEPLOYMENT_NOT_FOUND`    | 404 class                                        | The supplied `serviceUrl` does not map to a registered deployment when DALP needs that mapping.                   | DALP does not clean stale deployments because it cannot identify the deployment to keep. Operator middleware records the failed attempt.                             | Run doctor and retry with the exact registered `serviceUrl`.                                                        | Retry with the exact service URL from doctor.             |
| `RESTATE_WORKFLOW_RETRY_BLOCKED`  | 409 class for active or succeeded workflow state | Workflow recovery found a condition that makes clearing state unsafe.                                             | DALP does not clear state for active or already-succeeded workflows. The error returns `reason` and `invocationIds`. Operator middleware records the failed attempt. | Inspect the returned reason and invocation ids before taking further action.                                        | Depends on `reason`; use the retry blocked reasons table. |

DAPI error responses include the standard request identifier for support correlation. Keep that identifier with the incident record and the operations audit row.
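
For scripted runbooks, the error table can be condensed into a lookup from error code to next step. The sketch below is a hypothetical helper. The wording of each hint paraphrases the table, and the fallback branch for unrecognized codes is an assumption.

```bash
# Hypothetical helper: suggest a next step for a DAPI error code.
# $1 = the "code" field from the error response.
next_step_for_error() {
  case $1 in
    RESTATE_ADMIN_UNREACHABLE)
      echo "Fix admin URL, DNS, or network path, then rerun doctor." ;;
    RESTATE_DEPLOYMENT_NOT_FOUND)
      echo "Rerun doctor and retry with the exact registered serviceUrl." ;;
    RESTATE_WORKFLOW_RETRY_BLOCKED)
      echo "Inspect data.reason and data.invocationIds before retrying." ;;
    *)
      # Assumption: anything else is escalated with the request identifier.
      echo "Unrecognized code: keep the request identifier and escalate." ;;
  esac
}
```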

## Audit, security, and production boundaries [#audit-security-and-production-boundaries]

| Topic                      | Behavior                                                                                                                                                                                           |
| -------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Permission boundary        | All four routes require system operate permission and run through operator route middleware.                                                                                                       |
| Audit log                  | Operator route attempts create an operations audit row with actor user id, role at time, route, sanitized arguments, reason when supplied, start time, finish time, outcome, and error on failure. |
| Credentialed URL redaction | If an argument contains a URL with username or password information, the stored audit argument replaces that user information with `REDACTED` and preserves the host, path, query, and fragment.   |
| Tenant scope               | The operations audit log is platform-global and keyed by operator actor, not tenant row ownership.                                                                                                 |
| PCI scope                  | These recovery routes do not accept cardholder data. Do not put card data, credentials, secrets, or private keys in request bodies.                                                                |
| KYC scope                  | These routes do not perform KYC checks. They recover workflow infrastructure state and do not change identity verification status.                                                                 |
| Idempotency                | No idempotency key contract is documented for these routes. Use doctor before and after each write route instead of blind retries.                                                                 |
| Timestamps                 | Doctor timestamps use ISO 8601 UTC strings.                                                                                                                                                        |
| Availability               | A healthy doctor response is operational evidence for this recovery surface. It is not an SLA, failover drill, or disaster recovery proof.                                                         |
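
The credentialed URL redaction row can be illustrated with a short filter. This is a sketch of the described behavior, not DALP's implementation: `redact_url_userinfo` is a hypothetical name, and the exact stored form (a single `REDACTED` token in place of the whole userinfo part) is an assumption.

```bash
# Hypothetical sketch of userinfo redaction for a URL argument.
# Replaces "user:password@" (or "user@") after the scheme with "REDACTED@"
# and leaves host, port, path, query, and fragment untouched.
redact_url_userinfo() {
  printf '%s\n' "$1" |
    sed -E 's|^([a-zA-Z][a-zA-Z0-9+.-]*://)[^/?#]*@|\1REDACTED@|'
}
```

The pattern only looks for `@` before the first `/`, `?`, or `#`, so an `@` that appears later in the path or query is left alone. Redaction in the audit log does not make credentialed URLs safe to send, so keep secrets out of request bodies regardless.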

## Related pages [#related-pages]

* [Durable Execution Engine operator API reference](/docs/developer-guides/api-integration/durable-execution-engine-operator-api)
* [DAPI error reference](/docs/developer-guides/api-integration/dapi-error-reference)
* [Authorization architecture](/docs/architecture/security/authorization)
* [Blockchain monitoring operations](/docs/developer-guides/operations/blockchain-monitoring)
* [Transaction tracking operations](/docs/developer-guides/operations/transaction-tracking)