Fix stuck workflows with the operator API

Operator API routes for workflow engine health checks, service re-registration, stale deployment cleanup, and preparing stuck workflows for retry.

These routes let platform operators inspect workflow engine health and run narrow recovery steps without touching tenant-facing product flows. Start with the doctor route, choose the one write route that matches the failing component, then run doctor again to confirm the result.

Call these endpoints from an authenticated account that holds the system operate permission. Restrict them to operator accounts and trusted automation; never call them from tenant-facing product flows.

Recovery order

Step	Route	Purpose
1	`POST /api/v2/admin/operator/workflow-engine/doctor`	Inspect ingress, health API, deployments, services, and invocation state without changing workflow state.
2	One write route	Re-register the service, remove stale deployments, or prepare one workflow key for retry.
3	`POST /api/v2/admin/operator/workflow-engine/doctor`	Verify that the affected component moved back to `ok` or that the remaining failure is understood.

The exact route path contains the current public API service segment. Treat that segment as part of the endpoint path only; do not use it as a product name.
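The three-step order above can be sketched as a small helper. This is a minimal illustration, not a DALP client: the `post` callable, the `run_recovery` name, and the returned dictionary shape are assumptions made for the example. Only the route paths come from this page. Authentication and error handling are left to whatever transport you pass in.

```python
from typing import Any, Callable, Dict

BASE_PATH = "/api/v2/admin/operator/workflow-engine"

# The three write routes documented on this page.
WRITE_ROUTES = {
    "force-redeploy",
    "cleanup-stale-deployments",
    "recover-stuck-workflow",
}

# Hypothetical transport: POSTs a JSON body to a path and returns the parsed reply.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_recovery(post: Post, write_route: str, write_body: Dict[str, Any]) -> Dict[str, Any]:
    """Run doctor, exactly one write route, then doctor again."""
    if write_route not in WRITE_ROUTES:
        raise ValueError(f"unknown write route: {write_route!r}")
    before = post(f"{BASE_PATH}/doctor", {})  # read-only inspection
    write_result = post(f"{BASE_PATH}/{write_route}", write_body)
    after = post(f"{BASE_PATH}/doctor", {})  # verify the outcome
    return {"before": before, "write": write_result, "after": after}
```

Passing the transport in as a parameter keeps the sequence easy to test with a stub before you point it at a real environment.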

Component statuses

Doctor responses use the same status vocabulary for each component.

Status	Meaning
`ok`	The component responded and its payload matched the expected shape.
`degraded`	The component responded, but it returned an error status or an unexpected payload.
`unreachable`	DALP could not reach the component within the route timeout or could not complete the probe.

A degraded sub-check does not fail the whole doctor route. Read every check result before you choose a recovery step.
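Because one degraded check does not fail the route, it helps to scan every component rather than stop at the first result. The sketch below assumes the top-level keys shown in the doctor success response on this page (`ingress`, `admin`, `deployments`, `services`, `invocations`). The function name and the `"missing"` label for an absent component are illustrative choices, not part of the API.

```python
from typing import Any, Dict

# Top-level keys taken from the doctor success response on this page.
COMPONENTS = ("ingress", "admin", "deployments", "services", "invocations")


def components_needing_attention(doctor: Dict[str, Any]) -> Dict[str, str]:
    """Return {component: status} for every check that is not "ok"."""
    problems: Dict[str, str] = {}
    for name in COMPONENTS:
        # "missing" is a local label for a component absent from the payload.
        status = doctor.get(name, {}).get("status", "missing")
        if status != "ok":
            problems[name] = status
    return problems
```

An empty result means every documented component reported `ok`.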

Doctor

`doctor` is the read-only entry point. It probes the workflow engine ingress URL, health API, deployment list, service list, and invocation table. Send an empty JSON object as the request body.

POST /api/v2/admin/operator/workflow-engine/doctor

Success response

{
  "ingress": { "status": "ok", "latencyMs": 12 },
  "admin": { "status": "ok", "latencyMs": 9, "version": "1.4.0" },
  "deployments": {
    "status": "ok",
    "items": [
      {
        "id": "dp_01j8m7k2q3r4s5t6u7v8w9x0y1",
        "serviceUrl": "https://workflow-service.example.com",
        "createdAt": "2026-05-09T10:00:00.000Z"
      }
    ],
    "error": null
  },
  "services": {
    "status": "ok",
    "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }],
    "error": null
  },
  "invocations": {
    "status": "ok",
    "byStatus": { "invoked": 2, "suspended": 1 },
    "recentFailures": [],
    "error": null
  }
}

`recentFailures` returns up to 20 recent failed invocations. Each item includes `id`, `serviceName`, `serviceKey`, `failedAt`, and `errorMessage`.
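Grouping recent failures by workflow key can show which (serviceName, serviceKey) pairs fail repeatedly and may be candidates for recovery. The sketch below relies only on the item fields listed above. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

WorkflowKey = Tuple[str, str]  # (serviceName, serviceKey)


def failures_by_workflow_key(doctor: Dict[str, Any]) -> Dict[WorkflowKey, List[Dict[str, Any]]]:
    """Group doctor's recentFailures items by (serviceName, serviceKey)."""
    grouped: Dict[WorkflowKey, List[Dict[str, Any]]] = defaultdict(list)
    failures = doctor.get("invocations", {}).get("recentFailures", [])
    for failure in failures:
        key = (failure["serviceName"], failure["serviceKey"])
        grouped[key].append(failure)
    return dict(grouped)
```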

Force redeploy

`force-redeploy` registers the workflow service URL with the workflow engine health API. It does not remove old deployments. To remove them, run `cleanup-stale-deployments` when the doctor output shows old deployment records.

POST /api/v2/admin/operator/workflow-engine/force-redeploy

Request body

Field	Type	Required	Description
`serviceUrl`	URL string	Yes	Service URL to register. Use the URL returned by doctor or the configured workflow service endpoint.
`force`	boolean	No	Defaults to `true`. Passes a forced registration request to the health API.

{
  "serviceUrl": "https://workflow-service.example.com",
  "force": true
}

Success response

{
  "acknowledged": true,
  "deploymentId": "dp_01j8m7k2q3r4s5t6u7v8w9x0y1"
}
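A client can check the request body before sending it. The sketch below builds the documented `force-redeploy` body and rejects values that do not parse as an absolute URL. The function name is hypothetical, and the restriction to `http` and `https` schemes is an assumption: this page only says the field is a URL string.

```python
from typing import Any, Dict
from urllib.parse import urlparse


def force_redeploy_body(service_url: str, force: bool = True) -> Dict[str, Any]:
    """Build the force-redeploy request body; `force` defaults to true as documented."""
    parsed = urlparse(service_url)
    # Assumption: only absolute http(s) URLs are meaningful service URLs.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"serviceUrl is not an absolute http(s) URL: {service_url!r}")
    return {"serviceUrl": service_url, "force": force}
```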

Cleanup stale deployments

`cleanup-stale-deployments` keeps the deployment matching `serviceUrl` and drains every other registered deployment. Use this route after `doctor` or `force-redeploy` confirms your active service URL.

POST /api/v2/admin/operator/workflow-engine/cleanup-stale-deployments

Request body

Field	Type	Required	Description
`serviceUrl`	URL string	Yes	Service URL of the deployment to keep. Every other registered deployment is treated as stale.
`forceDrain`	boolean	No	Defaults to `false`. Use `true` only when stale deployments point to dead services and cannot drain normally.

{
  "serviceUrl": "https://workflow-service.example.com",
  "forceDrain": false
}

Success response

{ "acknowledged": true }

If DALP cannot list deployments or reach the health API, the route returns a workflow-engine-unreachable error instead of acknowledging cleanup.
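Because cleanup drains every deployment except the one you name, it can be worth previewing the split from doctor output first. The sketch below partitions the `deployments.items` list from a doctor response. It assumes an exact string match on `serviceUrl`, which may differ from how the server compares URLs. The function name is hypothetical.

```python
from typing import Any, Dict, List, Tuple

Deployment = Dict[str, Any]


def preview_cleanup(doctor: Dict[str, Any], service_url: str) -> Tuple[List[Deployment], List[Deployment]]:
    """Split doctor's deployment items into (kept, stale) for a given serviceUrl."""
    items = doctor.get("deployments", {}).get("items", [])
    # Assumption: exact string comparison; the server may normalise URLs differently.
    kept = [d for d in items if d.get("serviceUrl") == service_url]
    stale = [d for d in items if d.get("serviceUrl") != service_url]
    return kept, stale
```

If `kept` comes back empty, the supplied URL does not match any deployment in the doctor output, so recheck it before calling the cleanup route.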

Recover stuck workflow

`recover-stuck-workflow` prepares one workflow key for retry. DALP kills and purges prior invocations for the supplied (`serviceName`, `serviceKey`) pair, then clears keyed workflow state so the next submission starts from a blank state.

POST /api/v2/admin/operator/workflow-engine/recover-stuck-workflow

Request body

Field	Type	Required	Description
`serviceName`	string	Yes	Workflow service name. Allowed characters: letters, digits, hyphens, and `_`.
`serviceKey`	string	Yes	Workflow service key. Allowed characters: letters, digits, hyphens, and `_`.

{
  "serviceName": "IdentityRecoveryWorkflow",
  "serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1"
}

Success response

{ "acknowledged": true }

DALP refuses to clear a workflow when an active invocation is still running or when the previous invocation already succeeded. In either case, the route returns a structured retry-blocked error with `reason` and `invocationIds`. Inspect those fields before you retry or escalate.
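The character rules for `serviceName` and `serviceKey` can be checked on the client before the request is sent. The sketch below is one possible reading of those rules: it treats "letters" as ASCII letters and rejects empty strings, both of which are assumptions. The function name is hypothetical.

```python
import re
from typing import Dict

# Letters, digits, hyphens, and underscores. ASCII-only and non-empty are assumptions.
_ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def recover_stuck_workflow_body(service_name: str, service_key: str) -> Dict[str, str]:
    """Build the recover-stuck-workflow request body after checking allowed characters."""
    for field, value in (("serviceName", service_name), ("serviceKey", service_key)):
        if not _ALLOWED.fullmatch(value):
            raise ValueError(f"{field} contains characters outside letters, digits, '-', '_'")
    return {"serviceName": service_name, "serviceKey": service_key}
```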

Error conditions

Condition	Meaning	Operator response
Missing system operate permission	The caller is not authorised for operator routes.	Use an operator account or API key with the required permission.
Workflow engine health API unreachable	DALP could not resolve or reach the workflow engine health API.	Check connectivity to the health API, then rerun doctor.
Deployment not found	The supplied `serviceUrl` does not match a registered deployment when the route needs that mapping.	Run doctor and retry with the exact registered service URL.
Workflow retry blocked	Recovery found an active invocation, an already succeeded invocation, or a query or purge condition that prevents safe retry.	Inspect the returned `reason` and `invocationIds` before retrying or escalating.
