Restate workflow recovery
Recover Restate-backed DALP workflows with operator-only DAPI routes for health checks, redeployment, stale deployment cleanup, and stuck workflow recovery.
Run the doctor route first. It reports Restate ingress, admin API, deployment, service, and invocation health without changing workflow state.
Use the write routes only after you know which condition you are fixing. Each route requires an authenticated operator with the system operate permission. DALP records operator route attempts for audit, including the actor, route, request arguments, reason when supplied, outcome, and error message when a route fails.
Before you start
Confirm you have:
- access to the DALP admin API for the environment you are operating
- an operator account or API key with system operate permission
- the Restate service URL used by the durable workflow service, when you need to redeploy or clean deployments
- the Restate workflow
serviceNameandserviceKey, when you need to clear a blocked workflow
Set a base URL for the examples:
export DALP_API_URL="https://your-platform.example.com"
export DALP_API_KEY="sm_dalp_xxxxxxxxxxxxxxxx"1. Check workflow health
Run the doctor route first:
curl -X POST "$DALP_API_URL/api/v2/admin/operator/restate/doctor" \
-H "X-Api-Key: $DALP_API_KEY" \
-H "Content-Type: application/json" \
-d '{}'The response has five independent components:
| Component | What it tells you |
|---|---|
ingress | Whether the Restate ingress URL responds and how long it took. |
admin | Whether the Restate admin API responds and which version it reports. |
deployments | Registered Restate deployments and their service URLs. |
services | Registered Restate service names and revisions. |
invocations | Invocation counts by status and up to 20 recent failures. |
Each component has its own status: ok, unreachable, or degraded. A failed sub-check degrades only that component, so read the whole response before choosing a recovery action.
Example response shape:
{
"ingress": { "status": "ok", "latencyMs": 12 },
"admin": { "status": "ok", "latencyMs": 9, "version": "1.4.0" },
"deployments": {
"status": "ok",
"items": [
{
"id": "dp_01j...",
"serviceUrl": "http://ddwf:9080",
"createdAt": "2026-05-09T10:00:00.000Z"
}
],
"error": null
},
"services": { "status": "ok", "items": [{ "name": "IdentityRecoveryWorkflow", "revision": 3 }], "error": null },
"invocations": {
"status": "ok",
"byStatus": { "invoked": 2, "suspended": 1 },
"recentFailures": [],
"error": null
}
}2. Re-register the durable workflow service
Use force-redeploy when the service needs to be registered again with Restate.
curl -X POST "$DALP_API_URL/api/v2/admin/operator/restate/force-redeploy" \
-H "X-Api-Key: $DALP_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"serviceUrl": "http://ddwf:9080",
"force": true
}'A successful response acknowledges the registration and returns the Restate deployment id:
{
"acknowledged": true,
"deploymentId": "dp_01j..."
}force defaults to true. This route does not remove old deployments. If doctor still shows old deployments after redeploying, run stale deployment cleanup next.
3. Remove stale deployments
Use cleanup-stale-deployments after you know which service URL should remain active. DALP keeps the deployment matching serviceUrl and drains every other registered deployment.
curl -X POST "$DALP_API_URL/api/v2/admin/operator/restate/cleanup-stale-deployments" \
-H "X-Api-Key: $DALP_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"serviceUrl": "http://ddwf:9080",
"forceDrain": false
}'A successful response is:
{ "acknowledged": true }Use forceDrain: true only when stale deployments point to dead services and still have pending invocations that cannot drain normally. Confirm the result with doctor or the Restate admin deployment list.
4. Clear a blocked workflow for retry
Use recover-stuck-workflow when a specific workflow key is wedged and must start again from a blank state.
curl -X POST "$DALP_API_URL/api/v2/admin/operator/restate/recover-stuck-workflow" \
-H "X-Api-Key: $DALP_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"serviceName": "InvitationWorkflow",
"serviceKey": "invitation_01j8m7k2q3r4s5t6u7v8w9x0y1_1770000000000"
}'serviceName and serviceKey may contain letters, numbers, underscores, and hyphens. A successful response is:
{ "acknowledged": true }This route kills and purges prior invocations for the supplied workflow key, then clears keyed workflow state. The next workflow submission starts from a blank state.
DALP refuses to clear a workflow when recovery would hide a still-active invocation or a workflow that already succeeded. In that case, the route returns RESTATE_WORKFLOW_RETRY_BLOCKED with a structured reason and the relevant invocation ids.
Recovery error codes
| Error code | Meaning | What to do |
|---|---|---|
OFFCHAIN_USER_PERMISSION_REQUIRED | The caller does not have system operate permission. | Use an operator account or update the caller's role before retrying. |
RESTATE_ADMIN_UNREACHABLE | DALP could not resolve or reach the Restate admin API. | Check Restate admin connectivity, then rerun doctor. |
RESTATE_DEPLOYMENT_NOT_FOUND | Cleanup or redeploy could not find the expected deployment for the supplied service URL. | Run doctor and use the exact registered service URL. |
RESTATE_WORKFLOW_RETRY_BLOCKED | Workflow recovery found an active invocation, an already succeeded invocation, or a purge/query condition that prevents safe retry. | Inspect the returned reason and invocationIds before deciding whether to retry later or investigate manually. |