# Observability

Source: https://docs.settlemint.com/docs/architects/operability/observability

The observability stack provides platform visibility through metrics
collection, log aggregation, distributed tracing, and a connected Grafana
dashboard system with a single-glance platform health view, for deployments
that enable the observability chart.


## Overview [#overview]

DALP observability is the deployment telemetry layer for self-hosted environments that enable the observability chart. The stack collects metrics, structured logs, and distributed traces from platform components. You use those signals to inspect health, investigate incidents, and connect application behaviour to infrastructure state.

[Blockchain monitoring](/docs/developers/operations/blockchain-monitoring) covers chain RPC and indexer diagnostics exposed through the Platform API, the Console, and the CLI. The telemetry stack handles deployment-level concerns: collection, dashboard visualization, alert routing, and log or trace inspection.

This stack does not replace audit logs, compliance reports, or custody records. If you are a security reviewer, treat it as supporting evidence for operational visibility, not as the authoritative record for regulated activity, privileged operations, or custody decisions. During incidents, it helps you answer four questions: what changed, which component emitted the signal, where to look next, and which environment is affected.

## Three pillars [#three-pillars]

<Mermaid
  chart="`flowchart TB
  subgraph SOURCES[&#x22;Platform components&#x22;]
    UX(User Experience)
    ORCH(Orchestration)
    TX(Transaction Mgmt)
    BC(Blockchain Infra)
  end

  subgraph COLLECT[&#x22;Collection&#x22;]
    METRICS(Metrics Agent)
    LOGS(Log Collector)
    TRACES(Trace Agent)
  end

  subgraph STORE[&#x22;Storage&#x22;]
    TSDB(Time Series DB)
    LOGSDB(Log Storage)
    TRACEDB(Trace Storage)
  end

  subgraph VIEW[&#x22;Visualization&#x22;]
    DASH(Dashboards)
    ALERT(Alerting)
    SEARCH(Log Search)
  end

  UX --> METRICS
  UX --> LOGS
  UX --> TRACES
  ORCH --> METRICS
  ORCH --> LOGS
  ORCH --> TRACES
  TX --> METRICS
  TX --> LOGS
  TX --> TRACES
  BC --> METRICS
  BC --> LOGS
  BC --> TRACES

  METRICS --> TSDB
  LOGS --> LOGSDB
  TRACES --> TRACEDB

  TSDB --> DASH
  TSDB --> ALERT
  LOGSDB --> SEARCH
  TRACEDB --> DASH

  style UX fill:#5fc9bf,stroke:#3a9d96,stroke-width:2px,color:#fff
  style ORCH fill:#6ba4d4,stroke:#4a7ba8,stroke-width:2px,color:#fff
  style TX fill:#8571d9,stroke:#654bad,stroke-width:2px,color:#fff
  style BC fill:#b661d9,stroke:#8a3fb3,stroke-width:2px,color:#fff
  style METRICS fill:#5fc9bf,stroke:#3a9d96,stroke-width:2px,color:#fff
  style LOGS fill:#5fc9bf,stroke:#3a9d96,stroke-width:2px,color:#fff
  style TRACES fill:#5fc9bf,stroke:#3a9d96,stroke-width:2px,color:#fff
  style TSDB fill:#6ba4d4,stroke:#4a7ba8,stroke-width:2px,color:#fff
  style LOGSDB fill:#6ba4d4,stroke:#4a7ba8,stroke-width:2px,color:#fff
  style TRACEDB fill:#6ba4d4,stroke:#4a7ba8,stroke-width:2px,color:#fff
  style DASH fill:#8571d9,stroke:#654bad,stroke-width:2px,color:#fff
  style ALERT fill:#8571d9,stroke:#654bad,stroke-width:2px,color:#fff
  style SEARCH fill:#8571d9,stroke:#654bad,stroke-width:2px,color:#fff`"
/>

![API monitoring overview](/docs/screenshots/monitoring/api-monitoring-overview.webp)

### Metrics [#metrics]

Time-series metrics capture quantitative measurements over time. Counters track cumulative events, gauges measure current state, and histograms represent latency distributions and resource use.

| Metric category  | Examples                                      | Use case                   |
| ---------------- | --------------------------------------------- | -------------------------- |
| Request metrics  | Request counts, 4xx or 5xx rates, p95 latency | API performance monitoring |
| Resource metrics | CPU, memory, connections                      | Capacity planning          |
| Business metrics | Transactions, assets, users                   | Operational reporting      |
| Chain metrics    | Block lag, block age, finality lag, RPC state | Blockchain health triage   |
| Indexer metrics  | Sync failures, handler errors, backfill state | Live indexing triage       |

Platform API monitoring summaries aggregate request rollups by status class and return total requests, 4xx and 5xx counts, average duration, and p95 duration for the selected time range. Platform status endpoints roll up data freshness, transaction infrastructure, API activity, workflow execution, stat cards, and recent severity history for operator dashboards.
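
The rollup described above can be sketched in a few lines. This is a minimal illustration, not the Platform API implementation: the `RequestRecord` type, the `summarize` function, the result keys, and the nearest-rank p95 method are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    status: int         # HTTP status code of the response
    duration_ms: float  # request duration in milliseconds


def summarize(records):
    """Roll request records up into totals, error counts, and latency figures."""
    total = len(records)
    if total == 0:
        return {"total": 0, "count_4xx": 0, "count_5xx": 0, "avg_ms": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in records)
    # Nearest-rank p95 (an assumption): the smallest duration with at least
    # 95% of samples at or below it. Integer math avoids float rounding.
    rank = (95 * total + 99) // 100
    return {
        "total": total,
        "count_4xx": sum(1 for r in records if 400 <= r.status <= 499),
        "count_5xx": sum(1 for r in records if 500 <= r.status <= 599),
        "avg_ms": sum(durations) / total,
        "p95_ms": durations[rank - 1],
    }
```

Reporting the average and p95 together makes a slow tail visible when it affects only a small share of requests.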

Blockchain monitoring summaries read the latest health snapshots per service and expose chain ID, network name, service type, latest status, sync lag, block height, block age, finality lag, stall duration, and recent collector latencies.

### Logs [#logs]

Structured logs capture discrete events with context that operators can query. Correlation identifiers link related log entries across components.

DALP redacts common credential and token shapes before log records are written to configured sinks. Covered values include SettleMint access tokens, bearer tokens, private keys, provider access keys, webhook or integration tokens, RPC URLs that contain embedded keys, and email addresses. Redaction applies to log messages and structured fields, so you can use logs for debugging without intentionally storing those values.
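
To illustrate what redacting both messages and structured fields means, here is a small sketch. The two patterns (bearer tokens and email addresses) and the function names are illustrative assumptions. They are not DALP's actual rules, which cover more value shapes than shown here.

```python
import re

# Illustrative patterns only (assumption): bearer tokens and email addresses.
_PATTERNS = [
    (re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE), "Bearer [REDACTED]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
]


def redact_text(text):
    """Apply every pattern to a single string."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def redact_record(message, fields):
    """Redact the log message and, recursively, every string in the structured fields."""
    def walk(value):
        if isinstance(value, str):
            return redact_text(value)
        if isinstance(value, dict):
            return {key: walk(item) for key, item in value.items()}
        if isinstance(value, list):
            return [walk(item) for item in value]
        return value  # numbers, booleans, and None pass through unchanged

    return redact_text(message), walk(fields)
```

Redacting before the record reaches a sink means that downstream stores such as log search never receive the raw value.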

### Traces [#traces]

Distributed traces follow operations across component boundaries. Spans capture timing and metadata for each step. Trace visualization reveals bottlenecks and failure points in complex operations.
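
As a rough illustration of how span timing points to a bottleneck, the sketch below computes each span's self time, meaning its duration minus the time spent in its direct children. The `Span` shape and span names are hypothetical, and the calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    end_ms: float


def self_times(spans):
    """Map span_id to time spent in the span itself, excluding direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            duration = span.end_ms - span.start_ms
            child_total[span.parent_id] = child_total.get(span.parent_id, 0.0) + duration
    return {
        span.span_id: (span.end_ms - span.start_ms) - child_total.get(span.span_id, 0.0)
        for span in spans
    }


def bottleneck(spans):
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span.span_id])
```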

## Dashboard areas [#dashboard-areas]

Grafana dashboards can cover these monitoring areas when the observability stack and relevant exporters are enabled:

| Dashboard area        | Audience            | Example signals                                |
| --------------------- | ------------------- | ---------------------------------------------- |
| Operations overview   | Platform operators  | Request rates, error rates, latency            |
| Transaction monitor   | Operations team     | Pending transactions, gas usage, confirmations |
| Compliance activity   | Compliance officers | Verification volumes, approval rates           |
| Security overview     | Security team       | Authentication events, access patterns         |
| Infrastructure health | DevOps              | Resource utilization, node health              |

### Single-glance health and navigation [#single-glance-health-and-navigation]

The shipped dashboard set is built as one connected system rather than a loose collection of panels. It gives operators a single place to answer "is the platform healthy right now?" and a clear path to drill into the component that is broken, slow, or degraded.

The platform home dashboard shows one status tile per core service: Platform API, Console, Ledger Index, Workflow Engine, the webhook delivery queue, chain sync and node connectivity, the database connection pool, and the operator wallet balance. It also carries a live list of firing and pending alerts, error and saturation trends grouped by namespace, and a combined error-log stream. One screen tells you which area to look at first.

The alerts overview dashboard lists the active critical and warning alerts, the routing model that decides where each one is sent, and a timeline of recent alert state changes. Use it to see what is firing and where each alert routes.

Per-service dashboards open with a health row and end with a logs row, but the sections in between are specific to that service. The Platform API dashboard covers traffic, latency, error breakdown, cache behaviour, and onboarding flows. The Workflow Engine dashboard covers registrations, exchange-rate schedules, startup timing, failed executions, and handler traces. The Ledger Index dashboard covers indexing lag, RPC health, handler state, reorg detection, and backfill progress. Once the home view points you at a service, its dashboard is organized for that component's operations, with links into the log and trace backends for deeper inspection.

Every dashboard carries the same navigation bar: a platform-services menu, an infrastructure menu, and direct links back to the home and alerts views. You can move from the at-a-glance view into a specific service or infrastructure dashboard and back without leaving the system.

Use Grafana for deployment telemetry: cluster resource use, request behaviour, log search, trace inspection, and alert context. Use [blockchain monitoring](/docs/developers/operations/blockchain-monitoring) when the question is about a specific chain RPC or indexer service. That guide exposes the current service status, sync lag, block age, finality lag, stall duration, reindex state, raw health snapshots, timeline buckets, and live health events through the Platform API and CLI.

Before handoff, validate each public operator surface from its own owner page rather than treating a single dashboard as complete observability coverage. This page is the routing map. The linked pages carry the detailed setup instructions, endpoint reference, and operating guidance.

| Surface to validate            | Start with                                                                                         | What to confirm                                                                                                       |
| ------------------------------ | -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| Ingress and load balancer path | [Operational integration patterns](/docs/api-reference/reference/operational-integration-patterns) | The public route preserves the validated client IP before traffic reaches DALP.                                       |
| API usage and request health   | [API monitoring endpoints](/docs/api-reference/observability/api-monitoring)                       | Request volume, 4xx and 5xx rates, latency, endpoint metrics, request logs, and the live stream all return data.      |
| Platform status rollup         | [Platform status endpoints](/docs/api-reference/observability/platform-status)                     | Header verdict, data freshness, transactions, API activity, workflows, stat cards, and severity history are readable. |
| Chain RPC and indexer health   | [Blockchain monitoring](/docs/developers/operations/blockchain-monitoring)                         | Chain RPC freshness, indexer sync lag, reindex state, snapshots, and live health events resolve.                      |
| Helm observability stack       | This page and [self-hosting prerequisites](/docs/architects/self-hosting/prerequisites)            | The target environment enables the approved observability endpoint or the in-cluster chart.                           |
| Grafana dashboards and alerts  | This page                                                                                          | Dashboards, log search, trace inspection, alert labels, and routing identify the affected cluster.                    |
| Regulated records              | Audit logs, compliance reports, custody records, or the relevant business-flow page                | Observability supports triage but does not replace the authoritative record for regulated activity.                   |

![Detailed API request logs](/docs/screenshots/monitoring/api-monitoring-request-log.webp)

## Deployable components [#deployable-components]

The observability Helm chart can deploy the telemetry components used by self-hosted environments. The chart includes VictoriaMetrics for metrics storage, Grafana Alloy for telemetry collection, metrics-server and kube-state-metrics for Kubernetes resource and object metrics, Grafana for dashboards, Loki for logs, Prometheus node exporter for host metrics, and Tempo for traces.

Local development configurations enable the stack by default. Staging configurations disable the chart, and other deployment profiles may also disable it, so treat observability as a deployment option that must be enabled and configured for each environment.

### Grafana access on OpenShift [#grafana-access-on-openshift]

On OpenShift clusters, the observability chart can expose Grafana through an OpenShift Route when the Route API is available and the Grafana Route option is enabled. The Route targets the Grafana service, uses the configured host and path, and can include TLS settings such as edge, passthrough, or re-encrypt termination with the deployment's insecure-traffic policy.

This Route is disabled by default. Enable it only when your cluster should expose Grafana through the OpenShift Router instead of another ingress pattern, and keep the hostname, TLS policy, and access controls consistent with your organization's observability access model.

## Alerting [#alerting]

Alert rules can notify operators when metrics exceed thresholds or exhibit anomalous patterns.

| Alert category      | Example condition                                     | Severity |
| ------------------- | ----------------------------------------------------- | -------- |
| Error rate spike    | Error rate above threshold over a sustained window    | Critical |
| Latency degradation | P99 latency materially above baseline                 | Warning  |
| Resource exhaustion | High memory or CPU utilization                        | Warning  |
| Chain connectivity  | Sustained block production or RPC connectivity issues | Critical |
| Transaction failure | Transaction failure rate above threshold              | Warning  |

Alert labels include the originating cluster name from the deployment telemetry configuration. If you run multiple clusters, you can identify the affected environment before inspecting dashboards or logs.

### Severity-based routing [#severity-based-routing]

When Slack notifications are enabled, the stack routes alerts by severity instead of treating every one the same way. The routing tree groups alerts by folder and alert name, then by namespace and cluster, before splitting into two paths with different timing.

| Severity         | First notification | Reminder cadence |
| ---------------- | ------------------ | ---------------- |
| Critical         | 10 seconds         | Hourly           |
| Warning and info | 30 seconds         | Every 4 hours    |

Critical alerts notify faster and repeat more often so a paging-worthy condition is hard to miss. Warning and info alerts use a calmer cadence to keep your Slack channel readable.

How the two paths reach Slack depends on your configured delivery mode:

* **Bot mode** sends each severity to its own destination: a dedicated critical channel and a general operations channel. Each deployment sets the names and Slack credentials in its own configuration.
* **Webhook mode** delivers both severities to the single configured Slack destination. The severity timing split above still applies, but critical and non-critical alerts arrive in the same channel rather than separate ones.
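
The timing table and the two delivery modes combine as in the following sketch. The `Route` type, the function name, and the channel arguments are assumptions for illustration. The real routing is provisioned through the Grafana alerting configuration, not application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    first_notification_s: int  # delay before the first notification
    reminder_s: int            # interval between reminders
    destination: str           # where the notification is delivered


def route_alert(severity, mode, critical_channel, ops_channel, webhook_destination="webhook"):
    """Pick timing by severity, then the destination by delivery mode."""
    is_critical = severity == "critical"
    # Timing from the table above: critical 10 s / hourly, otherwise 30 s / every 4 hours.
    first, reminder = (10, 3600) if is_critical else (30, 4 * 3600)
    if mode == "bot":
        destination = critical_channel if is_critical else ops_channel
    elif mode == "webhook":
        destination = webhook_destination  # both severities share one destination
    else:
        raise ValueError(f"unknown delivery mode: {mode}")
    return Route(first, reminder, destination)
```

Note that the timing split is independent of the delivery mode: webhook mode changes only the destination.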

### Slack app setup [#slack-app-setup]

Bot mode needs a Slack app with a bot token. Create the app from a manifest so its scopes and bot user stay reproducible across workspaces. In Slack, open Your Apps, choose Create New App, then From an app manifest, select the target workspace, and paste the following:

```json
{
  "display_information": {
    "name": "DALP Alerts",
    "description": "Routes DALP platform alerts from Grafana to Slack.",
    "background_color": "#346eee"
  },
  "features": {
    "bot_user": {
      "display_name": "DALP Alerts",
      "always_online": true
    }
  },
  "oauth_config": {
    "scopes": {
      "bot": ["chat:write", "chat:write.public", "chat:write.customize"]
    }
  },
  "settings": {
    "org_deploy_enabled": false,
    "socket_mode_enabled": false,
    "token_rotation_enabled": false
  }
}
```

It requests the smallest scope set the integration needs:

| Scope                  | Purpose                                                                          |
| ---------------------- | -------------------------------------------------------------------------------- |
| `chat:write`           | Post alert messages to Slack.                                                    |
| `chat:write.public`    | Post to a public alert channel without inviting the bot first.                   |
| `chat:write.customize` | Set the message name and icon so alerts read as `DALP Alerts`, not the app name. |

Install the app to the workspace and copy the Bot User OAuth Token, which starts with `xoxb-`. Invite the bot to any private alert channel; the `chat:write.public` scope covers public channels without an invite.

Provide the token to the deployment through a Kubernetes Secret, not a value committed to source control. The observability stack reads the token from that Secret into an environment variable and references it from the provisioned contact points, so the token never lands in a ConfigMap. Set the critical and operations channel names in the same deployment configuration that selects bot mode.
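
If you write your own tooling around the same Secret, a defensive read of the token might look like the sketch below. The variable name `SLACK_BOT_TOKEN` is a hypothetical placeholder, not the name the observability chart uses. The only fact taken from this page is that bot tokens start with `xoxb-`.

```python
import os


def load_slack_bot_token(env=os.environ, var_name="SLACK_BOT_TOKEN"):
    """Read the bot token from the environment and fail fast if it looks wrong.

    `var_name` is a hypothetical placeholder for this example.
    """
    token = env.get(var_name, "")
    if not token.startswith("xoxb-"):
        # Never echo the token value itself into an error message or log.
        raise RuntimeError(f"{var_name} is missing or is not a bot token")
    return token
```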

### Notification content [#notification-content]

Each Slack notification is structured for immediate operator response rather than raw alert text.

* Color-coded by severity: critical notifications use one color, warning and info each use their own, and resolved notifications use a recovery color so operators can read state at a glance.
* Severity in the title: the notification title states the firing or resolved status, the alert name, and the severity level.
* Context labels: notifications include the infrastructure and chain identifiers that apply to the alert (cluster, namespace, pod, container, chain ID), so operators can scope the investigation immediately.
* Links for follow-up: when the alert rule supplies them, each notification carries a link row for the runbook, the related dashboard or panel, a silence link, and the alert source.

### Grouped notifications [#grouped-notifications]

Related alerts arrive as one Slack notification rather than a separate message per alert. The routing tree groups by folder and alert name, then scopes to namespace and cluster, so a fault that trips the same rule across several pods is delivered as a single notification. The title carries a firing count, such as `[FIRING:3]`, so operators see the spread of a group at a glance.

After the first notification for a group, the stack waits before sending a follow-up. New alerts that join an existing group, and alerts in that group that resolve, are collected and delivered together on the next update roughly five minutes later, instead of one message per change. The first-notification and reminder timing from the severity routing table still applies; the group update interval governs only the follow-ups between them.

When alerts in a group recover, the resolved alerts are summarized inside the same notification under a resolved count and list, so a single message can show what is still firing and what has cleared in one read.
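
The grouping behaviour can be approximated as follows. This is a sketch under assumptions: the alert dictionaries, the `pod` and `status` fields, and the output shape are invented for the example, and it only covers groups that still have firing alerts.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Group alerts by folder and alert name, scoped to namespace and cluster."""
    groups = defaultdict(list)
    for alert in alerts:
        key = (alert["folder"], alert["alertname"], alert["namespace"], alert["cluster"])
        groups[key].append(alert)
    return dict(groups)


def render_group(alerts):
    """Build one notification summary with a firing count and a resolved list."""
    firing = [a["pod"] for a in alerts if a["status"] == "firing"]
    resolved = [a["pod"] for a in alerts if a["status"] == "resolved"]
    return {
        "title": f"[FIRING:{len(firing)}] {alerts[0]['alertname']}",
        "firing": firing,
        "resolved": resolved,
    }
```

With this keying, the same rule firing on three pods in one namespace produces one message, while the same rule firing in a second namespace produces a separate one.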

### Maintenance windows [#maintenance-windows]

You can define maintenance windows that mute notifications during planned work. An active maintenance window applies to both the critical and the general routes, so expected disruption during maintenance does not page the on-call operator. Alerts still evaluate and appear in the alerts overview during a maintenance window; only the Slack notifications are suppressed.
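
The distinction between evaluation and notification can be sketched like this. The window representation (start and end timestamps, end exclusive) and the function names are assumptions for the example, not the stack's configuration format.

```python
from datetime import datetime, timezone


def in_maintenance(now, windows):
    """True when `now` falls inside any (start, end) window; end is exclusive."""
    return any(start <= now < end for start, end in windows)


def handle_alert(alert_name, now, windows):
    """Alerts are always evaluated and listed; only the Slack notification is muted."""
    return {
        "alert": alert_name,
        "listed_in_overview": True,
        "slack_notification": not in_maintenance(now, windows),
    }


# Example window: planned work from 02:00 to 04:00 UTC.
WINDOW = (
    datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
)
```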

### Live indexing alerts [#live-indexing-alerts]

When the observability chart is enabled, DALP can alert on live indexing health. Each alert includes the affected Kubernetes namespace, chain ID, and cluster name, so operators can triage one chain in one cluster without the problem being masked by healthy chains elsewhere.

| Alert signal              | What it indicates                                                                                                                                               | Triage hint                                                                                    |
| ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| Live indexing lag         | The live indexer is more than 1,000 blocks behind chain head, has not reduced that lag over a 30-minute lookback, and remains in that condition for 15 minutes. | Check whether indexer lag is rising, flat, or recovering, then compare indexer and RPC health. |
| Sync or handler errors    | The live indexer recorded sync failures or event-handler failures in the recent alert window.                                                                   | Query the handler-error metric by event and contract type to find the failing handler.         |
| Backfill not progressing  | The live indexer has pending backfill work and the pending count has not decreased over the monitored window.                                                   | Check indexer pod health, RPC latency, and whether the pending queue is growing or flat.       |
| Native-balance collection | Operator-wallet balances, balance fetches, or refresh-queue depth need attention.                                                                               | Use the affected chain ID and wallet address to confirm the balance or collector backlog.      |

For live-indexing alerts, start from the alert labels, then inspect the indexer dashboard and logs for the matching cluster and namespace, filtered by chain ID. Handler-error alerts fire at the chain level. Query the underlying handler-error metric by event and contract type to distinguish one failing handler from a chain-wide outage.
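
The live indexing lag condition from the table combines three checks: a threshold, a lookback comparison, and a hold period. The sketch below is one possible reading of it, assuming one lag sample per minute and treating "has not reduced that lag" as "current lag is at least the lag 30 minutes earlier". The function names and data shape are hypothetical, and the shipped alert rule is expressed as a metrics query, not Python.

```python
def condition_holds(lag_by_minute, minute, threshold=1000, lookback=30):
    """True when lag exceeds the threshold and has not dropped versus the lookback sample."""
    current = lag_by_minute.get(minute)
    earlier = lag_by_minute.get(minute - lookback)
    if current is None or earlier is None:
        return False  # not enough history to judge
    return current > threshold and current >= earlier


def alert_fires(lag_by_minute, now, hold=15):
    """True when the condition held at every minute of the last `hold` minutes."""
    return all(condition_holds(lag_by_minute, m) for m in range(now - hold, now + 1))
```

Under this reading, an indexer that is far behind but steadily catching up does not fire the alert, which matches the triage hint to check whether lag is rising, flat, or recovering.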

### Meta-transaction attribution metrics [#meta-transaction-attribution-metrics]

DALP records signer attribution for indexed domain events. For each event, the resolver checks the same transaction for the next `ExecutedForwardRequest` marker from a registered forwarder. Forwarder-marked events use the marker's signer. Direct and unmarked events use the transaction sender.

Forwarder attribution follows the forwarder's active window. When a deployment rotates to a new forwarder, DALP keeps the previous forwarder available for historical event blocks. DALP does not trust the previous forwarder for later events. Historical backfills use the markers that were active for the backfilled block range, so reindexing can rebuild signer attribution for older forwarded transactions.
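
The resolution rule can be sketched as below. Everything in it is illustrative: the `ForwarderWindow` and `Log` types, the inclusive block bounds, reading "next" as the first marker after the event in log order, and the source labels `forwarder` and `transaction_sender` are assumptions, not DALP's internal model or its actual `resolver_source` values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForwarderWindow:
    address: str
    from_block: int
    to_block: Optional[int]  # None means the forwarder is still active

    def active_at(self, block):
        return self.from_block <= block and (self.to_block is None or block <= self.to_block)


@dataclass
class Log:
    log_index: int
    emitter: str
    name: str
    signer: Optional[str] = None  # set only on ExecutedForwardRequest markers


def resolve_signer(event, tx_logs, tx_sender, block, forwarders):
    """Return (signer, source) for one event within a single transaction."""
    for entry in sorted(tx_logs, key=lambda item: item.log_index):
        if entry.log_index <= event.log_index or entry.name != "ExecutedForwardRequest":
            continue
        # Trust the marker only if its emitter was a registered forwarder at this block.
        if any(f.address == entry.emitter and f.active_at(block) for f in forwarders):
            return entry.signer, "forwarder"
    return tx_sender, "transaction_sender"
```

Passing the event's block into the check is what lets a historical backfill trust a rotated-out forwarder for old blocks while ignoring it for later ones.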

Use these counters to inspect ERC-2771 signer attribution:

* `dalp.didx.meta_tx.signer_resolved`: Counts each event with resolved signer attribution. Use the `resolver_source`, `event`, and `contract_type` labels to separate forwarder-attributed events from transaction-sender attribution.
* `dalp.didx.meta_tx.signer_caller_divergence`: Counts events where the resolved signer differs from an on-chain `caller` field. DALP emits this counter for event families that still carry that field, including `TokenBound` and `TokenUnbound`.

Use `signer_resolved` as the baseline for attribution volume. Use `signer_caller_divergence` to investigate signer and caller mismatches on the event families that emit it.

## Application logging configuration [#application-logging-configuration]

Configure application logging through the `config.yml` file. The two settings below control log verbosity and ORPC request tracing. `LOG_LEVEL` takes precedence during auto-configuration. Invalid values are silently ignored; the platform falls back to debug in development, info in production, and warning in test.

| Setting               | Environment variable                  | Default | Description                                                             |
| --------------------- | ------------------------------------- | ------- | ----------------------------------------------------------------------- |
| `app.logLevel`        | `LOG_LEVEL` or `SETTLEMINT_LOG_LEVEL` | `info`  | Minimum log level: `debug`, `info`, `warn`, `warning`, `error`, `fatal` |
| `app.logOrpcRequests` | `LOG_ORPC_REQUESTS`                   | `false` | Enable verbose ORPC request/response logging                            |
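
The precedence and fallback rules above could be modelled as in the sketch below. It is not the platform's code and makes several assumptions: that `LOG_LEVEL` is checked before `SETTLEMINT_LOG_LEVEL`, that an invalid value in one variable falls through to the next, that values are matched case-insensitively, and that the runtime environments are named `development`, `production`, and `test`. The `config.yml` value is left out for brevity.

```python
VALID_LEVELS = {"debug", "info", "warn", "warning", "error", "fatal"}

# Fallback per runtime environment, as described in the text above.
FALLBACK_BY_ENV = {"development": "debug", "production": "info", "test": "warning"}


def resolve_log_level(env, runtime_env):
    """Return the first valid level from the environment, else the runtime fallback."""
    for var_name in ("LOG_LEVEL", "SETTLEMINT_LOG_LEVEL"):  # assumed precedence order
        value = env.get(var_name, "").strip().lower()
        if value in VALID_LEVELS:
            return value
        # Invalid or missing values are skipped silently.
    return FALLBACK_BY_ENV.get(runtime_env, "info")
```

Because invalid values are ignored rather than rejected, a typo such as `LOG_LEVEL=verbose` does not fail startup. Check the effective level in the logs if verbosity is not what you expect.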

### ORPC request logging [#orpc-request-logging]

When `app.logOrpcRequests` is enabled, the platform logs the request ID and URL for each API call. It also records the HTTP method, elapsed time, response status codes, and procedure execution paths.

This setting is disabled by default to keep logs clean in development and production. Enable it for debugging API issues via `config.yml` or by setting the environment variable directly.

```yaml
# config.yml
app:
  logOrpcRequests: true
```

```bash
# Export the variable so the platform process inherits it.
export LOG_ORPC_REQUESTS=true
```

![On-chain transaction monitoring](/docs/screenshots/monitoring/blockchain-monitoring.webp)

## Audit logging [#audit-logging]

Observability data supports audit investigations by preserving operational events and correlation context. The following event types are typically captured: authentication events with outcome and context, authorization decisions with resource and result, data access with query details, configuration changes with before and after state, and administrative operations with operator identity.

Retention duration, export configuration, and tamper-evidence requirements depend on the deployment's logging storage and compliance policy.

## Incident response [#incident-response]

During an incident, start with the signal that paged you, then keep the investigation anchored to the affected environment, namespace, chain ID, request ID, or trace ID.

| Investigation step       | Use this signal                                            | Outcome                                                |
| ------------------------ | ---------------------------------------------------------- | ------------------------------------------------------ |
| Correlate one operation  | Request ID or trace ID                                     | Link logs, metrics, and traces for the same operation. |
| Rebuild the timeline     | Log search with time filters                               | Identify the event sequence before and after impact.   |
| Estimate customer impact | Request volume, error rates, affected services             | Separate isolated failures from platform-wide impact.  |
| Locate failing component | Trace spans, service health, and indexer and RPC snapshots | Focus remediation on the failing component boundary.   |
| Confirm recovery         | Error rate, latency, lag, and snapshot trend changes       | Verify that the same signal returned to baseline.      |
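
The first two steps in the table, correlating one operation and rebuilding the timeline, amount to filtering on a shared identifier and ordering by time. A minimal sketch over exported log entries, assuming hypothetical `request_id`, `ts`, and `component` field names:

```python
def timeline(entries, request_id):
    """Return the log entries for one request, oldest first.

    Timestamps are assumed to be UTC ISO 8601 strings in one format,
    so plain string ordering matches chronological ordering.
    """
    matching = [entry for entry in entries if entry.get("request_id") == request_id]
    return sorted(matching, key=lambda entry: entry["ts"])


def components_involved(entries, request_id):
    """List the components that logged for the request, in order of first appearance."""
    seen = []
    for entry in timeline(entries, request_id):
        if entry["component"] not in seen:
            seen.append(entry["component"])
    return seen
```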

## SIEM and operations handoff [#siem-and-operations-handoff]

DALP observability helps operations teams find the right signal. The deployment's SIEM and incident process remain the authoritative system for escalation decisions, case retention, and incident management. Route selected logs, traces, metrics, or alerts into the organization's monitoring environment when the deployment design requires it.

| Handoff question             | Start in DALP                                          | Continue in the operator environment                                        |
| ---------------------------- | ------------------------------------------------------ | --------------------------------------------------------------------------- |
| Which environment paged us?  | Alert labels, cluster name, namespace, chain ID        | On-call routing, incident ticket, escalation policy                         |
| What request or job failed?  | Request ID, trace ID, log search, failed service       | SIEM correlation, case notes, and identity-provider and network logs        |
| Is the chain path unhealthy? | RPC status, indexer lag, block age, finality lag       | Node-provider status, network monitoring, provider support ticket           |
| Is there regulated impact?   | Audit logs, compliance records, custody-related events | Formal incident record, regulatory evidence pack, retention policy          |
| Has service recovered?       | Error rate, latency, lag, dashboard trend              | Post-incident review, SLA reporting, and restoration and notification proof |

Keep the split explicit during handoff: DALP shows platform telemetry and product evidence. The operator's SIEM, identity provider, custody provider, network provider, and incident system complete the security and regulatory timeline.

## Deployment integration [#deployment-integration]

Self-hosted deployments can use the DALP observability chart for in-cluster telemetry. The chart handles collection and storage in the same deployment, with Grafana for dashboards. If your organization already operates a monitoring platform, use the deployment configuration to decide which telemetry components to enable and where to route their output.

When enabled, the observability chart includes Grafana dashboard configuration for common self-hosted deployments.

## See also [#see-also]

* [Operability](/docs/architects/operability) for the wider production operations model
* [Database](/docs/architects/operability/database) for database monitoring
* [Failure modes](/docs/architects/operability/failure-modes) for recovery behaviour during outages
* [Blockchain monitoring](/docs/developers/operations/blockchain-monitoring) for chain RPC and indexer diagnostics
* [Broadcast](/docs/architects/components/infrastructure/broadcast) for network metrics