Troubleshooting

Service stuck in `not_found`

Symptom: GET /status shows "status":"not_found" and no deploys are attempted.

Cause: Dockward compared the remote digest against the local image and could not resolve the local image. This happens when:

The image has never been pulled on this host
The images entry in config does not match the image name in the compose file
The local registry has the image but Docker’s image store does not

Fix: Verify the image name in images[] matches exactly what docker images shows. If the image is absent locally, push a new digest to the registry — dockward will detect the change and deploy.

Service stuck in `blocked`

Symptom: GET /status shows "status":"blocked". No deploys occur even after pushing a new image.

Cause: A rollback was triggered and dockward blocked the digest that caused it. It will unblock automatically when the remote digest changes (i.e. a new image is pushed).

Fix: Either push a fixed image to the registry, or manually unblock:

curl -sf -X POST localhost:9090/unblock/myapp

Only unblock manually if you have confirmed the current image in the registry is healthy.

Rollback immediately re-blocks

Symptom: Push a new image, deploy starts, rolls back, blocked again.

Cause: The container fails the health check within health_grace seconds.

Fix:

Check container logs: docker logs <container>
Increase health_grace to give the container more startup time
Verify the Docker HEALTHCHECK in the image is correct

Updater redeploys on every poll

Symptom: updated events fire every poll cycle even though the image has not changed.

Cause: The image reference in images[] does not match the locally stored image name, causing the digest lookup to always return a mismatch or fail.

Fix: Run docker images and confirm the name and tag exactly match the value in images[].

Healer restarts a container in a loop

Symptom: Repeated restarted entries in the audit log; service never stabilises.

Cause: The container crashes or stays unhealthy faster than heal_cooldown allows it to be declared exhausted, or heal_max_restarts is set too high.

Fix:

Lower heal_max_restarts (default is 3)
Investigate the container crash: docker logs <container>
Fix the underlying application issue

CPU / memory always shows `--` in the web UI

Symptom: The CPU / Mem column shows -- for all services.

Cause 1: monitor.stats_interval has not elapsed since startup — stats are collected on a separate interval.

Cause 2: The service has no running containers matching the compose_project label.

Cause 3: The service has silent: true.

Fix: Wait one stats_interval cycle, then check GET /status for has_stats: true. If it stays false, verify the compose project name matches docker ps --format '{{.Labels}}' for com.docker.compose.project.

Config validation warnings on startup

Symptom: Dockward starts successfully but logs warnings like:

[config] WARNING: 1 service(s) failed validation and will be skipped:
  - service[18] "otel-collector": compose_file[0] not found: "/srv/observability/docker-compose.yml"

Cause: One or more services have invalid configuration (missing files, invalid paths, etc.).

Impact: The invalid service(s) are skipped and not monitored. Valid services continue operating normally.

Fix:

Check the warning message for the specific reason
Fix the configuration issue:
- Ensure compose files exist at the specified paths
- Verify paths are absolute (not relative)
- Check file permissions
Restart dockward: sudo systemctl restart dockward
Verify no warnings: journalctl -u dockward -n 50

Monitoring: Check GET /health endpoint - it includes a config_warnings array listing all skipped services:

curl -s http://localhost:9090/health | jq .config_warnings

Note: As of v1.0.0-alpha.9, service-level validation errors are non-fatal. Only global errors (invalid runtime) cause startup failure.

Systemd service will not start

Symptom: systemctl status dockward shows failed immediately.

Cause: Config file parse error or global validation failure.

Fix: Run dockward directly to see the error:

/usr/local/bin/dockward -config /etc/dockward/config.json

Common causes:

JSON syntax error (trailing comma, missing brace)
Invalid runtime (must be "docker" or "podman")
Invalid api.port (must be 1-65535)
Invalid global settings (poll_interval, docker_health timing)

Note: Service-level errors (missing compose files, invalid service config) no longer cause startup failure - they are logged as warnings instead.

Audit log not written

Symptom: GET /audit returns []. No file appears at the configured path.

Cause: audit.path is empty or the directory does not exist.

Fix:

sudo mkdir -p /var/log/dockward

Verify audit.path in the config file points to a writable location.

Warden: agent entries not appearing

Symptom: Events from one or more agents are absent from the warden dashboard.

Cause: Push is misconfigured or the warden is unreachable from the agent.

Fix:

Check agent logs for push to warden failed. Then verify:

push.warden_url is reachable from the agent host
push.token matches the corresponding agents[].token in warden config
The warden is running: curl -s https://warden.example.com/health

Warden: agent shown as offline

Symptom: The warden dashboard shows an agent as offline.

Cause: The warden heartbeat polls GET <agents[].url>/health every 30 seconds. The agent API binds to 127.0.0.1 — it is not reachable from another host on its default address.

Fix: Set agents[].url to an internal network address that the warden host can reach, not 127.0.0.1.

SSE stream disconnects (warden or agent UI)

Symptom: The event feed in the web UI stops updating. Browser console shows connection errors.

Cause: A reverse proxy (nginx) is buffering the SSE response.

Fix: Add proxy_buffering off to the nginx location block, or ensure the X-Accel-Buffering: no header passes through from dockward to the browser.

Troubleshooting

Service stuck in not_found

Service stuck in blocked

Rollback immediately re-blocks

Updater redeploys on every poll

Healer restarts a container in a loop

CPU / memory always shows -- in the web UI

Config validation warnings on startup

Systemd service will not start

Audit log not written

Warden: agent entries not appearing

Warden: agent shown as offline

SSE stream disconnects (warden or agent UI)

Service stuck in `not_found`

Service stuck in `blocked`

CPU / memory always shows `--` in the web UI