Docker Daemon Health Checks

Dockward continuously monitors Docker daemon connectivity to ensure reliable operations and provide visibility when Docker becomes unavailable.

Overview

The Docker health checker:

Pings the Docker daemon - Uses the /_ping endpoint for lightweight connectivity checks
Tracks health status - Maintains connection state and failure counts
Exposes metrics - Reports health status via Prometheus metrics
Updates health endpoint - Reflects Docker status in /health API response

Configuration

Configure health check intervals in your config.json:

{
  "docker_health": {
    "check_interval": 30,
    "timeout": 5
  }
}

Parameters

Parameter	Type	Default	Description
`check_interval`	int	30	Seconds between health checks (min: 5, max: 3600)
`timeout`	int	5	Timeout for each ping request (min: 1, max: 30)

Validation: timeout must be less than check_interval

Health Check Behavior

Startup

First check runs immediately on startup
Subsequent checks run at configured intervals
Initial state is unhealthy until first successful check

Success

When a ping succeeds:

Health status set to healthy
Consecutive failure count reset to 0
Docker version and API version recorded
Metrics updated

Failure

When a ping fails:

Health status set to unhealthy
Consecutive failure count incremented
Error message recorded
Metrics updated

Common Failure Scenarios

Docker daemon not running

Error: dial unix /var/run/docker.sock: connect: no such file or directory

Permission denied

Error: dial unix /var/run/docker.sock: connect: permission denied

Timeout
```
Error: context deadline exceeded
```

Health Endpoint Integration

The /health endpoint returns Docker daemon status:

Healthy Response (200 OK)

{
  "status": "healthy",
  "components": {
    "http": {
      "healthy": true
    },
    "docker": {
      "healthy": true,
      "last_check": "2026-03-03T10:30:00Z",
      "last_healthy_check": "2026-03-03T10:30:00Z",
      "consecutive_fails": 0,
      "docker_version": "Docker/24.0.7 (linux)",
      "api_version": "1.45"
    }
  },
  "config_warnings": [
    "service[18] \"otel-collector\": compose_file[0] not found: \"/srv/observability/docker-compose.yml\""
  ]
}

Note: The config_warnings field appears only when services failed validation during config load. These services are skipped and not monitored.

Unhealthy Response (503 Service Unavailable)

{
  "status": "unhealthy",
  "components": {
    "http": {
      "healthy": true
    },
    "docker": {
      "healthy": false,
      "last_check": "2026-03-03T10:30:15Z",
      "last_healthy_check": "2026-03-03T10:29:45Z",
      "consecutive_fails": 3,
      "last_error": "dial unix /var/run/docker.sock: connect: no such file or directory"
    }
  }
}

Prometheus Metrics

The health checker exposes three metrics via /metrics:

`docker_daemon_healthy`

Type: Gauge
Values: 1 (healthy) or 0 (unhealthy)

Indicates current Docker daemon connectivity status.

docker_daemon_healthy 1

`docker_daemon_consecutive_failures`

Type: Gauge
Values: 0-N

Number of consecutive failed health checks. Resets to 0 on successful check.

docker_daemon_consecutive_failures 0

`docker_daemon_checks_total`

Type: Counter
Values: Cumulative count

Total number of health checks performed since startup.

docker_daemon_checks_total 120

Monitoring and Alerting

Prometheus Alert Rules

Example alert when Docker becomes unavailable:

groups:
  - name: dockward
    rules:
      - alert: DockerDaemonDown
        expr: docker_daemon_healthy == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Docker daemon is unavailable"
          description: "Docker daemon health check has failed for 1 minute ({{ $value }} consecutive failures)"

      - alert: DockerDaemonFlapping
        expr: rate(docker_daemon_checks_total[5m]) - rate(docker_daemon_checks_total[5m] offset 5m) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Docker daemon is flapping"
          description: "Docker daemon health is unstable"

Monitoring Health Endpoint

Use /health for external monitoring (Kubernetes, load balancers):

# Simple check
curl http://localhost:9090/health

# Check exit code
curl -f http://localhost:9090/health || echo "Unhealthy"

# Parse JSON status
curl -s http://localhost:9090/health | jq -r '.status'

Operational Notes

When Docker is Unavailable

While Docker is unhealthy:

Updater: Cannot list containers, inspect images, or check for updates
Healer: Cannot restart containers or receive health events
Monitor: Cannot collect container resource stats
API: Continues serving HTTP (health endpoint returns 503)

Recommendation: Monitor docker_daemon_healthy and alert operators when Docker is down.

Recovery

When Docker recovers:

Health status automatically updates on next successful check
All components resume normal operation
No manual intervention required

Systemd Health Check

Integrate with systemd watchdog:

[Service]
WatchdogSec=60
ExecStartPost=/bin/sh -c 'while ! curl -sf http://localhost:9090/health > /dev/null; do sleep 1; done'

Docker Socket Permissions

Ensure dockward has access to Docker socket:

# Add user to docker group
sudo usermod -aG docker dockward

# Or adjust socket permissions (not recommended)
sudo chmod 666 /var/run/docker.sock

Troubleshooting

Health Checks Always Fail

Symptoms: docker_daemon_healthy always 0, consecutive_fails increasing

Causes:

Docker daemon not running

sudo systemctl status docker
sudo systemctl start docker

Socket path mismatch (default: /var/run/docker.sock)
```
ls -l /var/run/docker.sock
```

Permission denied

groups $(whoami)  # Should include 'docker'

High Check Interval

Symptoms: Long delay before detecting Docker failure

Solution: Reduce check_interval in config:

{
  "docker_health": {
    "check_interval": 10,
    "timeout": 3
  }
}

Trade-off: More frequent checks increase CPU overhead (minimal impact)

Timeouts

Symptoms: Health checks timeout even though Docker is running

Causes:

Docker daemon overloaded
Slow disk I/O
System resource exhaustion

Solution: Increase timeout or investigate Docker performance:

# Check Docker daemon logs
journalctl -u docker -n 100

# Monitor system resources
top
iostat -x 1

Best Practices

Set appropriate intervals
- Production: 30s check interval, 5s timeout
- Development: 10s check interval, 3s timeout
Monitor the metrics
- Alert on docker_daemon_healthy == 0 for >1 minute
- Track consecutive_failures for flapping detection
Use /health for readiness
- Return 503 when Docker is unavailable
- Prevents routing traffic to unhealthy instances
Log review
- Check logs when health degrades
- Look for permission or connectivity issues
Test recovery
- Stop Docker daemon: sudo systemctl stop docker
- Verify health status changes to unhealthy
- Start Docker: sudo systemctl start docker
- Verify automatic recovery