Untitled
Last updated: March 4, 2026
Docker Daemon Health Checks
Dockward continuously monitors Docker daemon connectivity to ensure reliable operations and provide visibility when Docker becomes unavailable.
Overview
The Docker health checker:
- Pings the Docker daemon - Uses the
/_pingendpoint for lightweight connectivity checks - Tracks health status - Maintains connection state and failure counts
- Exposes metrics - Reports health status via Prometheus metrics
- Updates health endpoint - Reflects Docker status in
/healthAPI response
Configuration
Configure health check intervals in your config.json:
{
"docker_health": {
"check_interval": 30,
"timeout": 5
}
}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
check_interval |
int | 30 | Seconds between health checks (min: 5, max: 3600) |
timeout |
int | 5 | Timeout for each ping request (min: 1, max: 30) |
Validation: timeout must be less than check_interval
Health Check Behavior
Startup
- First check runs immediately on startup
- Subsequent checks run at configured intervals
- Initial state is unhealthy until first successful check
Success
When a ping succeeds:
- Health status set to
healthy - Consecutive failure count reset to 0
- Docker version and API version recorded
- Metrics updated
Failure
When a ping fails:
- Health status set to
unhealthy - Consecutive failure count incremented
- Error message recorded
- Metrics updated
Common Failure Scenarios
-
Docker daemon not running
Error: dial unix /var/run/docker.sock: connect: no such file or directory -
Permission denied
Error: dial unix /var/run/docker.sock: connect: permission denied -
Timeout
Error: context deadline exceeded
Health Endpoint Integration
The /health endpoint returns Docker daemon status:
Healthy Response (200 OK)
{
"status": "healthy",
"components": {
"http": {
"healthy": true
},
"docker": {
"healthy": true,
"last_check": "2026-03-03T10:30:00Z",
"last_healthy_check": "2026-03-03T10:30:00Z",
"consecutive_fails": 0,
"docker_version": "Docker/24.0.7 (linux)",
"api_version": "1.45"
}
},
"config_warnings": [
"service[18] \"otel-collector\": compose_file[0] not found: \"/srv/observability/docker-compose.yml\""
]
}
Note: The config_warnings field appears only when services failed validation during config load. These services are skipped and not monitored.
Unhealthy Response (503 Service Unavailable)
{
"status": "unhealthy",
"components": {
"http": {
"healthy": true
},
"docker": {
"healthy": false,
"last_check": "2026-03-03T10:30:15Z",
"last_healthy_check": "2026-03-03T10:29:45Z",
"consecutive_fails": 3,
"last_error": "dial unix /var/run/docker.sock: connect: no such file or directory"
}
}
}
Prometheus Metrics
The health checker exposes three metrics via /metrics:
docker_daemon_healthy
Type: Gauge
Values: 1 (healthy) or 0 (unhealthy)
Indicates current Docker daemon connectivity status.
docker_daemon_healthy 1
docker_daemon_consecutive_failures
Type: Gauge
Values: 0-N
Number of consecutive failed health checks. Resets to 0 on successful check.
docker_daemon_consecutive_failures 0
docker_daemon_checks_total
Type: Counter
Values: Cumulative count
Total number of health checks performed since startup.
docker_daemon_checks_total 120
Monitoring and Alerting
Prometheus Alert Rules
Example alert when Docker becomes unavailable:
groups:
- name: dockward
rules:
- alert: DockerDaemonDown
expr: docker_daemon_healthy == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Docker daemon is unavailable"
description: "Docker daemon health check has failed for 1 minute ({{ $value }} consecutive failures)"
- alert: DockerDaemonFlapping
expr: rate(docker_daemon_checks_total[5m]) - rate(docker_daemon_checks_total[5m] offset 5m) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Docker daemon is flapping"
description: "Docker daemon health is unstable"
Monitoring Health Endpoint
Use /health for external monitoring (Kubernetes, load balancers):
# Simple check
curl http://localhost:9090/health
# Check exit code
curl -f http://localhost:9090/health || echo "Unhealthy"
# Parse JSON status
curl -s http://localhost:9090/health | jq -r '.status'
Operational Notes
When Docker is Unavailable
While Docker is unhealthy:
- Updater: Cannot list containers, inspect images, or check for updates
- Healer: Cannot restart containers or receive health events
- Monitor: Cannot collect container resource stats
- API: Continues serving HTTP (health endpoint returns 503)
Recommendation: Monitor docker_daemon_healthy and alert operators when Docker is down.
Recovery
When Docker recovers:
- Health status automatically updates on next successful check
- All components resume normal operation
- No manual intervention required
Systemd Health Check
Integrate with systemd watchdog:
[Service]
WatchdogSec=60
ExecStartPost=/bin/sh -c 'while ! curl -sf http://localhost:9090/health > /dev/null; do sleep 1; done'
Docker Socket Permissions
Ensure dockward has access to Docker socket:
# Add user to docker group
sudo usermod -aG docker dockward
# Or adjust socket permissions (not recommended)
sudo chmod 666 /var/run/docker.sock
Troubleshooting
Health Checks Always Fail
Symptoms: docker_daemon_healthy always 0, consecutive_fails increasing
Causes:
-
Docker daemon not running
sudo systemctl status docker sudo systemctl start docker -
Socket path mismatch (default:
/var/run/docker.sock)ls -l /var/run/docker.sock -
Permission denied
groups $(whoami) # Should include 'docker'
High Check Interval
Symptoms: Long delay before detecting Docker failure
Solution: Reduce check_interval in config:
{
"docker_health": {
"check_interval": 10,
"timeout": 3
}
}
Trade-off: More frequent checks increase CPU overhead (minimal impact)
Timeouts
Symptoms: Health checks timeout even though Docker is running
Causes:
- Docker daemon overloaded
- Slow disk I/O
- System resource exhaustion
Solution: Increase timeout or investigate Docker performance:
# Check Docker daemon logs
journalctl -u docker -n 100
# Monitor system resources
top
iostat -x 1
Best Practices
-
Set appropriate intervals
- Production: 30s check interval, 5s timeout
- Development: 10s check interval, 3s timeout
-
Monitor the metrics
- Alert on
docker_daemon_healthy == 0for >1 minute - Track
consecutive_failuresfor flapping detection
- Alert on
-
Use /health for readiness
- Return 503 when Docker is unavailable
- Prevents routing traffic to unhealthy instances
-
Log review
- Check logs when health degrades
- Look for permission or connectivity issues
-
Test recovery
- Stop Docker daemon:
sudo systemctl stop docker - Verify health status changes to unhealthy
- Start Docker:
sudo systemctl start docker - Verify automatic recovery
- Stop Docker daemon: