Docker container monitoring: complete guide 2025
The definitive guide to Docker container monitoring in 2025. What metrics matter, how to set up alerts, how to debug crashes, and which tools to use.
Docker container monitoring is the practice of continuously observing the health, performance, and availability of your containerized workloads so you know when something breaks before your users do. This guide covers everything — what metrics actually matter, how push vs pull monitoring works, how to set up alerts that don't wake you up for nothing, and how to debug a container that's misbehaving in production.
If you're new to Docker monitoring, start at the top. If you're specifically looking for debugging techniques or tool comparisons, jump to the relevant section.
What metrics actually matter
Not all metrics are equal. Here's what to focus on for Docker containers:
CPU usage
Measured as a percentage of the allocated CPU. A container running at 95% CPU for hours is either:
- Underpowered: needs more CPU allocated
- Runaway: a spinning goroutine, an infinite loop, or a bad query
- Under attack: cryptomining malware or a parsing exploit
Track it. Alert when sustained CPU exceeds ~85% for more than 5 minutes (not on brief spikes — those are normal during startup or heavy requests).
Memory usage
Track both memory used and memory limit. The dangerous pattern isn't high absolute memory usage — it's memory usage that grows steadily over time with no ceiling. That's a memory leak.
The critical event to track is an OOM kill — when a container exceeds its memory limit and the Linux kernel forcibly kills it (exit code 137). We'll cover this in depth in the OOM kills section.
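One way to turn "grows steadily over time" into a number is to fit a straight line through recent memory samples and look at the slope. The sketch below assumes you already collect `(unix_seconds, used_mib)` pairs from somewhere; the function names and the 5 MiB/hour cutoff are illustrative, not a recommendation.

```python
def memory_growth_mib_per_hour(samples):
    """Least-squares slope of memory usage, in MiB per hour.

    samples: list of (unix_seconds, used_mib) tuples, at least two,
    taken at different times.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    return covariance / variance * 3600


def looks_like_leak(samples, min_growth_mib_per_hour=5.0):
    """True when memory climbs faster than the (assumed) cutoff."""
    return memory_growth_mib_per_hour(samples) > min_growth_mib_per_hour
```

A slope over several hours is far less noisy than comparing two points. A cache warming up will also show a positive slope at first, so judge it over a window longer than your service's warm-up time.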
Restart count
Docker tracks how many times a container has restarted. A restart count that's ticking up is your first signal that something is wrong. By the time you notice it in docker ps, the container may have restarted 20 times. Alert at 3-5 restarts.
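The restart count is available as the top-level `RestartCount` field of `docker inspect` (for example `docker inspect --format '{{.RestartCount}}' my-container`). A minimal sketch of a windowed check, assuming you sample that value periodically and keep `(unix_seconds, restart_count)` pairs; the function name and sample layout are hypothetical:

```python
def restarts_in_window(samples, window_seconds, now):
    """How many restarts happened within the last `window_seconds`.

    samples: list of (unix_seconds, restart_count), oldest first.
    """
    recent = [count for ts, count in samples if now - ts <= window_seconds]
    if len(recent) < 2:
        return 0
    # The counter starts over when a container is recreated,
    # so never report a negative increase.
    return max(0, recent[-1] - recent[0])
```

Alerting on the increase inside a window, rather than on the absolute count, avoids paging for a container that restarted a few times last month and has been stable since.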
Container status
The binary "up/down" isn't enough. A container can be:
- running: normal
- exited: stopped, could be intentional or a crash
- restarting: Docker's restart policy is looping, trying to bring it back
- paused: suspended (rare in production)
- dead: Docker can't manage it; requires manual intervention
The exit code tells you why it stopped. More on this below.
Network I/O
Track bytes received and bytes transmitted per container. Spikes in rx bytes can indicate a DDoS or scraper. Spikes in tx bytes can indicate data exfiltration or a runaway notification system. Flat-zero rx/tx when you expect activity means your container is isolated (networking issue, crashed service upstream).
Disk I/O (optional)
Unless your container does significant disk work (databases, log-heavy services), disk I/O is less critical than CPU and memory. When a container is thrashing disk, it usually shows up as high I/O wait (iowait) and slow response times before you see disk metrics.
OOM kills: the most misunderstood Docker metric
An OOM kill (Out Of Memory kill) happens when:
- A container's process tries to allocate more memory than its configured limit
- The Linux kernel's OOM killer steps in and kills the process
- Docker records exit code 137 (128 + 9, where 9 is SIGKILL)
This is distinct from a normal exit (code 0) or an application crash (a non-zero code other than 137, most often 1). OOM kills are silent by default: Docker doesn't alert you, it just restarts the container (if your restart policy says to). If your container is being OOM-killed multiple times per hour, you'll see the restart count climbing but the logs might not tell you why.
How to detect an OOM kill manually
# Check exit code of a stopped container
docker inspect my-container | grep -E '"ExitCode"|"OOMKilled"'
# Output:
# "ExitCode": 137,
# "OOMKilled": true,
# See recent system-level OOM messages
dmesg | grep -i "oom"
# Output:
# [1234567.123456] Out of memory: Kill process 12345 (node) score 987 or sacrifice child
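The `grep` above is fine for a quick look. If you want the same check in a script, `docker inspect` prints a JSON array, so it can be parsed directly. This is a small sketch; the function name is hypothetical, and it reads only the `Name`, `State.ExitCode`, and `State.OOMKilled` fields.

```python
import json


def exit_states(inspect_json):
    """Return (name, exit_code, oom_killed) for each inspected container.

    inspect_json: the raw text printed by `docker inspect <container>...`.
    """
    results = []
    for container in json.loads(inspect_json):
        state = container["State"]
        # Docker reports names with a leading slash, e.g. "/my-container".
        name = container["Name"].lstrip("/")
        results.append((name, state["ExitCode"], state["OOMKilled"]))
    return results
```

You could feed it with `subprocess.run(["docker", "inspect", "my-container"], capture_output=True, text=True).stdout`.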
Common OOM kill causes
- Memory limit set too low: the container needs more than you gave it. Solution: increase the memory limit in your docker-compose.yml:
services:
  api:
    image: myapp:latest
    deploy:
      resources:
        limits:
          memory: 512m # Increase this if OOM killing
        reservations:
          memory: 256m
- Memory leak: the application isn't releasing memory it should. The memory limit is actually protecting you here by capping the blast radius. But you need to fix the leak.
- Temporary spike: some workloads need more memory during cold start or while processing large payloads. Consider if a higher limit or horizontal scaling is the answer.
Push-based vs pull-based monitoring
Understanding this architecture distinction helps you choose the right tool.
Pull-based (Prometheus model)
The monitoring system periodically sends HTTP requests to your services asking "what are your current metrics?" Your services must expose an HTTP endpoint (a "metrics endpoint") that responds with metrics in a specific format.
Pros:
- The monitoring system controls the scrape interval
- Easy to see which targets are up/down (if scraping fails, the target is probably down)
- Well-suited for service discovery in dynamic environments (k8s)
Cons:
- Requires network access from monitoring system to every monitored host
- Every container must expose a metrics endpoint (usually requires adding a library or exporter)
- If the network between monitoring and targets is down, you have a blind spot
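To make the "metrics endpoint" idea concrete, here is a minimal sketch of a service exposing `/metrics` in the Prometheus text format, using only the Python standard library. The metric names, the `current_values` helper, and port 8000 are made up for illustration; a real service would normally use a Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def current_values():
    # Hypothetical application metrics; a real service reads live values here.
    return {"myapp_queue_depth": 3, "myapp_open_connections": 12}


def render_metrics(values):
    """Render {metric_name: number} as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(current_values()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The service never decides when metrics are sent. It only answers when the monitoring system asks, which is why the scraper needs network access to every target.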
Push-based (Kernus model)
Your services proactively send metrics to a central collection point. The agent running on your host pushes container metrics to the cloud.
Pros:
- Works behind firewalls and NAT — the agent only needs outbound internet access
- No changes to your containers — the agent reads Docker's API directly
- Monitoring continues even if the central server has a brief outage (agent buffers)
- Much simpler firewall rules: outbound 443 only
Cons:
- You don't know a host is down until you stop receiving its metrics (requires "heartbeat" logic)
- The monitoring service must handle pushes at scale
Most SaaS monitoring tools (including Kernus) use push-based collection because it works in more network environments without firewall changes. Prometheus uses pull-based collection because it was designed for dynamic, service-discovered environments where the monitoring system can reach every target over the internal network. It predates widespread Kubernetes adoption but is now the default choice there.
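The "heartbeat" logic a push-based system needs can be very small: remember when each host last reported, and flag any host that has been silent for too long. A sketch, assuming agents push roughly every 30 seconds and that three missed pushes (90 seconds) count as down; both numbers and the function name are assumptions for illustration.

```python
def stale_hosts(last_seen, now, max_silence_seconds=90):
    """Hosts whose most recent push is older than the allowed silence.

    last_seen: {host_name: unix_seconds of the last received push}
    """
    return sorted(
        host
        for host, seen_at in last_seen.items()
        if now - seen_at > max_silence_seconds
    )
```

The silence limit should be a multiple of the push interval, so one delayed or dropped push doesn't raise a false "host down" alert.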
Setting up alerts that don't wake you up for nothing
Alert fatigue is real. If your monitoring sends an alert every time CPU spikes for 2 seconds, you'll start ignoring alerts — which defeats the purpose entirely. Good alerting has two properties: high recall (catches real problems) and high precision (doesn't fire on false positives).
Duration-based alerting
Don't alert on instantaneous values. Alert on sustained values. A container at 98% CPU for 3 minutes is a problem. A container at 98% CPU for 3 seconds is a normal startup spike.
Alert: CPU above 85% for 5 consecutive minutes
Alert: Memory above 80% for 10 consecutive minutes
Alert: Restart count increased by 3 in the last 30 minutes
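A rule like "CPU above 85% for 5 consecutive minutes" boils down to remembering when the current breach started. The class below is a minimal sketch of that idea; the class and method names are hypothetical and it is not tied to any particular monitoring tool.

```python
class SustainedThreshold:
    """Fires only after a value stays above a threshold for a full duration."""

    def __init__(self, threshold, duration_seconds):
        self.threshold = threshold
        self.duration_seconds = duration_seconds
        self._breach_started = None  # timestamp of the first sample above threshold

    def observe(self, timestamp, value):
        """Record one sample; return True when the alert should fire."""
        if value <= self.threshold:
            # Any sample at or below the threshold resets the clock.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.duration_seconds
```

For the first rule above you would create `SustainedThreshold(85, 300)` and call `observe()` on every CPU sample. A 3-second startup spike resets on the next normal sample and never fires.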
Separate alerts by severity
Not everything needs to wake someone up at 3 AM. Structure your alerts:
| Severity | Trigger | Channel |
|---|---|---|
| P0 (Page) | Container down, OOM kill loop | Phone call / SMS |
| P1 (Urgent) | High restart count, sustained 90%+ CPU | Slack + SMS |
| P2 (Warning) | Memory leak pattern, approaching limits | Slack |
| P3 (Info) | Unusual network activity, first restart | Email digest |
Container-specific thresholds
A database container at 60% memory is normal. A stateless API at 60% memory might indicate a leak. Don't use the same thresholds for every container — tune them based on what you know about each service's behavior.
Debugging a container in production
When something is wrong, here's a systematic approach:
Step 1: Check current status
# See all containers and their state
docker ps -a
# Key columns: STATUS, CREATED, PORTS
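# Look for: "Restarting", "Exited (137)" in the STATUS column
# docker ps has no restart-count column; read the count with:
docker inspect --format '{{.RestartCount}}' <container_name>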
Step 2: Read the exit code
docker inspect <container_name> --format='{{.State.ExitCode}} {{.State.OOMKilled}}'
# "0 false" → clean stop (intentional)
# "1 false" → application crash (read logs)
# "137 true" → OOM kill (increase memory limit)
# "137 false" → SIGKILL from outside (deployment? manual kill?)
# "143 false" → SIGTERM (graceful shutdown, likely intentional)
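Exit codes above 128 follow the convention 128 + signal number, so the table above can be expressed as a small function. This is a sketch for POSIX systems; the function name and the wording of the returned labels are illustrative.

```python
import signal


def describe_exit(exit_code, oom_killed):
    """Turn State.ExitCode and State.OOMKilled into a short explanation."""
    if oom_killed:
        return "OOM kill"
    if exit_code == 0:
        return "clean exit"
    if exit_code > 128:
        number = exit_code - 128
        try:
            return f"killed by {signal.Signals(number).name}"
        except ValueError:
            # Not a signal number this platform knows about.
            return f"killed by signal {number}"
    return f"application error (exit code {exit_code})"
```

Checking `oom_killed` first matters: 137 on its own only says the process received SIGKILL, not who sent it.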
Step 3: Read the logs
# Last 100 lines of logs
docker logs --tail 100 <container_name>
# Follow logs in real time
docker logs -f <container_name>
# Logs from a specific time window (--since sets the start, --until the end)
docker logs --since 2025-01-15T14:30:00 --until 2025-01-15T15:00:00 <container_name>
Look for:
- Stack traces (the line before the panic/exception is usually the cause)
- "connection refused" (dependency not ready)
- "out of memory" or "cannot allocate" (memory issue)
- "address already in use" (port conflict)
- "permission denied" (volume mount issue or file permissions)
Step 4: Check resource usage
# Real-time stats for all containers
docker stats
# One-shot snapshot
docker stats --no-stream
If a container is at 100% CPU but producing no output, it's likely stuck in an infinite loop or busy-waiting on something.
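If you'd rather script this check than read the table by eye, `docker stats --no-stream --format '{{json .}}'` prints one JSON object per container. The sketch below parses those lines and lists containers above a CPU or memory percentage. It assumes the `Name`, `CPUPerc`, and `MemPerc` fields carry values like `"97.50%"`; the function names and default limits are illustrative.

```python
import json


def parse_stats_line(line):
    """Parse one JSON line of `docker stats` output into plain numbers."""
    row = json.loads(line)
    return {
        "name": row["Name"],
        "cpu_percent": float(row["CPUPerc"].rstrip("%")),
        "mem_percent": float(row["MemPerc"].rstrip("%")),
    }


def hot_containers(lines, cpu_limit=85.0, mem_limit=80.0):
    """Names of containers over either limit in this snapshot."""
    hot = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        stats = parse_stats_line(line)
        if stats["cpu_percent"] > cpu_limit or stats["mem_percent"] > mem_limit:
            hot.append(stats["name"])
    return hot
```

A single snapshot only shows the current moment. Treat a hit as a reason to look closer, not as an alert by itself.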
Step 5: Check dependencies
Many container crashes aren't about the container itself — they're about a dependency that isn't ready:
# Check if your database is accessible from inside the container (requires nc in the image)
docker exec <container_name> nc -zv database-host 5432
# Check if an external URL is reachable
docker exec <container_name> curl -sf https://api.example.com/health
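Slim images often ship without `nc` or `curl`. If the image has a Python interpreter, the same reachability check can be done with the standard library. A minimal sketch; the function name is hypothetical and `database-host` is the same placeholder used above.

```python
import socket


def port_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `port_open("database-host", 5432)` answers the same question as the `nc -zv` command. Like `nc`, it only proves the port accepts connections, not that the database is ready to serve queries.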
Tool options for Docker monitoring
cAdvisor (free, self-hosted)
Google's cAdvisor is a standalone container that exposes Docker metrics. Run it, hit its web UI on port 8080, see basic container CPU/memory/network.
- Pros: Free, simple, no external dependencies
- Cons: No alerting, no persistent history, no multi-host, UI is basic
Best for: dev environments, quick inspection.
Prometheus + Grafana (free, complex)
As covered in our Prometheus guide, this is the industrial-grade self-hosted option. Powerful but requires significant setup and maintenance.
- Pros: Extremely flexible, huge ecosystem, handles k8s well
- Cons: 5+ containers to maintain, 3-5 hours setup, ongoing maintenance
Best for: large teams with a dedicated platform engineer.
Netdata (free tier available, self-hosted)
Netdata is a high-resolution (per-second) monitoring tool that runs on each host. Easy to install, built-in dashboards, some alerting built in.
- Pros: Easy install, beautiful dashboards, very granular metrics
- Cons: Primarily per-host (multi-host requires their cloud), free tier limits apply
Best for: teams that want detailed per-host visibility and are comfortable with self-hosted.
Datadog (paid, hosted)
Industry-standard enterprise monitoring. APM, log management, 800+ integrations, ML anomaly detection.
- Pros: Most comprehensive platform, excellent k8s support
- Cons: Expensive ($200-500+/month for small teams), complex pricing
Best for: large teams with budget, need for APM or extensive integrations.
Kernus (paid, hosted, Docker-focused)
Purpose-built for Docker monitoring. Two commands to set up, automatic container discovery, alerts on all channels.
- Pros: Fastest setup (2 minutes), includes status page + uptime badges + digest emails, flat pricing
- Cons: Docker-only (no k8s), no APM, no custom metrics from your application
Best for: small-to-mid teams with Docker-first infrastructure who want monitoring without the operational overhead.
The minimum viable monitoring setup
If you do nothing else, do these three things:
- Alert when a container stops unexpectedly — set up at least an email alert when any production container enters an exited state
- Alert on high restart counts — a container restarting 5 times is a problem, 20 times is a crisis
- Track OOM kills — even if you don't alert on them immediately, log them so you know when to increase memory limits
Everything beyond this is additive. Start here.
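As a starting point for the first and third items, you can follow Docker's event stream and react to `die` and `oom` events. The sketch below assumes the JSON shape printed by `docker events --format '{{json .}}'`, where the container name and exit code sit under `Actor.Attributes`; the function names are hypothetical, and what you do with each event (email, log line) is left out.

```python
import json
import subprocess


def summarize_event(line):
    """Extract the fields worth alerting on from one JSON event line."""
    event = json.loads(line)
    attributes = event.get("Actor", {}).get("Attributes", {})
    exit_code = attributes.get("exitCode")  # present on "die" events, as a string
    return {
        "action": event.get("Action"),
        "container": attributes.get("name"),
        "exit_code": int(exit_code) if exit_code is not None else None,
    }


def watch():
    """Yield a summary for every container die/oom event until interrupted."""
    command = [
        "docker", "events",
        "--filter", "type=container",
        "--filter", "event=die",
        "--filter", "event=oom",
        "--format", "{{json .}}",
    ]
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        for line in process.stdout:
            if line.strip():
                yield summarize_event(line)
```

A `die` event also fires for intentional stops, so filter on the exit code (or on which containers you expect to stay up) before sending anything to a human.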
For a deep dive on one specific problem: OOM kills in Docker — how to detect and prevent them. For setting up alert channels specifically: How to set up Docker container alerts for Slack, Discord, and Telegram.