February 5, 2025 · 10 min read

Docker container monitoring: complete guide 2025

The definitive guide to Docker container monitoring in 2025. What metrics matter, how to set up alerts, how to debug crashes, and which tools to use.

Tags: docker, monitoring, containers, devops, guide

Docker container monitoring is the practice of continuously observing the health, performance, and availability of your containerized workloads so you know when something breaks before your users do. This guide covers which metrics actually matter, how push and pull monitoring differ, how to set up alerts that don't wake you up for nothing, and how to debug a container that's misbehaving in production.

If you're new to Docker monitoring, start at the top. If you're specifically looking for debugging techniques or tool comparisons, jump to the relevant section.

What metrics actually matter

Not all metrics are equal. Here's what to focus on for Docker containers:

CPU usage

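Measured as a percentage of the allocated CPU. (Note that docker stats reports CPU relative to a single core, so a container using several cores can show more than 100%.) A container running at 95% CPU for hours is either: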

Underpowered — needs more CPU allocated
Leaking — a runaway goroutine, infinite loop, or bad query
Under attack — cryptomining malware or a parsing exploit

Track it. Alert when sustained CPU exceeds ~85% for more than 5 minutes (not on brief spikes — those are normal during startup or heavy requests).

Memory usage

Track both memory used and memory limit. The dangerous pattern isn't high absolute memory usage — it's memory usage that grows steadily over time with no ceiling. That's a memory leak.

The critical event to track is an OOM kill — when a container exceeds its memory limit and the Linux kernel forcibly kills it (exit code 137). We'll cover this in depth in the OOM kills section.
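One way to turn "grows steadily over time" into something you can alert on is to fit a straight line through recent memory samples and look at the slope. The sketch below is an illustration, not how any particular tool does it; the function names and the (timestamp, bytes) sample format are assumptions.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, memory used in bytes)


def growth_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0  # all samples share one timestamp
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def seconds_until_limit(samples: Sequence[Sample], limit_bytes: float) -> Optional[float]:
    """Rough time until the memory limit is hit; None if usage is not growing."""
    slope = growth_rate(samples)
    if slope <= 0:
        return None
    latest = samples[-1][1]
    return max(0.0, (limit_bytes - latest) / slope)
```

A positive slope over a few minutes is normal during warm-up, so this kind of check is most useful over hours of samples, where a healthy service should flatten out.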

Restart count

Docker tracks how many times a container has restarted (the RestartCount field in docker inspect). A restart count that's ticking up is your first signal that something is wrong. By the time you notice a suspiciously short uptime in docker ps, the container may have restarted 20 times. Alert at 3-5 restarts.

Container status

The binary "up/down" isn't enough. A container can be:

running — normal
exited — stopped, could be intentional or crash
restarting — Docker's restart policy is looping trying to bring it back
paused — suspended (rare in production)
dead — Docker can't manage it, requires manual intervention

The exit code tells you why it stopped. More on this below.

Network I/O

Track bytes received and bytes transmitted per container. Spikes in rx bytes can indicate a DDoS or scraper. Spikes in tx bytes can indicate data exfiltration or a runaway notification system. Flat-zero rx/tx when you expect activity means your container is isolated (networking issue, crashed service upstream).
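Docker reports network traffic as cumulative byte counters, so spotting a spike or a flat line means converting two readings into a rate first. The following is a small sketch of that step; the spike_factor default and the idea of comparing against a known baseline are assumptions for illustration, not fixed rules.

```python
def byte_rate(prev_bytes: int, curr_bytes: int, interval_seconds: float) -> float:
    """Bytes per second between two cumulative counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # The counter went backwards, which usually means the container
        # restarted and the counter reset to zero.
        delta = curr_bytes
    return delta / interval_seconds


def classify_traffic(rate: float, baseline: float, spike_factor: float = 5.0) -> str:
    """Label a rate as 'silent', 'spike', or 'normal' relative to a baseline."""
    if rate == 0:
        return "silent"
    if baseline > 0 and rate > baseline * spike_factor:
        return "spike"
    return "normal"
```

The same two functions work for both rx and tx counters; only the interpretation of a spike differs.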

Disk I/O (optional)

Unless your container does significant disk work (databases, log-heavy services), disk I/O is less critical than CPU and memory. When a container is thrashing disk, it usually shows up as high CPU wait and slow response times before you see disk metrics.

OOM kills: the most misunderstood Docker metric

An OOM kill (Out Of Memory kill) happens when:

A container's process tries to allocate more memory than its configured limit
The Linux kernel's OOM killer steps in and kills the process
Docker records exit code 137 (128 + 9, where 9 is SIGKILL)

This is distinct from a normal exit (code 0) or an application crash (typically code 1 or another non-zero code below 128). OOM kills are silent by default: Docker doesn't alert you, it just restarts the container (if your restart policy says to). If your container is being OOM-killed multiple times per hour, you'll see the restart count climbing, but the logs might not tell you why, because SIGKILL gives the process no chance to log anything.

How to detect an OOM kill manually

# Check exit code of a stopped container
docker inspect my-container | grep -E '"ExitCode"|"OOMKilled"'
# Output (in the order Docker prints the State fields):
# "OOMKilled": true,
# "ExitCode": 137,

# See recent system-level OOM messages
dmesg | grep -iE "oom|out of memory"
# Output:
# [1234567.123456] Out of memory: Kill process 12345 (node) score 987 or sacrifice child
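
If you want to run the same check from a script rather than by eye, the JSON that docker inspect prints can be parsed directly. This is a minimal sketch: the State.ExitCode, State.OOMKilled, and RestartCount fields are part of docker inspect output, while the function name and the returned dictionary shape are made up for illustration.

```python
import json


def container_state(inspect_json: str) -> dict:
    """Extract the crash-relevant fields from `docker inspect` output.

    docker inspect prints a JSON array with one object per container;
    a bare object is accepted too.
    """
    data = json.loads(inspect_json)
    entry = data[0] if isinstance(data, list) else data
    state = entry.get("State", {})
    return {
        "exit_code": state.get("ExitCode"),
        "oom_killed": bool(state.get("OOMKilled", False)),
        "restart_count": entry.get("RestartCount", 0),
    }
```

One way to feed it is to capture the output of docker inspect my-container with subprocess.run and pass the resulting text to container_state.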

Common OOM kill causes

Memory limit set too low — the container needs more than you gave it. Solution: increase the memory limit in your docker-compose.yml:

services:
  api:
    image: myapp:latest
    deploy:
      resources:
        limits:
          memory: 512m  # Increase this if the container is being OOM-killed
        reservations:
          memory: 256m

Memory leak — the application isn't releasing memory it should. The memory limit is actually protecting you here by capping the blast radius, but you still need to fix the leak.
Temporary spike — some workloads need more memory during cold start or while processing large payloads. Consider whether a higher limit or horizontal scaling is the better answer.

Push-based vs pull-based monitoring

Understanding this architecture distinction helps you choose the right tool.

Pull-based (Prometheus model)

The monitoring system periodically sends HTTP requests to your services asking "what are your current metrics?" Your services must expose an HTTP endpoint (a "metrics endpoint") that responds with metrics in a specific format.

Pros:

The monitoring system controls the scrape interval
Easy to see which targets are up/down (if scraping fails, the target is probably down)
Well-suited for service discovery in dynamic environments (k8s)

Cons:

Requires network access from monitoring system to every monitored host
Every container must expose a metrics endpoint (usually requires adding a library or exporter)
If the network between monitoring and targets is down, you have a blind spot
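
To show what "expose a metrics endpoint" involves, here is a bare-bones sketch using only the Python standard library. It renders gauges in the Prometheus text exposition format and serves them at /metrics. The metric names, the current_metrics function, and the port are placeholders; in practice most teams would reach for an official client library instead of hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics: dict) -> str:
    """Render name -> value pairs as gauges in Prometheus text format."""
    lines = []
    for name in sorted(metrics):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(metrics[name])}")
    return "\n".join(lines) + "\n"


def current_metrics() -> dict:
    """Hypothetical hook: replace with real values from your application."""
    return {"app_queue_depth": 0, "app_open_connections": 0}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(current_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given port (assumed value)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

The scraper decides when to call this endpoint, which is the defining property of the pull model: the service only answers, it never initiates.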

Push-based (Kernus model)

Your services proactively send metrics to a central collection point. The agent running on your host pushes container metrics to the cloud.

Pros:

Works behind firewalls and NAT — the agent only needs outbound internet access
No changes to your containers — the agent reads Docker's API directly
Monitoring continues even if the central server has a brief outage (agent buffers)
Much simpler firewall rules: outbound 443 only

Cons:

You don't know a host is down until you stop receiving its metrics (requires "heartbeat" logic)
The monitoring service must handle pushes at scale

Most SaaS monitoring tools (including Kernus) use push-based collection because it works in more network environments without firewall changes. Prometheus uses pull-based collection because it was designed for internal infrastructure, where the monitoring server can reach every target directly and service discovery tells it what to scrape.
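
The "heartbeat" logic mentioned in the push-based cons can be very small on the receiving side: remember when each host last reported, and flag any host that has been quiet for too long. The sketch below is a generic illustration, not a description of how Kernus or any other product implements it; the 90-second timeout is an arbitrary assumption.

```python
def stale_hosts(last_seen: dict, now: float, timeout_seconds: float = 90.0) -> list:
    """Return hosts whose last push is older than timeout_seconds.

    last_seen maps a host name to the unix timestamp of its latest push.
    """
    return sorted(
        host for host, seen_at in last_seen.items()
        if now - seen_at > timeout_seconds
    )
```

A reasonable rule of thumb is to set the timeout to a few multiples of the agent's push interval so that one delayed push doesn't trigger a false "host down" alert.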

Setting up alerts that don't wake you up for nothing

Alert fatigue is real. If your monitoring sends an alert every time CPU spikes for 2 seconds, you'll start ignoring alerts — which defeats the purpose entirely. Good alerting has two properties: high recall (catches real problems) and high precision (doesn't fire on false positives).

Duration-based alerting

Don't alert on instantaneous values. Alert on sustained values. A container at 98% CPU for 10 minutes is a problem. A container at 98% CPU for 3 seconds is a normal startup spike.

Alert: CPU above 85% for 5 consecutive minutes
Alert: Memory above 80% for 10 consecutive minutes
Alert: Restart count increased by 3 in the last 30 minutes
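
The rules above can be sketched as two small checks over timestamped samples. This is an illustrative implementation under simple assumptions: samples arrive oldest first as (timestamp, value) pairs, and a rule only fires once there is enough history to cover the whole window. The function names are hypothetical.

```python
def sustained_above(samples, threshold, duration_seconds, now):
    """True if every sample in the last duration_seconds exceeds threshold.

    samples: list of (unix timestamp, value), oldest first.
    Returns False when the history does not reach back to the window start,
    so a brief spike on a fresh container cannot fire the alert.
    """
    window_start = now - duration_seconds
    if not samples or samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(value > threshold for value in recent)


def restarts_in_window(samples, window_seconds, now):
    """Increase of a cumulative restart counter inside the window.

    samples: list of (unix timestamp, restart count), oldest first.
    """
    window_start = now - window_seconds
    in_window = [count for ts, count in samples if ts >= window_start]
    if len(in_window) < 2:
        return 0
    return in_window[-1] - in_window[0]
```

For example, "CPU above 85% for 5 consecutive minutes" becomes sustained_above(cpu_samples, 85, 300, now), and the restart rule becomes restarts_in_window(restart_samples, 1800, now) >= 3.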

Separate alerts by severity

Not everything needs to wake someone up at 3 AM. Structure your alerts:

Severity	Trigger	Channel
P0 (Page)	Container down, OOM kill loop	Phone call / SMS
P1 (Urgent)	High restart count, sustained 90%+ CPU	Slack + SMS
P2 (Warning)	Memory leak pattern, approaching limits	Slack
P3 (Info)	Unusual network activity, first restart	Email digest

Container-specific thresholds

A database container at 60% memory is normal. A stateless API at 60% memory might indicate a leak. Don't use the same thresholds for every container — tune them based on what you know about each service's behavior.
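
One simple way to express per-container tuning is a set of defaults plus overrides keyed by container name. The sketch below is only an illustration: the container names ("postgres", "api") and the specific numbers are assumptions, and a real setup would load them from whatever configuration your monitoring tool uses.

```python
DEFAULT_THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 80.0}

# Hypothetical overrides: a database is expected to hold memory,
# while a stateless API sitting at 60% deserves a closer look.
SERVICE_OVERRIDES = {
    "postgres": {"memory_percent": 90.0},
    "api": {"memory_percent": 60.0},
}


def thresholds_for(container_name: str) -> dict:
    """Defaults merged with any overrides for this container."""
    merged = dict(DEFAULT_THRESHOLDS)
    merged.update(SERVICE_OVERRIDES.get(container_name, {}))
    return merged


def breaches(container_name: str, metrics: dict) -> list:
    """Names of the metrics that exceed this container's thresholds."""
    limits = thresholds_for(container_name)
    return sorted(
        name for name, limit in limits.items()
        if metrics.get(name, 0.0) > limit
    )
```

The same reading can then be fine for one service and alert-worthy for another, which is the point of tuning thresholds per container.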

Debugging a container in production

When something is wrong, here's a systematic approach:

Step 1: Check current status

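# See all containers and their state
docker ps -a

# Key columns: STATUS, CREATED, PORTS
# Look for: "Restarting (...)", "Exited (137)", or an uptime much shorter than expected

# docker ps does not show the restart count; read it from inspect
docker inspect <container_name> --format='{{.RestartCount}}'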

Step 2: Read the exit code

docker inspect <container_name> --format='{{.State.ExitCode}} {{.State.OOMKilled}}'
# "0 false"   → clean stop (intentional)
# "1 false"   → application crash (read logs)
# "137 true"  → OOM kill (increase memory limit)
# "137 false" → SIGKILL from outside (deployment? manual kill?)
# "143 false" → SIGTERM (graceful shutdown, likely intentional)
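
The exit-code table above follows a general convention: codes above 128 mean "terminated by signal number (code - 128)". A small helper can encode that so scripts and alert messages carry the explanation with them. This is a sketch; the function name and the wording of the messages are illustrative, and the signal table only lists a few common Linux signals.

```python
# A few common Linux signal numbers; extend as needed.
SIGNAL_NAMES = {1: "SIGHUP", 2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit(exit_code: int, oom_killed: bool) -> str:
    """Human-readable reason for a container stop."""
    if oom_killed:
        return "OOM kill: the container exceeded its memory limit"
    if exit_code == 0:
        return "clean stop"
    if exit_code > 128:
        number = exit_code - 128
        name = SIGNAL_NAMES.get(number, f"signal {number}")
        return f"terminated by {name}"
    return "application error: read the logs"
```

Checking oom_killed first matters because an OOM kill and an external kill -9 both produce exit code 137; only the OOMKilled flag tells them apart.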

Step 3: Read the logs

# Last 100 lines of logs
docker logs --tail 100 <container_name>

# Follow logs in real time
docker logs -f <container_name>

# Logs from a specific time window
docker logs --since 2025-01-15T14:30:00 --until 2025-01-15T15:00:00 <container_name>

Look for:

Stack traces (the line before the panic/exception is usually the cause)
"connection refused" (dependency not ready)
"out of memory" or "cannot allocate" (memory issue)
"address already in use" (port conflict)
"permission denied" (volume mount issue or file permissions)

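When the log output is long, a quick scan for the phrases listed above can point you at the right lines. The sketch below is deliberately naive (case-insensitive substring matching); the function name and the result format are assumptions for illustration.

```python
# Phrase -> likely cause, taken from the checklist above.
LOG_HINTS = {
    "connection refused": "dependency not ready",
    "out of memory": "memory issue",
    "cannot allocate": "memory issue",
    "address already in use": "port conflict",
    "permission denied": "volume mount issue or file permissions",
}


def scan_logs(lines):
    """Return (line number, hint, line text) for every matching log line."""
    findings = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for phrase, hint in LOG_HINTS.items():
            if phrase in lowered:
                findings.append((number, hint, line.strip()))
    return findings
```

It takes any iterable of strings, so the saved output of docker logs --tail 100 split into lines works as input.
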
Step 4: Check resource usage

# Real-time stats for all containers
docker stats

# One-shot snapshot
docker stats --no-stream

If a container is at 100% CPU but producing no output, it's likely stuck in an infinite loop or busy-waiting on something. (A container that is merely blocked on I/O or a lock would sit near 0% CPU instead.)

Step 5: Check dependencies

Many container crashes aren't about the container itself — they're about a dependency that isn't ready:

# Check if your database is accessible from inside the container
# (assumes nc is installed in the image; minimal images often omit it)
docker exec <container_name> nc -zv database-host 5432

# Check if an external URL is reachable
docker exec <container_name> curl -sf https://api.example.com/health
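
If the image ships a Python interpreter but neither nc nor curl, the same reachability check can be done with the standard library. This is a minimal sketch under that assumption; the function name and the timeout default are illustrative, and it only confirms that a TCP connection can be opened, not that the dependency is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling tcp_reachable("database-host", 5432) from inside the container answers the same question as the nc command above.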

Tool options for Docker monitoring

cAdvisor (free, self-hosted)

Google's cAdvisor is a standalone container that exposes Docker metrics. Run it, hit its web UI on port 8080, see basic container CPU/memory/network.

Pros: Free, simple, no external dependencies
Cons: No alerting, no persistent history, no multi-host, UI is basic

Best for: dev environments, quick inspection.

Prometheus + Grafana (free, complex)

As covered in our Prometheus guide, this is the industrial-grade self-hosted option. Powerful but requires significant setup and maintenance.

Pros: Extremely flexible, huge ecosystem, handles k8s well
Cons: 5+ containers to maintain, 3-5 hours setup, ongoing maintenance

Best for: large teams with a dedicated platform engineer.

Netdata (free tier available, self-hosted)

Netdata is a high-resolution (per-second) monitoring tool that runs on each host. Easy to install, built-in dashboards, some alerting built in.

Pros: Easy install, beautiful dashboards, very granular metrics
Cons: Primarily per-host (multi-host requires their cloud), free tier limits apply

Best for: teams that want detailed per-host visibility and are comfortable with self-hosted.

Datadog (paid, hosted)

Industry-standard enterprise monitoring. APM, log management, 800+ integrations, ML anomaly detection.

Pros: Most comprehensive platform, excellent k8s support
Cons: Expensive ($200-500+/month for small teams), complex pricing

Best for: large teams with budget, need for APM or extensive integrations.

Kernus (paid, hosted, Docker-focused)

Purpose-built for Docker monitoring. Two commands to set up, automatic container discovery, alerts on all channels.

Pros: Fastest setup (2 minutes), includes status page + uptime badges + digest emails, flat pricing
Cons: Docker-only (no k8s), no APM, no custom metrics from your application

Best for: small-to-mid teams with Docker-first infrastructure who want monitoring without the operational overhead.

The minimum viable monitoring setup

If you do nothing else, do these three things:

Alert when a container stops unexpectedly — set up at least an email alert when any production container enters an exited state
Alert on high restart counts — a container restarting 5 times is a problem, 20 times is a crisis
Track OOM kills — even if you don't alert on them immediately, log them so you know when to increase memory limits

Everything beyond this is additive. Start here.

For a deep dive on one specific problem: OOM kills in Docker — how to detect and prevent them. For setting up alert channels specifically: How to set up Docker container alerts for Slack, Discord, and Telegram.
