March 26, 2025 · 8 min read

How to get alerted when a Docker container goes down

Set up a Docker container down alert in under 5 minutes. Email, Slack, Discord — step by step. The complete guide for anyone who needs to know now.

docker · alerts · monitoring · containers · uptime

If you're reading this, something probably broke and you found out the wrong way. Or you're smart and you're setting things up before something breaks. Either way, this guide exists to solve one specific problem: getting alerted when a Docker container goes down. No fluff. Get through this page and you'll have an alert working in under 5 minutes.

Why polling `docker ps` manually doesn't work

# This is not monitoring. This is hoping.
watch -n 60 docker ps

Running `docker ps` under `watch` is a common "quick fix" that teams reach for when a container first breaks. It works while you're watching. It doesn't work when you close your laptop, go to sleep, or context-switch to another task. You need a system that watches for you and pushes a notification when something breaks.

The three ways a container can go down

Understanding how a container stops helps you understand what to alert on:

1. Application crash (exit code 1, 2, etc.)

The process inside the container exits with a non-zero code. This is an application bug — exception not caught, missing config, failed database query at startup. Docker records the exit code. If there's a restart policy, Docker restarts it; if not, it stays down.

2. OOM kill (exit code 137, OOMKilled=true)

The container exceeds its memory limit. The Linux kernel kills the process. Docker restarts it if configured to. Without monitoring, you might not know this happened — Docker just quietly killed and restarted it, potentially multiple times an hour.

3. Manual stop or deployment

docker stop, docker-compose down, a CI/CD deployment, or an admin running docker rm -f. The container usually exits with code 0 (clean) or 143 (SIGTERM). A forced removal, or a docker stop that hits its timeout, sends SIGKILL and yields 137, the same code as an OOM kill but with OOMKilled=false. If your monitoring alerts on every stop, including intentional ones, you'll get alert fatigue. Good monitoring distinguishes between unexpected stops and expected ones.
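
The cases above can be told apart from two fields that `docker inspect` exposes, `.State.ExitCode` and `.State.OOMKilled`. Here is a minimal sketch of that mapping. The function name `classify_exit` and the labels it prints are illustrative, not part of Docker:

```shell
#!/bin/sh
# classify_exit EXIT_CODE OOM_KILLED
# Maps Docker's recorded exit code and OOMKilled flag to a rough reason.
# The labels are illustrative; adjust them to your own conventions.
classify_exit() {
  code=$1
  oom=$2
  if [ "$oom" = "true" ]; then
    echo "oom-kill"
    return
  fi
  case "$code" in
    0)   echo "clean-exit" ;;
    137) echo "sigkill" ;;   # 128 + 9: SIGKILL, e.g. docker rm -f or a stop timeout
    143) echo "sigterm" ;;   # 128 + 15: SIGTERM, e.g. docker stop
    *)   echo "application-crash" ;;
  esac
}

# Usage against a real container (the name "api-service" is a placeholder):
#   set -- $(docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' api-service)
#   classify_exit "$1" "$2"
```

Checking the OOM flag first matters because an OOM kill and a plain SIGKILL share exit code 137.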

Setting up a container down alert in under 5 minutes

Option A: Kernus (fastest)

This is the fastest path to a working container down alert:

# 1. Install the agent (30 seconds)
curl -fsSL https://kernus.app/install | sh

# 2. Start monitoring (10 seconds)
kernus token YOUR_TOKEN --host company-backend
kernus agent start

Windows: Use WSL or Git Bash for this one-liner, or use PowerShell (see Windows install).

In the Kernus dashboard:

Go to Alerts → Create Rule
Condition: Container is down
Duration: 1 minute (gives Docker time to restart before alerting)
Filter: leave blank to monitor all containers, or set a container name pattern
Channels: select your Slack/Discord/Telegram/Email channel

That's it. Within 2 minutes of setup, any container that stops unexpectedly and doesn't restart within 1 minute will send an alert to your configured channels.

What the alert contains:

Container name and which host it's on
Exit code and exit reason (OOM kill, SIGTERM, application crash)
Last 5-10 log lines from the container at time of exit
Direct link to the container in your dashboard

The log context is critical. When a container goes down at 3 AM and you get a Slack notification, you want to understand why it stopped without SSHing into the server first.

Option B: Bash script + cron (hacky, but works)

If you want zero dependencies and don't mind a bash solution:

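#!/bin/bash
# save as /usr/local/bin/check-containers.sh
# Posts a Slack message for every monitored container that is not running.

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
CONTAINERS_TO_MONITOR=("api-service" "worker" "postgres")

for container in "${CONTAINERS_TO_MONITOR[@]}"; do
  # docker inspect yields an empty status if the container does not exist,
  # so fall back to "missing" instead of sending a blank status.
  status=$(docker inspect --format='{{.State.Status}}' "$container" 2>/dev/null)
  status=${status:-missing}

  if [ "$status" != "running" ]; then
    exit_code=$(docker inspect --format='{{.State.ExitCode}}' "$container" 2>/dev/null)
    message="🚨 Container *$container* is $status (exit code: ${exit_code:-unknown}) on $(hostname)"

    # Container names and hostnames cannot contain quotes, so plain interpolation is safe here.
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d "{\"text\": \"$message\"}"
  fi
done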

Add to cron to run every minute:

chmod +x /usr/local/bin/check-containers.sh
echo "* * * * * root /usr/local/bin/check-containers.sh" | sudo tee /etc/cron.d/docker-monitor
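
Because cron reruns the script every minute, it posts again on every run while a container stays down. One way to limit that, assuming a small state directory is acceptable, is to remember the last status seen and alert only when it changes. The `should_alert` helper and the `STATE_DIR` path are hypothetical additions, not part of the script above:

```shell
#!/bin/sh
# Directory holding one file per container with the last observed status.
# The path is an assumption; any writable location works.
STATE_DIR="${STATE_DIR:-/var/tmp/docker-monitor}"

# should_alert CONTAINER STATUS
# Succeeds (exit 0) only when STATUS is not "running" and differs from
# the status stored on the previous run. Always records STATUS.
should_alert() {
  container=$1
  status=$2
  state_file="$STATE_DIR/$container.status"
  mkdir -p "$STATE_DIR"
  previous=""
  [ -f "$state_file" ] && previous=$(cat "$state_file")
  printf '%s\n' "$status" > "$state_file"
  [ "$status" != "running" ] && [ "$status" != "$previous" ]
}
```

Inside the loop, the curl call would then be wrapped in `if should_alert "$container" "$status"; then ... fi`, so a container that stays down produces one message rather than one per minute.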

Why this is not great:

Polling every minute means up to 60 seconds of downtime before alert
No log context, no exit code analysis
No history — if it crashed and restarted in under a minute, you'll never know
You're responsible for the script, cron job, and webhook URL
Doesn't handle Docker restarts well (could alert on intentional restarts)

Use this as a temporary fix while you set up something proper.
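
The polling delay and the missed quick restarts both come from sampling state once a minute. A different approach, not used in the script above, is to subscribe to the Docker event stream, which emits a `die` event each time a container's main process exits. This is a rough sketch: `format_die_alert` and `watch_die_events` are illustrative names, and the webhook URL is a placeholder:

```shell
#!/bin/sh
SLACK_WEBHOOK="${SLACK_WEBHOOK:-https://hooks.slack.com/services/YOUR/WEBHOOK/URL}"

# format_die_alert NAME EXIT_CODE HOST -> one-line alert text
format_die_alert() {
  printf 'Container %s exited with code %s on %s' "$1" "$2" "$3"
}

# Blocks while the event stream is open, posting one message per "die" event.
# Run it under a supervisor (for example a systemd unit) so it is restarted
# if the Docker daemon restarts and the stream closes.
watch_die_events() {
  docker events --filter type=container --filter event=die \
    --format '{{.Actor.Attributes.name}} {{.Actor.Attributes.exitCode}}' |
  while read -r name exit_code; do
    text=$(format_die_alert "$name" "$exit_code" "$(hostname)")
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d "{\"text\": \"$text\"}"
  done
}

# Start with: watch_die_events
```

This reports every exit, including intentional stops and exits that a restart policy recovers from, so it trades the polling gap for more noise unless you filter on exit code or container name.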

Option C: Prometheus + Alertmanager

For teams already running Prometheus:

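# alert_rules.yml
groups:
  - name: docker
    rules:
      - alert: ContainerDown
        # Fires per container once cAdvisor's last-seen timestamp is more than 60s old.
        # The threshold already encodes the one-minute wait, so "for" stays at 0m.
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is not reporting"
          description: "{{ $labels.name }} on {{ $labels.instance }} hasn't reported in 1 minute"

This fires when a container stops reporting to cAdvisor. Alertmanager routes it to Slack/PagerDuty/etc.

A common pitfall is to write this as absent(container_last_seen{name=~".+"}). With a regex matcher, absent() returns a result only when no container at all is reporting, and that result carries no name label for the annotations. How long cAdvisor keeps exporting the metric for a removed container can depend on its version and flags, so test the rule against your own setup.

The limitation: this rule doesn't know why the container stopped, what exit code it had, or what the last log lines were. You just know it went away.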

Response time: how quickly do you need to know?

This depends on your service and your users:

Service type	Acceptable downtime	Alert channel
Revenue-critical API	< 1 minute	SMS + phone call
Customer-facing web service	< 5 minutes	Slack + SMS
Background worker / queue consumer	< 30 minutes	Slack or Discord
Dev/staging environment	Hours	Email digest
Cron job / batch processor	End of business day	Email

For revenue-critical services, configure your alert with a 0-minute duration (alert immediately on any stop) and route to SMS or a PagerDuty-style on-call rotation.

For background workers that are less critical, allow 5-10 minutes of downtime before alerting — Docker may restart it successfully in that window.
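
For a self-managed check, the grace period can be sketched as a recheck: note that a container is down, wait, and alert only if it is still down. The helper names `container_status` and `down_after_grace` are made up for this example:

```shell
#!/bin/sh
# Prints the container's status, or "missing" if Docker does not know it.
container_status() {
  docker inspect --format '{{.State.Status}}' "$1" 2>/dev/null || echo "missing"
}

# down_after_grace NAME GRACE_SECONDS
# Succeeds only if NAME is not running now and is still not running
# after GRACE_SECONDS. A grace of 0 means "alert immediately".
down_after_grace() {
  name=$1
  grace=$2
  [ "$(container_status "$name")" = "running" ] && return 1
  sleep "$grace"
  [ "$(container_status "$name")" != "running" ]
}

# Example: 0 for a revenue-critical API, 300 for a background worker.
#   down_after_grace api-service 0 && echo "alert now"
```

A container stuck in a restart loop reports `restarting` rather than `running`, so this check still treats it as down once the grace period has passed.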

Testing your alert before it matters

This is the single most important step that teams skip:

# 1. Identify a non-critical container to test with
docker ps

# 2. Stop it
docker stop your-test-container

# 3. Watch your Slack/Discord/Telegram — alert should arrive within 1-2 minutes
# (Or within your configured duration threshold)

# 4. Start it back up
docker start your-test-container

# 5. Verify the "resolved" alert fires (if your tool supports it)

If the alert doesn't arrive, something is misconfigured. Better to find out now than at 3 AM.

Common reasons alerts don't fire during testing:

Wrong container name filter (alert only applies to containers matching a pattern)
Webhook URL is incorrect or the Slack app was revoked
Alert rule is disabled
The notification channel app doesn't have permission to post in that channel (Discord)
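
To rule out the webhook itself, it can help to post a test message directly and look at the HTTP status code. The sketch below assumes a Slack incoming webhook that accepts a JSON `text` field; other services may expect a different payload. The function names are illustrative:

```shell
#!/bin/sh
# webhook_verdict HTTP_STATUS -> "accepted" for any 2xx code, else "rejected"
webhook_verdict() {
  case "$1" in
    2??) echo "accepted" ;;
    *)   echo "rejected" ;;
  esac
}

# post_test_message WEBHOOK_URL
# Sends a test message and prints the HTTP status code curl received
# ("000" if the request never completed).
post_test_message() {
  curl -s -o /dev/null -w '%{http_code}' -X POST "$1" \
    -H 'Content-type: application/json' \
    -d '{"text": "Test alert: please ignore"}'
}

# Usage (the URL is a placeholder):
#   webhook_verdict "$(post_test_message "https://hooks.slack.com/services/YOUR/WEBHOOK/URL")"
```

A "rejected" result points at the webhook URL or the app's permissions. An "accepted" result with no alert arriving points back at the rule or its filter.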

Multiple containers: per-service vs catch-all alerting

For a Docker Compose stack with 5-10 services, you have two approaches:

Catch-all alert

Condition: Container is down
Filter: (all containers)
Duration: 2 minutes
Channel: #production-alerts

Simple, but you'll get alerts for intentional teardowns (when you run docker-compose down for a deployment). Mitigate this by using a filter that excludes containers you frequently restart.

Per-service alert with different urgencies

Alert 1: "api-gateway is down"
  Filter: api-gateway
  Duration: 0 minutes (immediate)
  Channels: SMS + Slack

Alert 2: "background-worker is down"
  Filter: worker-*
  Duration: 5 minutes
  Channels: Slack only

More configuration, but better signal quality. P0 services get immediate SMS alerts; P2 services get Slack-only alerts with a grace period.
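
If you script alerting yourself rather than configuring rules in a dashboard, the same per-service split can be expressed as a small lookup. The `route_for` function and its output format are illustrative. The patterns use shell glob matching, which is an assumption about how a filter such as `worker-*` behaves in your tool:

```shell
#!/bin/sh
# route_for CONTAINER -> "GRACE_SECONDS CHANNELS"
# The first matching pattern wins; the final branch is the catch-all.
route_for() {
  case "$1" in
    api-gateway) echo "0 sms,slack" ;;
    worker-*)    echo "300 slack" ;;
    *)           echo "120 slack" ;;
  esac
}

# Usage:
#   set -- $(route_for worker-3)
#   echo "grace=$1 channels=$2"
```

Keeping the catch-all as the last branch means a newly added service still gets an alert before anyone remembers to give it its own rule.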

What a good alert looks like

Compare these two alert messages:

Bad:

ALERT: Container is down
Host: prod-01

Good:

🚨 api-gateway is down
Host: prod-server-1 | Exit code: 137 (OOM Kill)
Memory at time of kill: 511MB / 512MB limit

Last logs:
  [2025-01-15 03:42:11] Processing request for user 1234567
  [2025-01-15 03:42:11] Fetching report data (4GB dataset)
  [2025-01-15 03:42:12] Allocating output buffer...
  [process killed]

→ View in dashboard: https://app.kernus.app/dashboard/containers/abc123

The good alert tells you what happened, why it happened (OOM, the container tried to load a 4GB dataset), and where to go next. You can diagnose the issue from your phone without opening a laptop.

This is the alert format Kernus sends — container name, host, exit code with classification, and the last log lines captured at time of exit.
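
For a self-managed script, a message along these lines can be assembled from `docker inspect` and `docker logs --tail`. This is a sketch only: `build_alert` is a hypothetical helper, and its layout merely imitates the example above:

```shell
#!/bin/sh
# build_alert NAME HOST EXIT_CODE LOG_LINES -> multi-line alert text
build_alert() {
  printf '%s is down\nHost: %s | Exit code: %s\n\nLast logs:\n%s\n' \
    "$1" "$2" "$3" "$4"
}

# Usage against a real container (the name "api-gateway" is a placeholder):
#   name=api-gateway
#   code=$(docker inspect --format '{{.State.ExitCode}}' "$name")
#   logs=$(docker logs --tail 10 "$name" 2>&1 | sed 's/^/  /')
#   build_alert "$name" "$(hostname)" "$code" "$logs"
```

The result is multi-line and may contain quotes from the logs, so it needs JSON escaping before it goes into a webhook payload.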

If you get this far, do this one thing

If you take one action from this page: set up a single catch-all alert for any container that stops unexpectedly and doesn't restart within 2 minutes. Route it to a channel you actually check — not email, but Slack or Discord where you're already spending your time.

That single alert will catch most container-level outages before your users report them.

For a full alerting setup with thresholds and channel routing: How to set up alerts for Docker containers (Slack, Discord, Telegram). For understanding why containers crash: Docker container keeps restarting — how to debug and fix.

