How to get alerted when a Docker container goes down
Set up a Docker container down alert in under 5 minutes. Email, Slack, Discord — step by step. The complete guide for anyone who needs to know now.
If you're reading this, something probably broke and you found out the wrong way. Or you're smart and you're setting things up before something breaks. Either way, this guide exists to solve one specific problem: getting alerted when a Docker container goes down. No fluff. Get through this page and you'll have an alert working in under 5 minutes.
Why polling docker ps manually doesn't work
# This is not monitoring. This is hoping.
watch -n 60 docker ps
watch with docker ps is a common "quick fix" that teams use when a container first breaks. It works while you're watching. It doesn't work when you close your laptop, go to sleep, or context-switch to another task. You need a system that watches for you and pushes a notification when something breaks.
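One way to get pushed instead of polling is to subscribe to the Docker daemon's event stream. The sketch below assumes bash and a local Docker daemon; the function names `format_die_event` and `watch_die_events` are made up for illustration, and the final `printf` stands in for whatever notifier you actually use.

```shell
#!/usr/bin/env bash
# Sketch: react to container exits as they happen instead of polling.

# Turn one exit into a human-readable line. The host defaults to this machine.
format_die_event() {
  local name=$1 exit_code=$2 host=${3:-$(hostname)}
  printf 'Container %s exited with code %s on %s\n' "$name" "$exit_code" "$host"
}

# Stream "die" events from the Docker daemon and print one line per exit.
# Blocks until interrupted; swap format_die_event for a real notifier call.
watch_die_events() {
  docker events --filter 'type=container' --filter 'event=die' \
    --format '{{.Actor.Attributes.name}} {{.Actor.Attributes.exitCode}}' |
  while read -r name exit_code; do
    format_die_event "$name" "$exit_code"
  done
}

# Usage (runs until Ctrl-C):
#   watch_die_events
```

This still needs something to keep the watcher itself alive (a systemd unit, for example), which is part of why a dedicated monitoring tool is usually the better long-term answer.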
The three ways a container can go down
Understanding how a container stops helps you understand what to alert on:
1. Application crash (exit code 1, 2, etc.)
The process inside the container exits with a non-zero code. This is an application bug — exception not caught, missing config, failed database query at startup. Docker records the exit code. If there's a restart policy, Docker restarts it; if not, it stays down.
2. OOM kill (exit code 137, OOMKilled=true)
The container exceeds its memory limit and the Linux kernel's OOM killer terminates the process with SIGKILL, hence exit code 137 (128 + 9). Docker restarts it if a restart policy is configured. Without monitoring, you might not know this happened: the container is quietly killed and restarted, potentially multiple times an hour.
3. Manual stop or deployment
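docker stop, docker-compose down, a CI/CD deployment, or an admin running docker rm -f. A stopped container exits with code 0 (the process shut down cleanly), 143 (terminated by SIGTERM), or 137 (killed by SIGKILL, either because it ignored SIGTERM past the stop timeout or because docker rm -f or docker kill was used). If your monitoring alerts on every stop, including intentional ones, you'll get alert fatigue. Good monitoring distinguishes between unexpected stops and expected ones.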
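The three categories above can be told apart from two fields Docker records, `State.ExitCode` and `State.OOMKilled`. A minimal sketch, assuming bash; the function name `classify_exit` and the category labels are illustrative, not part of any tool. Exit codes above 128 mean the process was killed by signal number (code - 128), so 137 without the OOM flag is a plain SIGKILL, such as one sent by docker kill.

```shell
# Map an exit code plus the OOMKilled flag to a rough cause.
classify_exit() {
  local exit_code=$1 oom_killed=${2:-false}
  if [ "$oom_killed" = "true" ]; then
    echo "oom-kill"
  elif [ "$exit_code" -eq 0 ]; then
    echo "clean-exit"
  elif [ "$exit_code" -eq 143 ]; then
    echo "sigterm"
  elif [ "$exit_code" -eq 137 ]; then
    echo "sigkill"
  else
    echo "application-crash"
  fi
}

# Read both fields for a named container (api-service is a placeholder):
#   read -r code oom < <(docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' api-service)
#   classify_exit "$code" "$oom"
```

An alerting script could then stay quiet on `clean-exit` and `sigterm` and notify on everything else, which is one simple way to separate expected stops from unexpected ones.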
Setting up a container down alert in under 5 minutes
Option A: Kernus (fastest)
This is the fastest path to a working container down alert:
# 1. Install the agent (30 seconds)
curl -fsSL https://kernus.app/install | sh
# 2. Start monitoring (10 seconds)
kernus token YOUR_TOKEN --host company-backend
kernus agent start
Windows: Use WSL or Git Bash for this one-liner, or use PowerShell (see Windows install).
In the Kernus dashboard:
- Go to Alerts → Create Rule
- Condition: Container is down
- Duration: 1 minute (gives Docker time to restart before alerting)
- Filter: leave blank to monitor all containers, or set a container name pattern
- Channels: select your Slack/Discord/Telegram/Email channel
That's it. Within 2 minutes of setup, any container that stops unexpectedly and doesn't restart within 1 minute will send an alert to your configured channels.
What the alert contains:
- Container name and which host it's on
- Exit code and exit reason (OOM kill, SIGTERM, application crash)
- Last 5-10 log lines from the container at time of exit
- Direct link to the container in your dashboard
The log context is critical. When a container goes down at 3 AM and you get a Slack notification, you want to understand why it stopped without SSHing into the server first.
Option B: Bash script + cron (hacky, but works)
If you want zero dependencies and don't mind a bash solution:
#!/bin/bash
# save as /usr/local/bin/check-containers.sh
# Posts to Slack for every listed container that is not in the "running" state.
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
CONTAINERS_TO_MONITOR=("api-service" "worker" "postgres")

for container in "${CONTAINERS_TO_MONITOR[@]}"; do
  # docker inspect fails if the container was removed; treat that as down too
  status=$(docker inspect --format='{{.State.Status}}' "$container" 2>/dev/null) || status="missing"
  if [ "$status" != "running" ]; then
    exit_code=$(docker inspect --format='{{.State.ExitCode}}' "$container" 2>/dev/null) || exit_code="unknown"
    message="🚨 Container *$container* is $status (exit code: $exit_code) on $(hostname)"
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d "{\"text\": \"$message\"}"
  fi
done
Add to cron to run every minute:
chmod +x /usr/local/bin/check-containers.sh
echo "* * * * * root /usr/local/bin/check-containers.sh" | sudo tee /etc/cron.d/docker-monitor
Why this is not great:
- Polling every minute means up to 60 seconds of downtime before alert
- No log context, no exit code analysis
- No history — if it crashed and restarted in under a minute, you'll never know
- You're responsible for the script, cron job, and webhook URL
- Doesn't handle Docker restarts well (could alert on intentional restarts)
Use this as a temporary fix while you set up something proper.
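If you do keep the script around for a while, the missing log context can be added, but log lines contain quotes and newlines that would break the hand-built JSON. The sketch below assumes bash and escapes only the characters most likely to appear in log output (backslash, double quote, newline, carriage return, tab); `json_escape` and `slack_payload` are illustrative names, and a tool such as jq would be more thorough if it is available.

```shell
# Escape a string for embedding inside a JSON string literal (bash only).
json_escape() {
  local s=$1
  s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}      # double quote
  s=${s//$'\n'/\\n}    # newline
  s=${s//$'\r'/\\r}    # carriage return
  s=${s//$'\t'/\\t}    # tab
  printf '%s' "$s"
}

# Build the Slack payload for a message that may contain quotes and newlines.
slack_payload() {
  printf '{"text": "%s"}' "$(json_escape "$1")"
}

# Usage with the script above (api-service is a placeholder):
#   logs=$(docker logs --tail 5 api-service 2>&1)
#   curl -s -X POST "$SLACK_WEBHOOK" -H 'Content-type: application/json' \
#     -d "$(slack_payload "api-service is down. Last logs:"$'\n'"$logs")"
```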
Option C: Prometheus + Alertmanager
For teams already running Prometheus:
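# alert_rules.yml
groups:
  - name: docker
    rules:
      - alert: ContainerDown
        # Per-container check: last cAdvisor sighting is more than 60 seconds old
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is not reporting"
          description: "{{ $labels.name }} on {{ $labels.instance }} hasn't reported in 1 minute"
This fires when cAdvisor has not seen a container for more than 60 seconds; the 1-minute threshold lives in the expression itself. Alertmanager routes it to Slack/PagerDuty/etc. The rule compares timestamps per container rather than using absent(container_last_seen{name=~".+"}), because absent() with a regex matcher only returns a result when no container matches at all, and that result carries no name label for the annotations.
The limitation: container_last_seen doesn't tell you why the container stopped, what exit code it had, or what the last log lines were. You just know it went away.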
Response time: how quickly do you need to know?
This depends on your service and your users:
| Service type | Acceptable downtime | Alert channel |
|---|---|---|
| Revenue-critical API | < 1 minute | SMS + phone call |
| Customer-facing web service | < 5 minutes | Slack + SMS |
| Background worker / queue consumer | < 30 minutes | Slack or Discord |
| Dev/staging environment | Hours | Email digest |
| Cron job / batch processor | End of business day | |
For revenue-critical services, configure your alert with a 0-minute duration (alert immediately on any stop) and route to SMS or a PagerDuty-style on-call rotation.
For background workers that are less critical, allow 5-10 minutes of downtime before alerting — Docker may restart it successfully in that window.
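The duration setting boils down to comparing how long a container has been down against a per-service grace period. A minimal sketch of that comparison, assuming bash; `down_longer_than` is an illustrative name, and the usage lines assume GNU date for parsing Docker's timestamp format.

```shell
# Succeed (exit status 0) if a container that stopped at $1 (epoch seconds)
# has been down for at least $3 seconds as of $2 (epoch seconds).
down_longer_than() {
  local finished_at=$1 now=$2 grace=$3
  [ $(( now - finished_at )) -ge "$grace" ]
}

# Usage sketch (worker is a placeholder container name):
#   finished=$(date -d "$(docker inspect --format '{{.State.FinishedAt}}' worker)" +%s)
#   down_longer_than "$finished" "$(date +%s)" 300 && echo "worker down for 5+ minutes"
```

With a grace period of 0 the check succeeds as soon as the container has stopped, which corresponds to the immediate alerting described for revenue-critical services.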
Testing your alert before it matters
This is the single most important step that teams skip:
# 1. Identify a non-critical container to test with
docker ps
# 2. Stop it
docker stop your-test-container
# 3. Watch your Slack/Discord/Telegram — alert should arrive within 1-2 minutes
# (Or within your configured duration threshold)
# 4. Start it back up
docker start your-test-container
# 5. Verify the "resolved" alert fires (if your tool supports it)
If the alert doesn't arrive, something is misconfigured. Better to find out now than at 3 AM.
Common reasons alerts don't fire during testing:
- Wrong container name filter (alert only applies to containers matching a pattern)
- Webhook URL is incorrect or the Slack app was revoked
- Alert rule is disabled
- The bot or integration doesn't have permission to post in the target channel (common with Discord)
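A clean docker stop only exercises one path. If your rule distinguishes unexpected stops from expected ones, it may be worth testing with a container that actually crashes. The sketch below assumes bash and that the public alpine image can be pulled; the function names and the alert-test container name are placeholders.

```shell
# Start a disposable container that exits non-zero after a delay, so the
# alert sees a crash (exit code 1) rather than a clean `docker stop`.
simulate_crash() {
  local name=${1:-alert-test} delay=${2:-5}
  docker run -d --name "$name" alpine sh -c "sleep $delay; exit 1"
}

# Remove the test container once the alert has arrived.
cleanup_crash_test() {
  docker rm -f "${1:-alert-test}"
}

# Usage:
#   simulate_crash alert-test 5   # wait for the alert, then:
#   cleanup_crash_test alert-test
```

If your alert rule uses a container name filter, pick a test name that matches it; otherwise the missing alert only tells you the filter works.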
Multiple containers: per-service vs catch-all alerting
For a Docker Compose stack with 5-10 services, you have two approaches:
Catch-all alert
Condition: Container is down
Filter: (all containers)
Duration: 2 minutes
Channel: #production-alerts
Simple, but you'll get alerts for intentional teardowns (for example, when you run docker-compose down for a deployment). Mitigate this with a filter that excludes containers you frequently restart.
Per-service alert with different urgencies
Alert 1: "api-gateway is down"
Filter: api-gateway
Duration: 0 minutes (immediate)
Channels: SMS + Slack
Alert 2: "background-worker is down"
Filter: worker-*
Duration: 5 minutes
Channels: Slack only
More configuration, but better signal quality. P0 services get immediate SMS alerts; P2 services get Slack-only alerts with a grace period.
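For a script-based setup, the same per-service routing can be expressed as a pattern match on the container name. This is a sketch only: `route_for` is a hypothetical helper, the output format is invented for illustration, and the 120-second default mirrors the catch-all example above.

```shell
# Print "<grace_seconds> <channels>" for a container name.
route_for() {
  case $1 in
    api-gateway) echo "0 sms,slack" ;;   # P0: alert immediately
    worker-*)    echo "300 slack" ;;     # P2: 5-minute grace period
    *)           echo "120 slack" ;;     # catch-all: 2-minute grace period
  esac
}

# Usage:
#   read -r grace channels <<< "$(route_for worker-1)"
```

Note that glob patterns are anchored at the start: `worker-*` matches `worker-1` but not `background-worker`, which would fall through to the catch-all. Check your filters against the real container names.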
What a good alert looks like
Compare these two alert messages:
Bad:
ALERT: Container is down
Host: prod-01
Good:
🚨 api-gateway is down
Host: prod-server-1 | Exit code: 137 (OOM Kill)
Memory at time of kill: 511MB / 512MB limit
Last logs:
[2025-01-15 03:42:11] Processing request for user 1234567
[2025-01-15 03:42:11] Fetching report data (4GB dataset)
[2025-01-15 03:42:12] Allocating output buffer...
[process killed]
→ View in dashboard: https://app.kernus.app/dashboard/containers/abc123
The good alert tells you what happened, why it happened (OOM, the container tried to load a 4GB dataset), and where to go next. You can diagnose the issue from your phone without opening a laptop.
This is the alert format Kernus sends — container name, host, exit code with classification, and the last log lines captured at time of exit.
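If you are building alerts yourself, a message in roughly this shape can be assembled from values Docker already exposes. The sketch below is not how Kernus builds its alerts; `format_alert` is an illustrative bash function that only formats values the caller has collected.

```shell
# Assemble a multi-line alert: what stopped, where, why, and the last log lines.
format_alert() {
  local name=$1 host=$2 exit_code=$3 reason=$4 logs=$5
  printf '%s is down\n' "$name"
  printf 'Host: %s | Exit code: %s (%s)\n' "$host" "$exit_code" "$reason"
  printf 'Last logs:\n%s\n' "$logs"
}

# Usage (api-gateway is a placeholder container name):
#   format_alert api-gateway "$(hostname)" 137 "OOM Kill" \
#     "$(docker logs --tail 5 api-gateway 2>&1)"
```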
If you get this far, do this one thing
If you take one action from this page: set up a single catch-all alert for any container that stops unexpectedly and doesn't restart within 2 minutes. Route it to a channel you actually check — not email, but Slack or Discord where you're already spending your time.
That single alert will catch most container-level outages before your users do.
For a full alerting setup with thresholds and channel routing: How to set up alerts for Docker containers (Slack, Discord, Telegram). For understanding why containers crash: Docker container keeps restarting — how to debug and fix.