How we built a monitoring SaaS with 90% gross margins
This is the post I wish existed when we started building Kernus. Full transparency: the architecture decisions that enabled high margins, the actual cost breakdown per customer tier, the real gross margin numbers, and what we'd do differently. If you're building a SaaS product — especially a developer tool — I hope this gives you something concrete to work with.
Kernus is a Docker container monitoring SaaS. We charge $29/month (Pro), $99/month (Business), and $199+/month (Enterprise). This post breaks down why those numbers work economically, what our COGS actually looks like, and the decisions that got us here.
Why we built it
We had a Docker-based production system. We needed monitoring. The options were:
- Datadog — $460+/month for our setup, complex to configure
- Prometheus + Grafana — free but 6-8 hours to set up correctly, ongoing maintenance
- Nothing — which is what we were doing before a production incident changed our mind
The incident: a container was OOM-killed repeatedly for 4 hours before anyone noticed. We found out when a user filed a support ticket. The container's logs showed exactly what happened, but nobody was watching. We wanted a tool that would have paged us within 5 minutes and included those log lines in the alert. Nothing on the market did that for under $30/month with a 2-minute setup. So we built it.
The architecture decisions that enabled high margins
Decision 1: Push-based collection with edge processing
The canonical monitoring architecture is pull-based: the monitoring server polls each host for metrics. We went push-based from day one, and we went further — we do significant processing on the edge (in the agent running on the customer's host) before sending anything to our servers.
The agent:
- Reads Docker's API directly via the Unix socket
- Computes deltas instead of sending raw values (we don't need "memory used = 512MB" every 30 seconds — we need changes and summary values)
- Deduplicates unchanged states (if a container's CPU has been at 2% for 5 minutes, we don't need to send 10 identical data points)
- Implements local state tracking for OOM detection and exit code correlation
The result: the agent sends approximately 20% of the raw data that a naive implementation would send. Because ingestion cost scales roughly with the volume of data received, this cuts our ingestion infrastructure cost by about 80%.
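To make the deduplication step concrete, here is a minimal sketch in Go. It is not the Kernus agent's code: the Sample and Deduper types, the metric name, and the tolerance value are assumptions chosen for illustration, and it covers only the "skip unchanged states" idea, not delta encoding or OOM tracking.

```go
package main

import (
	"fmt"
	"math"
)

// Sample is one reading for a single container metric.
// The type and field names are illustrative, not Kernus's wire format.
type Sample struct {
	ContainerID string
	Metric      string
	Value       float64
}

// Deduper remembers the last value sent per (container, metric) pair and
// suppresses readings that have not moved by more than a tolerance.
type Deduper struct {
	tolerance float64
	last      map[string]float64
}

func NewDeduper(tolerance float64) *Deduper {
	return &Deduper{tolerance: tolerance, last: make(map[string]float64)}
}

// Filter returns only the samples worth sending: first sightings and
// values that changed by more than the tolerance since the last send.
func (d *Deduper) Filter(batch []Sample) []Sample {
	var out []Sample
	for _, s := range batch {
		key := s.ContainerID + "/" + s.Metric
		prev, seen := d.last[key]
		if seen && math.Abs(s.Value-prev) <= d.tolerance {
			continue // unchanged state: nothing new to report
		}
		d.last[key] = s.Value
		out = append(out, s)
	}
	return out
}

func main() {
	d := NewDeduper(0.5) // hypothetical tolerance in percentage points
	first := d.Filter([]Sample{{"web", "cpu_percent", 2.0}})
	second := d.Filter([]Sample{{"web", "cpu_percent", 2.1}}) // within tolerance
	third := d.Filter([]Sample{{"web", "cpu_percent", 9.0}})  // real change
	fmt.Println(len(first), len(second), len(third))          // prints: 1 0 1
}
```

Comparing against the last value that was actually sent, rather than the previous reading, keeps a slow drift from being suppressed forever.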
Decision 2: ClickHouse as the primary time-series store
We evaluated InfluxDB, TimescaleDB, and ClickHouse. ClickHouse won on compression ratio: 10:1 on container metric data. This is the single most important architectural decision for our margins.
At 10:1 compression:
- 1 month of metrics for a 5-host Pro customer: ~500MB raw → ~50MB stored
- 1 month of metrics for a mid-size Business customer (~15 hosts; plan allows up to 30): ~1.5GB raw → ~150MB stored
At Railway's resource pricing (Railway is our hosting platform), storage costs us literally cents per organization per month. The Business plan's 30-day retention costs us about $0.10-0.30 in storage per organization.
For the full ClickHouse vs InfluxDB analysis, read our technical comparison.
Decision 3: Go for the backend, Next.js for the frontend
Go was the obvious choice for the backend and agent. It compiles to small static binaries, has excellent concurrency primitives for handling many concurrent agent connections, and the Docker API client library is mature.
The agent binary is ~12MB. It runs in about 50MB of RAM with no other runtime dependencies. Customers can install it with a curl | sh one-liner and have it running in under 60 seconds.
The backend handles:
- Agent data ingestion (Go HTTP handler → ClickHouse batch writes)
- Alert evaluation (in-process Go, checking thresholds against incoming metrics)
- Alert delivery (6 notification channels — Email, Slack, Discord, Telegram, Webhook, SMS)
- API for the Next.js frontend (REST)
- Weekly digest email generation
- Stripe webhook handling for billing
One Go binary. No microservices. This is the right choice at our scale — it avoids distributed systems complexity before we need it.
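As an illustration of what in-process alert evaluation can look like, here is a small Go sketch. The Rule and Evaluator types, the "sustained for a duration" semantics, and the metric name are assumptions for the example, not a description of Kernus's actual rule engine.

```go
package main

import (
	"fmt"
	"time"
)

// Rule is a hypothetical threshold rule: fire when a metric stays above
// Threshold for at least For.
type Rule struct {
	Metric    string
	Threshold float64
	For       time.Duration
}

// Evaluator keeps per-container breach state in process memory.
type Evaluator struct {
	rule        Rule
	breachSince map[string]time.Time
	fired       map[string]bool
}

func NewEvaluator(r Rule) *Evaluator {
	return &Evaluator{
		rule:        r,
		breachSince: make(map[string]time.Time),
		fired:       make(map[string]bool),
	}
}

// Observe feeds one reading and reports whether an alert should fire now.
// It fires once per sustained breach and re-arms when the value recovers.
func (e *Evaluator) Observe(containerID string, value float64, at time.Time) bool {
	if value <= e.rule.Threshold {
		delete(e.breachSince, containerID)
		delete(e.fired, containerID)
		return false
	}
	start, ok := e.breachSince[containerID]
	if !ok {
		start = at
		e.breachSince[containerID] = at
	}
	if e.fired[containerID] || at.Sub(start) < e.rule.For {
		return false
	}
	e.fired[containerID] = true
	return true
}

// minuteMark returns a fixed start time plus n minutes, for the demo below.
func minuteMark(n int) time.Time {
	return time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).Add(time.Duration(n) * time.Minute)
}

func main() {
	e := NewEvaluator(Rule{Metric: "memory_percent", Threshold: 90, For: 2 * time.Minute})
	for n, v := range []float64{95, 96, 97, 98, 50} {
		fmt.Printf("minute %d value %.0f fire=%v\n", n, v, e.Observe("web", v, minuteMark(n)))
	}
}
```

In this sketch the alert fires at minute 2, stays quiet while the breach continues, and resets once the value drops back under the threshold. Keeping this state in the same process as ingestion is what makes a single binary sufficient at small scale.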
Decision 4: Railway for infrastructure
We host on Railway. This deserves more attention than it usually gets in indie hacker posts.
Railway's pricing model works well for vertical SaaS:
- Services are billed per-resource (CPU seconds, memory GB-hours, network GB)
- No per-server overhead — you pay for what you use
- Deployment is simple: railway up
- PostgreSQL and Redis are first-class citizens in the platform
For our current scale, Railway runs the entire Kernus backend (API, ClickHouse, PostgreSQL, Redis) for approximately $80-120/month total infrastructure cost. This includes all customer data.
The actual cost breakdown per customer tier
This is where it gets concrete. COGS (Cost of Goods Sold) per organization per month:
Free tier: COGS ~$0.15/org/month
| Component | Cost |
|---|---|
| ClickHouse storage (1-day retention, 1 host) | ~$0.01 |
| API calls (limited) | ~$0.01 |
| Email sends (alerts + verification) | ~$0.05 |
| Agent data ingestion | ~$0.03 |
| Stripe processing | $0 (free tier, no payment) |
| Shared infrastructure allocation | ~$0.05 |
| Total COGS | ~$0.15/month |
Revenue: $0. Gross margin: undefined (loss leader — free tier exists for acquisition).
Pro tier: COGS ~$2.70-3.20/org/month
| Component | Cost |
|---|---|
| ClickHouse storage (7-day retention, 5 hosts) | ~$0.30 |
| API calls | ~$0.20 |
| Alert notifications (Slack, email, etc.) | ~$0.30-0.80 |
| Email sends (weekly digest, etc.) | ~$0.20 |
| Agent data ingestion (5 hosts) | ~$0.25 |
| Stripe processing (2.9% + $0.30) | ~$1.14 |
| Shared infrastructure allocation | ~$0.30 |
| Total COGS | ~$2.70-3.20/month |
Revenue: $29/month. Gross margin: ($29 - $3.20) / $29 = ~89% at the top of the COGS range and ($29 - $2.70) / $29 = ~91% at the bottom, so roughly 90%.
Business tier: COGS ~$9.40-10.90/org/month
| Component | Cost |
|---|---|
| ClickHouse storage (30-day retention, ~15 hosts; Business allows up to 30) | ~$1.50 |
| API calls (heavier usage) | ~$0.80 |
| Alert notifications (higher volume) | ~$1.50-3.00 |
| Email sends | ~$0.50 |
| Agent data ingestion (~15 hosts) | ~$0.75 |
| Stripe processing (2.9% + $0.30) | ~$3.17 |
| Shared infrastructure allocation | ~$1.20 |
| Total COGS | ~$9.40-10.90/month |
Revenue: $99/month. Gross margin: ($99 - $10.90) / $99 = ~89% at the top of the COGS range and ($99 - $9.40) / $99 = ~90.5% at the bottom, so roughly 90%.
The pattern holds. Gross margins stay around 90% across tiers because the primary costs scale with usage (storage, notifications) and the architecture is efficient enough that the storage cost per org is very low.
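For readers who want to rerun the arithmetic with their own numbers, here is a short Go sketch of the two formulas used in the tables: the Stripe fee at 2.9% + $0.30 and gross margin as (revenue - COGS) / revenue. The function names are ours for this example, and the COGS inputs in main are round figures picked from within the ranges above.

```go
package main

import "fmt"

// stripeFee returns the card processing fee for one charge at
// 2.9% + $0.30, the rate quoted in the cost tables.
func stripeFee(price float64) float64 {
	return price*0.029 + 0.30
}

// grossMargin returns (revenue - cogs) / revenue as a fraction.
func grossMargin(revenue, cogs float64) float64 {
	return (revenue - cogs) / revenue
}

func main() {
	fmt.Printf("Stripe fee on $29: $%.2f\n", stripeFee(29))
	fmt.Printf("Pro margin at $3.00 COGS: %.1f%%\n", 100*grossMargin(29, 3.00))
	fmt.Printf("Business margin at $10.00 COGS: %.1f%%\n", 100*grossMargin(99, 10.00))
}
```

Note that the fixed $0.30 portion of the Stripe fee weighs more heavily on lower prices, which is one reason the processing line is the largest single item in the Pro table.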
The pieces that nearly killed the margins
Alert notification costs
SMS alerts are expensive. Twilio charges ~$0.0075/message (US), more for international. A Business customer with SMS enabled on 10 alert rules that together fire 5 alerts/day sends 150 SMS/month, which is about $1.12/org/month in SMS costs alone. Add international destinations and it compounds.
We addressed this by: (1) rate-limiting SMS alerts per plan, (2) not including SMS in the Pro tier (only Business+), and (3) not charging a loss-leader price on SMS — Business at $99 has the margin to absorb it.
Email is cheap (we use Resend — ~$0.001/email). Slack/Discord/Telegram webhooks are free. The expensive channels (SMS, phone calls) are gated behind higher tiers.
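The SMS math above, plus the effect of a per-plan cap, can be sketched in a few lines of Go. The function name and the cap of 20 messages/day are hypothetical; only the $0.0075 price and the 5 alerts/day scenario come from the text.

```go
package main

import "fmt"

// smsMonthlyCost estimates monthly SMS spend for one org.
// dailyCap <= 0 means no cap. All inputs are illustrative.
func smsMonthlyCost(alertsPerDay, dailyCap, days int, pricePerSMS float64) float64 {
	sent := alertsPerDay
	if dailyCap > 0 && sent > dailyCap {
		sent = dailyCap // rate limit: drop SMS beyond the plan's daily cap
	}
	return float64(sent*days) * pricePerSMS
}

func main() {
	// Scenario from the text: 5 alerts/day, US pricing, 30-day month.
	fmt.Printf("typical: $%.3f\n", smsMonthlyCost(5, 0, 30, 0.0075))
	// A noisy org firing 50 alerts/day, without and with a hypothetical cap of 20/day.
	fmt.Printf("noisy, uncapped: $%.2f\n", smsMonthlyCost(50, 0, 30, 0.0075))
	fmt.Printf("noisy, capped:   $%.2f\n", smsMonthlyCost(50, 20, 30, 0.0075))
}
```

The cap matters less for the typical customer than for the noisy one: it bounds the worst case so a single misconfigured rule cannot eat the tier's margin.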
Data ingestion at scale
We use batch inserts to ClickHouse — agents push metrics every 30 seconds as a batch. This amortizes the per-request cost across many data points. An agent handling 20 containers sends one HTTP request every 30 seconds with up to 100 data points per request (20 containers × 5 metrics each = 100 data points per batch).
Naively implementing this as 100 individual inserts would have been ~30× more expensive.
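A minimal Go sketch of the batching idea follows. The Batcher type and its send callback are placeholders: send stands in for one HTTP POST on the agent side or one multi-row INSERT on the server side, and the metric names are invented for the example.

```go
package main

import "fmt"

// Point is one data point; field names are illustrative.
type Point struct {
	ContainerID string
	Metric      string
	Value       float64
}

// Batcher buffers points and hands them to send in one call per flush.
// send is a placeholder for a single request or a single batch insert.
type Batcher struct {
	buf  []Point
	send func(batch []Point) error
}

func NewBatcher(send func(batch []Point) error) *Batcher {
	return &Batcher{send: send}
}

// Add queues a point for the next flush.
func (b *Batcher) Add(p Point) { b.buf = append(b.buf, p) }

// Flush sends everything buffered as one batch and clears the buffer.
// On error the points are kept so the next flush can retry them.
func (b *Batcher) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	if err := b.send(b.buf); err != nil {
		return err
	}
	b.buf = nil
	return nil
}

func main() {
	requests, rows := 0, 0
	b := NewBatcher(func(batch []Point) error {
		requests++
		rows += len(batch)
		return nil
	})
	metrics := []string{"cpu", "memory", "net_rx", "net_tx", "restarts"}
	for c := 0; c < 20; c++ {
		for _, m := range metrics {
			b.Add(Point{ContainerID: fmt.Sprintf("c%02d", c), Metric: m, Value: 1})
		}
	}
	if err := b.Flush(); err != nil {
		fmt.Println("flush failed:", err)
		return
	}
	fmt.Printf("%d data points in %d request\n", rows, requests) // 100 data points in 1 request
}
```

In a long-running agent, Flush would typically be driven by a 30-second time.Ticker rather than called by hand.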
Multi-tenancy in ClickHouse
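Every query must be filtered by org_id. We keep each org's data contiguous in ClickHouse by making org_id the leading column of the sorting key: ENGINE = MergeTree() PARTITION BY toYYYYMM(collected_at) ORDER BY (org_id, container_id, collected_at). Data is partitioned by month and, within each partition, sorted by org first, so ClickHouse's sparse primary index can skip the granules that belong to other orgs. Queries for one org don't scan data for others.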
Getting this wrong early would have made per-org query costs unacceptable at scale.
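Here is a hedged Go sketch of how that layout and the "always filter by org_id" rule might be expressed. The ENGINE, PARTITION BY, and ORDER BY clauses are the ones quoted above; the table name, column list, and the orgQuery helper are assumptions for illustration, and the ? placeholder style depends on the ClickHouse driver in use.

```go
package main

import (
	"fmt"
	"strings"
)

// createTable reuses the engine, partition, and sort clauses from the text.
// The table name and columns are assumptions, not the real Kernus schema.
const createTable = `
CREATE TABLE container_metrics (
    org_id        UInt64,
    container_id  String,
    collected_at  DateTime,
    metric        LowCardinality(String),
    value         Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(collected_at)
ORDER BY (org_id, container_id, collected_at)`

// orgQuery builds a read query that always filters on org_id first,
// returning the SQL text and its positional arguments.
func orgQuery(orgID uint64, containerID string, sinceUnix int64) (string, []any) {
	q := "SELECT collected_at, metric, value FROM container_metrics " +
		"WHERE org_id = ? AND container_id = ? AND collected_at >= toDateTime(?) " +
		"ORDER BY collected_at"
	return q, []any{orgID, containerID, sinceUnix}
}

func main() {
	fmt.Println(strings.TrimSpace(createTable))
	q, args := orgQuery(42, "web-1", 1700000000)
	fmt.Println(q)
	fmt.Println(args...)
}
```

Routing every read through one helper like this is a cheap way to make a missing org_id filter impossible to write by accident.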
What we'd do differently
1. Log aggregation earlier
Customers ask for log search. Right now, we capture log snapshots at crash time and include them in alerts, but we don't have a full log aggregation pipeline. Building log ingestion on ClickHouse with our push-based architecture would have been natural. We're doing it now, but we wish we'd done it in the first six months.
2. Internal usage dashboards earlier
We built plan-based limits (alerts per day, hosts per org) but we should have also built real-time usage dashboards for our own business. "How many alert events are we processing per org per day?" is a question that's hard to answer quickly because we didn't build the internal tooling for it early enough.
3. Pricing validation
We launched with Pro at $19/month. We moved it to $29/month after conversations with users and saw almost no churn. We should have started at $29 or even $39. The monitoring category supports higher pricing than we thought: when your product is "we tell you when your production infrastructure is broken," people pay for reliability.
Where the ceiling is and how to break it
At ~90% gross margin, the main growth constraint is customer acquisition, not unit economics. Each additional paying customer is almost entirely incremental margin after the customer acquisition cost is recovered.
The ceiling risk is infrastructure disruption: if ClickHouse or Railway changes pricing significantly, or if AWS/GCP decides to build a competing product and undercut us on price, the economics change. Mitigation: staying lean, having enough margin to absorb price changes, building customer loyalty.
The upside: with 90% gross margins and customer acquisition as the primary constraint, marketing dollars go very far. A customer acquired for $50-100 (CAC from SEO and content) who pays $29/month covers that acquisition cost in 2-4 months. That's healthy SaaS economics.
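The payback estimate can be checked with a one-line formula: CAC divided by monthly gross profit per customer. The Go sketch below uses a function name of our own choosing and the figures from the paragraph above.

```go
package main

import "fmt"

// paybackMonths returns how many months of gross profit it takes to
// recover the customer acquisition cost (CAC).
func paybackMonths(cac, monthlyPrice, grossMargin float64) float64 {
	return cac / (monthlyPrice * grossMargin)
}

func main() {
	// Pro plan at $29/month and a 90% gross margin.
	fmt.Printf("CAC $50:  %.1f months\n", paybackMonths(50, 29, 0.90))
	fmt.Printf("CAC $100: %.1f months\n", paybackMonths(100, 29, 0.90))
}
```

Both results land just under 2 and 4 months respectively, which matches the 2-4 month range once rounded up to whole billing cycles.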
Lessons for anyone building a developer tool SaaS
- Pick the right storage engine from day one. Wrong choice here costs you in margins forever. We evaluated before we built; it was worth the few weeks.
- Do math on the expensive operations early. Alert notifications, emails, SMS: know what they cost per customer before you set your pricing.
- Price based on value, not cost. "What does it cost us?" is the floor. "What would you pay to know immediately when your production containers crash?" is the ceiling. There's a lot of room between them.
- Flat pricing is a feature. Usage-based pricing optimizes SaaS-provider revenue at the expense of customer trust. Developers hate unpredictable bills. We charge by host count, so customers know their bill on day one.
- Engineer your COGS, not just your product. The architecture decisions that got us to 90% margins enabled us to price competitively. Those decisions can't be made later without a full rewrite.
If you're building a developer tool: the ClickHouse vs InfluxDB technical comparison and the self-hosted vs SaaS tradeoffs might be useful context.
If you just want to monitor your Docker containers without paying Datadog prices: Kernus is $29/month for 5 hosts →