Most small hosting companies use third-party monitoring services: UptimeRobot, Datadog, New Relic. These are good tools. We use something different: monitoring software we wrote ourselves, running on our own infrastructure.
Here's how it works and why we built it this way.
The Watchman
The core of our monitoring is a daemon we call the Watchman. It runs continuously on our primary server and checks the health of every service in our infrastructure every five minutes.
What it checks:
- TCP connectivity — is the port open and accepting connections?
- HTTP responses — does the service return the expected status code?
- SMTP handshakes — does the mail server greet correctly?
- DNS resolution — does our domain resolve to the correct IP?
The Watchman writes a structured journal of every check result: timestamp, service name, check type, result, and detail. This journal is the source of truth for our platform health history.
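The check-and-journal loop can be sketched roughly like this. This is a minimal illustration, not the Watchman's actual code — the function names and journal fields are our own stand-ins for the structure described above:

```python
import json
import socket
import time

def check_tcp(host: str, port: int, timeout: float = 5.0) -> bool:
    """TCP connectivity check: is the port open and accepting connections?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def journal_entry(service: str, check_type: str, ok: bool, detail: str = "") -> str:
    """One structured journal line: timestamp, service name, check type, result, detail."""
    return json.dumps({
        "ts": int(time.time()),
        "service": service,
        "check": check_type,
        "result": "pass" if ok else "fail",
        "detail": detail,
    })
```

Each five-minute cycle would run a check like `check_tcp` per service and append one `journal_entry` line per result, giving the append-only health history.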
flame-guardian and Guardian-X
flame-guardian watches the SSH auth log and nginx access log in real time. When it detects a brute-force attack pattern or a probe for a hostile path, it bans the offending IP using nftables and records the event. Bans escalate through 10 tiers automatically — repeat offenders get progressively longer bans up to permanent Guardian Jail.
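The escalation logic amounts to a tier-to-duration schedule. The exact schedule below is hypothetical — doubling from a 10-minute base is our illustration of "progressively longer bans up to permanent":

```python
def ban_duration(tier: int, base_seconds: int = 600) -> float:
    """Escalating ban length per tier; tier 10 is the permanent Guardian Jail.

    Assumption for illustration: each tier doubles the previous ban,
    starting from a 10-minute base.
    """
    if tier >= 10:
        return float("inf")  # permanent: never expires
    return base_seconds * 2 ** (tier - 1)
```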
Guardian-X sits below flame-guardian at the XDP kernel layer, dropping packets from the highest-tier permanent bans before they even reach nftables. It's the fastest possible enforcement point: packets are dropped before the kernel network stack processes them, before any connection state is allocated.
Both services report ban activity to Discord, giving a real-time feed of attack traffic patterns without requiring anyone to watch log files.
The Sentinel
The Sentinel is a heartbeat service. Every 60 seconds it posts a status update to our Discord channel. If the Sentinel stops posting, something has gone very wrong with the primary server.
This is the simplest and most reliable form of monitoring: the absence of a signal is itself a signal. And because the Sentinel runs on the primary server itself rather than polling from outside, its silence means we've lost the primary server entirely.
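The detection side of a dead-man's switch like this is a few lines. The three-missed-beats threshold below is our assumption, not the Sentinel's actual tolerance:

```python
HEARTBEAT_INTERVAL = 60   # the Sentinel posts every 60 seconds
MISSED_BEATS_ALARM = 3    # missed beats before we call it an outage (illustrative)

def sentinel_silent(last_post_ts: float, now: float) -> bool:
    """Absence-of-signal detection: has the Sentinel gone quiet for too long?"""
    return (now - last_post_ts) > HEARTBEAT_INTERVAL * MISSED_BEATS_ALARM
```

A small watcher on a separate machine would call this against the timestamp of the last Discord post.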
DNS Sync
Our DNS infrastructure uses three nameservers (NS1, NS2, NS3) across separate locations. When we update a zone on NS1, we need to propagate the change to NS2 and NS3.
We built spark-sync: an inotify-driven service that watches the zone directory on NS1 for changes. When it detects a modification, it rsyncs the updated zones to NS2 and NS3 over WireGuard, confirms with a hash check, and logs the result.
Zone propagation to all nameservers typically happens within 5 seconds of a change on NS1.
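The push-and-verify step can be sketched as below. The paths, the SSH transport for the remote hash, and the function names are our assumptions for illustration; the real service is inotify-driven, while this sketch shows only what happens once a change has been detected:

```python
import hashlib
import subprocess
from pathlib import Path

def zone_hash(path: Path) -> str:
    """SHA-256 of a zone file, used to confirm each secondary matches NS1."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def push_zone(zone_file: Path, remote: str) -> bool:
    """rsync one changed zone to a secondary, then confirm with a hash check.

    The zone directory path is hypothetical; transport runs over WireGuard
    in the real setup.
    """
    subprocess.run(
        ["rsync", "-az", str(zone_file), f"{remote}:/etc/bind/zones/"],
        check=True,
    )
    remote_hash = subprocess.run(
        ["ssh", remote, f"sha256sum /etc/bind/zones/{zone_file.name}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()[0]
    return remote_hash == zone_hash(zone_file)
```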
Alerting
All monitoring systems route alerts to Discord. We use separate channels for different severity levels:
- #alerts: service failures, Guardian ban spikes, Sentinel silence
- #health: Watchman cycle summaries, DNS sync results
- #infra: deployment events, certificate renewals, server maintenance
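The routing itself is deliberately simple — a severity-to-channel map, sketched here with illustrative severity names (the real systems post to these channels directly):

```python
CHANNELS = {
    "critical": "#alerts",   # service failures, Guardian ban spikes, Sentinel silence
    "health":   "#health",   # Watchman cycle summaries, DNS sync results
    "infra":    "#infra",    # deployments, certificate renewals, maintenance
}

def route(severity: str) -> str:
    """Pick the Discord channel for an event; anything unknown escalates to #alerts."""
    return CHANNELS.get(severity, "#alerts")
```

Defaulting unknown severities to the critical channel is the safe failure mode: a misclassified event gets seen rather than buried.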
This keeps critical alerts visible without noise. An engineer who gets paged because a certificate auto-renewed successfully is an engineer who starts ignoring pages.
flame doctor
We also have an on-demand diagnostic tool called flame doctor that checks the full platform state: network connectivity, service status, certificate validity, disk usage, fleet node reachability. We run it before and after maintenance windows.
The output is human-readable and designed to be actionable: each check is pass/fail with a clear description of what failed and how to fix it.
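The shape of such a check runner is straightforward. This is a sketch with one illustrative disk-usage check, not flame doctor's actual implementation:

```python
import shutil

def check_disk(path: str = "/", threshold: float = 0.9):
    """Pass/fail disk-usage check with an actionable failure message."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used < threshold:
        return True, f"disk usage {used:.0%} — OK"
    return False, f"disk usage {used:.0%} — over {threshold:.0%}; clear logs or grow the volume"

def run_checks(checks):
    """Run every (name, check) pair, print a pass/fail line each, return overall success."""
    ok = True
    for name, fn in checks:
        passed, detail = fn()
        print(f"[{'PASS' if passed else 'FAIL'}] {name}: {detail}")
        ok = ok and passed
    return ok
```

Each check returns a boolean plus a human-readable detail string, which is what makes the output actionable rather than just red or green.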
We built all of this rather than buying it because it's tightly integrated with our infrastructure. Our monitoring knows about our fleet topology, our service names, our alert channels. A generic monitoring service would have to be adapted to our setup. Our own tooling is the setup.