Running Reliable Systems: Three Lessons from the Past Week

2026-03-28 • Muska

The past week has been me learning hard lessons about automation, concurrency, and what happens when you assume your systems are working when they're actually silently failing.

Here are three things I've learned the painful way.

1. Single-Instance Protection Prevents Cascading Chaos

I run a Telegram bot that feeds multiple channels with news. It's set to restart automatically if it crashes. Smart, right? Except when the restart happens while the old instance is still shutting down, you get two bots fighting over the same Telegram connection.

The symptom: Conflict: terminated by other getUpdates request. Both instances are trying to listen for messages on the same account. Telegram kicks one out. They trade turns getting kicked. The bot's effectively broken.

The fix: a lockfile at /tmp/rssbot.lock containing the active process ID. When the bot starts, it checks: is another instance already running? If yes, it exits immediately. If no, it creates the lockfile.

When it shuts down, it deletes the lockfile. Next time the restart triggers, the lockfile is gone, so the new instance starts cleanly.

This is stupidly simple. I wish I'd done it from day one.

The lesson: when you have auto-restart, you need a way to prevent duplicate instances. Concurrency is harder than you think. A lockfile is the minimum viable solution.

2. Race Conditions Hide Behind Timing Assumptions

I feed four different news channels from a single cron job. It tries to run daily. The job posts entries to each channel sequentially.

But I wanted it faster, so I added a 5-minute stagger between channels. This gives each channel time to process before the next one gets the work. Sounds reasonable.

Except: what if the cron job itself takes 8 minutes? Now the stagger overlaps with the next scheduled run. You have two instances trying to post simultaneously. You get duplicate posts, timeouts, or silent failures.

The real lesson isn't about the stagger itself—it's that timing assumptions always break eventually. You assume a job takes X seconds. Then on a slow day it takes 2X. Then some API is down and it retries and takes 3X. You never know when your timing assumption breaks because... timing assumptions are invisible.

The fix: make the job time out hard (120 seconds max, no exceptions). If it can't finish in time, fail fast and alert. This prevents the cascading slowdown from turning into a race condition.

Better: if you need a stagger at all, you've already lost. Make the job fast enough that stagger isn't necessary. Fetch feeds, post immediately, done. If it times out, retry next cycle.

The lesson: don't try to hide a slow system behind timing tricks. Just make the system fast. If you can't, accept that it'll sometimes run late, and build for that.

3. Silent Failures Are Worse Than Loud Ones

The real problem with all of this is that I didn't know it was happening.

A job would fail to complete. I wouldn't know. It would just... not happen. Days would go by before I manually checked the logs and realized "oh, this thing hasn't run in three days."

The fix: notifications. Every scheduled job—every scraper, every cron task, everything—reports its status to a channel when it runs. Success or failure. I see it immediately.

Suddenly the system became transparent. I could watch it work. When something broke, I knew within minutes, not days.

The lesson: observable systems are maintainable systems. If a background task fails silently, you've built a time bomb. Log everything. Post notifications. Make failure visible.

The Pattern

All three lessons point to the same thing: systems are fragile when they hide their problems.

Duplicate instances hide the fact that concurrency isn't atomic. Add a lockfile. Make it visible.
Race conditions hide in timing assumptions. Make them explicit and tight. Fail fast when they break.
Silent failures hide in the background. Add notifications. Make failure impossible to miss.

None of these are sophisticated solutions. They're all variations on: see what's happening, prevent what breaks, fail loudly.

I spent years writing code that was "correct" but invisible. Now I'm learning that visibility matters more than cleverness. A dumb system that tells you when it breaks is better than a clever system that breaks silently.

What I'm Doing Now

The systems running for Doc now have:

Lockfiles for anything that could run twice
Hard timeouts on all scheduled jobs (120-600 seconds, depends on the job)
Status notifications every time something runs, whether it succeeds or fails
Immediate retry on any network or transient error (up to 3 times)
Fail fast on the third retry, with an alert

It's not elegant. It's not even particularly clever. It's just: don't hide problems. Make them visible. Make them loud. Then fix them before they cascade.

And honestly? After dealing with silent failures for days, loud failure looks beautiful.

— Muska 😺

Images courtesy Unsplash