
Building Bot Infrastructure That Doesn’t Catch Fire

📖 7 min read · 1,381 words · Updated Mar 30, 2026


That time a “simple” bot melted a database

A few years ago I pushed what I thought was a tiny feature to a production bot. It just needed to backfill some missing user metadata. No big deal. Ten minutes later, the primary database was crying, alerts were blowing up my phone, and the bot was in a restart loop.

The bug wasn’t in the business logic. The bug was in the infrastructure I didn’t bother to design.

I’d shipped a bot that talked to the database directly, in-line, on every message. No queue. No rate limits. No back pressure. When one external API slowed down, workers piled up, connection counts spiked, and the DB got punched in the face.

That was the day I stopped thinking “eh, it’s just a bot” and started treating bots like any other production system that can absolutely wreck your stack if you’re careless.

If you’re building bots that real users depend on, you need infrastructure. Not giant-kubernetes-cathedral infrastructure. Just the right pieces, wired correctly, so the bot doesn’t fall over the second it gets popular or an API endpoint misbehaves.

The minimum viable bot architecture that actually works

Let’s talk core pieces. When I say “bot infrastructure,” I mean the stuff that lets your bot:

  • Survive traffic spikes
  • Handle slow or flaky external APIs
  • Recover from bugs without manual babysitting
  • Tell you what’s broken before users do

You don’t need twenty services to do that. You need a handful of boring ones, used consistently.

1. Stateless entrypoint

Your bot’s entrypoint (webhook/HTTP endpoint/poller) should be as stateless and quick as possible.

  • Receive event (Slack, Discord, WhatsApp, whatever).
  • Validate it (signature, timestamp).
  • Drop it on a queue.
  • ACK the platform fast.

On one Slack bot, we kept the HTTP response time under 150ms consistently by doing almost nothing in the request handler. All the expensive work went to workers. That alone eliminated 90% of the random timeout issues people blame on “Slack being weird.”
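The validate-enqueue-ACK flow fits in a few lines. Here's a minimal stdlib sketch, not production code: `SIGNING_SECRET` and the list-based queue are placeholders for your real secret store and queue client, but the HMAC check follows Slack's documented v0 signature scheme.

```python
import hashlib
import hmac
import time

# Placeholder; in practice this comes from your secrets manager.
SIGNING_SECRET = b"my-signing-secret"

def verify_slack_signature(body: bytes, timestamp: str, signature: str,
                           max_age: int = 300) -> bool:
    """Validate a Slack request signature (v0 scheme) and reject stale events."""
    if abs(time.time() - int(timestamp)) > max_age:
        return False  # replay protection
    base = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(SIGNING_SECRET, base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def handle_webhook(body: bytes, timestamp: str, signature: str, queue: list) -> int:
    """Do as little as possible: validate, enqueue, return 200 immediately."""
    if not verify_slack_signature(body, timestamp, signature):
        return 401
    queue.append(body)  # stand-in for a real queue publish
    return 200
```

All the expensive work happens elsewhere; the handler's only jobs are rejecting forgeries and getting the event onto the queue.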

2. A real queue, not “just a goroutine”

I like RabbitMQ and Redis (via something like bullmq or rq) for most bots. AWS SQS is fine too. Pick one and commit.

Your queue is where you:

  • Decouple ingestion from processing
  • Control concurrency (max in-flight jobs)
  • Handle retries and dead letters

For a bot I shipped in 2023 that processed GitHub webhooks, we started with no queue “to keep it simple.” At 3k events/minute during a big CI outage, the app servers fell over because every request tried to do everything. We moved to SQS + worker pool, capped at 50 concurrent jobs, and the same load barely touched CPU.
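The "capped worker pool" shape is easy to sketch in-process with nothing but the stdlib; the same structure applies when the queue is SQS or Redis instead of a `queue.Queue`. `handler` here is a stand-in for your actual job logic.

```python
import queue
import threading

def run_worker_pool(jobs, handler, max_workers=5):
    """Drain a queue with a hard cap on concurrency (max_workers in-flight jobs)."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            out = handler(job)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The key property: no matter how many events arrive, only `max_workers` jobs run at once, so a traffic spike lengthens the queue instead of flattening your servers.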

3. Worker processes that you can scale and kill

A worker should do one thing: pull messages from the queue and process them. That’s it. No serving HTTP. No mixing roles.

A sane worker:

  • Has a clear concurrency limit (e.g., 20 jobs per pod)
  • Implements timeouts on external calls (e.g., 3–5 seconds)
  • Uses circuit breaker logic to back off on persistent failures
  • Emits metrics per job type

When CPU spikes, you add more workers. When something is broken, you can scale workers down to zero without touching the ingestion endpoint.
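The circuit-breaker piece can be sketched with a simple consecutive-failure policy; libraries like pybreaker do this more thoroughly, but the core idea fits in one small class. The clock is injected so the behavior is testable.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe after a cooldown."""
    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe request through
        return False  # open: fail fast, don't hammer the broken service

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

Workers check `allow()` before each external call and fail the job fast when the breaker is open, instead of stacking up blocked threads.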

4. Storage that matches the bot’s actual use

Bots usually need three kinds of storage:

  • Config/state (user preferences, tokens) → Postgres, DynamoDB, whatever you actually know how to operate.
  • Sessions (conversation context) → Redis with TTLs or your vector store if it’s LLM-heavy.
  • Logs/events (for analysis and debugging) → ClickHouse, BigQuery, or even S3 + parquet.

The mistake is throwing everything into one database. That’s how you end up with a slow, expensive, mystery box that nobody wants to touch.
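For the session tier, the behavior you actually want from Redis is just "set with a TTL, get if not expired." A toy in-memory version makes the contract explicit (and is handy in tests); the injected clock is only there so expiry is controllable.

```python
import time

class SessionStore:
    """In-memory stand-in for Redis SETEX/GET: conversation context with a TTL."""
    def __init__(self, ttl=1800.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}  # key -> (value, stored_at)

    def set(self, key, value):
        self._data[key] = (value, self.clock())

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if self.clock() - stored_at >= self.ttl:
            del self._data[key]  # lazy expiry on read
            return default
        return value
```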

Failures you will hit (and how not to panic)

Every bot that survives more than a month in production runs into the same handful of failure classes. You can either prepare for them now, or debug them at 3AM while staring at a Grafana dashboard with shaking hands.

1. External APIs go slow or down

If your bot calls Slack, Discord, OpenAI, or some third-party CRM: assume they will absolutely misbehave.

  • Use timeouts for every request.
  • Implement retries with jitter and exponential backoff.
  • Use a circuit breaker so you stop hammering a broken service.

On a Telegram bot that calls OpenAI’s API, we cap concurrent OpenAI calls at 40 per region, and we do 3 retries with backoff (250ms, 1s, 4s). During one regional outage in 2024, requests slowed to 20+ seconds, but the rest of the system stayed alive because we weren’t piling up thousands of blocked workers.
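That 250ms/1s/4s ladder is just exponential backoff with a base of 0.25s and a factor of 4, plus a little jitter so a fleet of workers doesn't retry in lockstep. A small sketch, with `sleep` injected for testability:

```python
import random
import time

def retry_with_backoff(call, retries=3, base=0.25, factor=4.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff (0.25s, 1s, 4s) plus jitter."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries; let the caller dead-letter the job
            delay = base * (factor ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # jitter avoids thundering herds
```

Always pair this with per-request timeouts: retrying a call that never times out just multiplies the damage.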

2. Message storms and retries from the platform

Slack, WhatsApp, and others will retry webhooks if your endpoint is slow or returns 5xx. If your bot is slow and your handler isn’t idempotent, you get double-processing. Or triple.

Fix it by:

  • Making handlers idempotent using a dedupe key (request ID, message ID).
  • Storing “processed” markers in a fast store (Redis set with TTL is fine).
  • Logging duplicates so you can actually see them happen.
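The dedupe check maps onto Redis as a `SET key NX EX ttl`; an in-memory stand-in shows the logic (again with an injectable clock so the TTL is testable):

```python
import time

class Deduper:
    """Tracks processed event IDs with a TTL, like a Redis SET NX + EXPIRE."""
    def __init__(self, ttl=3600.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.seen = {}  # event_id -> first-seen timestamp

    def first_time(self, event_id: str) -> bool:
        now = self.clock()
        # drop expired markers so the set doesn't grow forever
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return False  # duplicate delivery: skip (and log it)
        self.seen[event_id] = now
        return True
```

Workers call `first_time(message_id)` before doing any work; a `False` means the platform retried a delivery you already handled.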

3. Runaway jobs

A job that never finishes is worse than a job that fails fast. It eats concurrency and quietly starves the queue.

Use:

  • Per-job hard timeouts (e.g., 30 seconds for normal jobs, 5 minutes for heavy ones).
  • Cancellation hooks where your HTTP client or LLM client respects context cancellation.
  • Max retries (e.g., 5) before sending to dead-letter queue.
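One stdlib way to enforce a hard deadline is `concurrent.futures` with a `result(timeout=...)`. Caveat worth stating in the code: Python can't kill a running thread, so the job itself still needs to honor cancellation (context timeouts on HTTP/LLM clients); this wrapper just guarantees the *caller* stops waiting and can dead-letter.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as JobTimeout

def run_with_timeout(job, timeout=30.0):
    """Stop waiting on a job after `timeout` seconds; raises JobTimeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(job)
    try:
        return future.result(timeout=timeout)
    finally:
        # wait=False so a hung job doesn't block shutdown; the thread itself
        # is NOT killed -- the job must respect its own cancellation/timeouts.
        pool.shutdown(wait=False)
```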

Observability: how you know the bot is dying before users do

If you don’t have visibility, you don’t have infrastructure. You have vibes.

At minimum, you need three things:

1. Metrics

Prometheus + Grafana is still my default choice. Track:

  • Queue depth per job type
  • Job processing latency (p50/p95)
  • Error rate per integration (Slack, OpenAI, CRM, etc.)
  • Rate of dead-lettered jobs

A simple alert like “queue depth > 10,000 for 5 minutes” has saved me more times than any complicated anomaly detection.
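That alert is just a streak check. Assuming one depth sample per minute, the logic looks like this (in Prometheus you'd express the same thing as a `for: 5m` clause on the alert rule):

```python
def should_alert(depth_samples, threshold=10_000, sustained=5):
    """Fire when queue depth stays above threshold for `sustained` consecutive samples."""
    streak = 0
    for depth in depth_samples:
        streak = streak + 1 if depth > threshold else 0
        if streak >= sustained:
            return True
    return False
```

The `sustained` window is what keeps a single spiky sample from paging you at 3AM.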

2. Logs

Use structured logs. I don’t care if it’s Loki, ELK, or Datadog. Just:

  • Include correlation IDs (per user, per conversation, per job).
  • Log external API responses when they fail (at least status + error).
  • Log retries and backoff decisions.
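You don't need a logging framework to get structured logs with correlation IDs; a small JSON formatter on the stdlib logger does it. A sketch (real setups would add timestamps, job type, and so on):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the correlation ID."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

def make_logger():
    logger = logging.getLogger("bot")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# Usage: pass the ID via `extra` so it rides along on the record.
# make_logger().info("slack call failed", extra={"correlation_id": "conv-123"})
```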

3. Traces (optional but very nice)

OpenTelemetry is actually worth it for bots that touch multiple services. Tracing a single user message from webhook → queue → worker → external API → response makes debugging 10x faster.

What I’d build if I were starting a bot today

If you told me “I’m starting a bot project this weekend, help me not regret my life in three months,” here’s what I’d suggest:

  • HTTP entrypoint: FastAPI / Express / Go net/http behind a basic load balancer.
  • Queue: Redis + bullmq (Node) or rq/dramatiq (Python). Or SQS if you live on AWS.
  • Workers: separate deployment/process, horizontal autoscaling based on queue depth.
  • Storage: Postgres for config, Redis for sessions, S3 for long-term logs or transcripts.
  • Metrics: Prometheus + Grafana, alerts wired on queue depth and error rate.
  • Secrets: Vault or your cloud’s secret manager. Never hardcode tokens. Ever.

None of this is fancy. That’s the point. Bots don’t fail because they lack some new shiny technology. They fail because the basics are ignored.

If you treat your bot like a toy, your users will feel it. If you treat it like any other production service, you can sleep through the night instead of debugging Slack retries in bed.

FAQ

How many workers do I actually need for my bot?

Start small. Take your average job time and your expected peak messages per second. If jobs take 200ms and you expect 20 messages/sec, that’s 4 concurrent jobs. Multiply by 2–3x for safety, so maybe 10 workers. Then watch CPU, latency, and queue depth, and adjust based on real data.
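That back-of-the-envelope math is Little's law: required concurrency equals arrival rate times service time, padded by a safety factor. As a tiny helper (the 2.5 default is my own middle-of-the-road pick from the 2–3x range above):

```python
import math

def workers_needed(avg_job_seconds, peak_msgs_per_sec, safety=2.5):
    """Little's law sizing: concurrency = rate x service time, times a safety factor."""
    concurrent = peak_msgs_per_sec * avg_job_seconds
    return math.ceil(concurrent * safety)
```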

Do I really need a queue for a low-traffic bot?

Under 5 messages per minute and no heavy external calls? You can get away without one. The second you add expensive operations (LLM calls, DB writes, third-party APIs), add a queue. It’s cheaper than debugging weird timeouts and double-processing later.

Is Kubernetes overkill for bot infrastructure?

For most bots, yes, at least at the start. A few Docker containers on ECS, Fly.io, or plain VMs with systemd are fine. Kubernetes starts to make sense when you have multiple bots, shared services, or strict multi-tenant isolation. Until then, keep it boring and simple.


🛠️
Written by Jake Chen

Full-stack developer specializing in bot frameworks and APIs. Open-source contributor with 2000+ GitHub stars.
