RabbitMQ was Never Our Job Runner. We Stopped Mixing Communication with Execution.

Designing Production Systems by Separating Communication Reliability From Execution Reliability.


Hello, my name is Ahmed Eid (eidoox), a Senior Backend Engineer with over 3 years of experience building reliable, scalable, distributed, and real-time systems.

In this Production Insight, I want to share the lessons learned from designing event-driven systems in production environments, and why we stopped treating RabbitMQ consumers as places where business execution happens.

This is not another comparison between RabbitMQ and BullMQ, or a tutorial on how to publish messages and process jobs. It is about understanding a production lesson that becomes obvious only at scale: communication reliability and execution reliability are two completely different engineering problems.

A RabbitMQ message being successfully consumed does not necessarily mean the business work has successfully completed. A consumer may acknowledge the message, while a downstream API times out, a PDF generation process fails, an AI task crashes after several minutes, or a third-party dependency becomes unavailable.

Production systems must distinguish between "an event was delivered" and "a piece of business work was successfully executed."

In this insight, I will explore the architectural decision to use RabbitMQ as the communication layer and BullMQ as the execution layer, how this separation improves reliability, retries, concurrency control, and operational visibility, and why mixing these responsibilities can create hidden failures that only appear under real production pressure.

You can learn more about my background through my Personal Website.

Consuming a Message Does Not Mean the Work Is Finished

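Many systems start with a simple architecture: a producer publishes an event to RabbitMQ, and the consumer that receives it runs the business operation directly.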

The assumption is: the message reached a consumer, so the system successfully handled the business operation. But in production, this assumption breaks down.

A consumer may receive the message and start processing:

  • The payment provider responds after 40 seconds.

  • The PDF generation service runs out of memory.

  • The email provider is unavailable.

  • The process crashes after partially completing work.

  • A deployment restarts the consumer in the middle of execution.

Now you face difficult questions:

  • Should RabbitMQ keep the message unacknowledged for 30 minutes?

  • How many consumers can execute heavy tasks before exhausting CPU or memory?

  • How do you track the progress of a 10-minute AI processing task?

  • How do you apply different retry strategies for different failures?

  • How do you prevent a failed execution from blocking new messages?

The problem is not RabbitMQ.

RabbitMQ successfully solved the communication problem.

The mistake was asking the communication layer to also become the execution manager.

Separating Communication Reliability From Execution Reliability

Production architecture: RabbitMQ as the communication layer, BullMQ as the execution layer.

RabbitMQ guarantees:

  • Events are delivered to target services.

  • Services remain decoupled.

  • Producers do not need to know consumers.

BullMQ guarantees:

  • Jobs survive worker crashes.

  • Retries with exponential backoff.

  • Delayed execution.

  • Concurrency control.

  • Rate limiting.

  • Job progress tracking.

  • Separate queues for different workloads.

The architecture creates two independent reliability boundaries.
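To make the retry behaviour on the execution side concrete, here is a minimal sketch of the delay schedule an exponential backoff produces. The `attempts` and `backoff` fields are modelled on the shape of BullMQ job options as I understand them; the `exponentialDelay` helper is hypothetical and assumes the delay doubles on each retry, starting from the base delay.

```typescript
// Shape modelled on BullMQ job options; the values are illustrative only.
const invoiceJobOptions = {
  attempts: 5, // total attempts, including the first run
  backoff: { type: "exponential", delay: 1000 }, // base delay in ms
};

// Hypothetical helper: delay before retry number `retry` (1-based),
// assuming the delay doubles each time, starting from the base delay.
function exponentialDelay(retry: number, baseDelayMs: number): number {
  return Math.round(Math.pow(2, retry - 1) * baseDelayMs);
}

// Five attempts means one initial run plus up to four retries.
const schedule: number[] = [];
for (let retry = 1; retry < invoiceJobOptions.attempts; retry++) {
  schedule.push(exponentialDelay(retry, invoiceJobOptions.backoff.delay));
}

console.log(schedule.join(", ")); // 1000, 2000, 4000, 8000
```

The point is that this schedule belongs to the job, not to the broker: the RabbitMQ message was acknowledged long before the second attempt runs.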

A Real Production Example: Invoice Generation After an Order

Direct RabbitMQ execution
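A minimal sketch of what direct execution inside a consumer might look like. The names (`OrderCreated`, `InvoiceDeps`, `handleOrderCreated`) and the three steps are hypothetical, inferred from the failure cases in this section, and the code is synchronous for brevity; a real handler would be async and receive the message from a RabbitMQ client.

```typescript
// Hypothetical event and dependencies; none of these names come from a real library.
interface OrderCreated {
  orderId: string;
  customerEmail: string;
}

interface InvoiceDeps {
  renderPdf(orderId: string): string; // returns the PDF contents
  uploadToStorage(key: string, body: string): string; // returns a URL
  sendEmail(to: string, link: string): void;
  ack(): void; // acknowledges the broker message
}

// The consumer runs the whole workflow inline and acknowledges at the end.
// A throw between steps leaves partial work behind and an unacknowledged message.
function handleOrderCreated(event: OrderCreated, deps: InvoiceDeps): void {
  const pdf = deps.renderPdf(event.orderId);
  const url = deps.uploadToStorage("invoices/" + event.orderId + ".pdf", pdf);
  deps.sendEmail(event.customerEmail, url);
  deps.ack();
}
```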

Looks simple.

But what happens if:

  • S3 is temporarily unavailable?

  • Email service rate limits you?

  • The worker crashes after uploading but before sending the email?

  • The operation takes several minutes?

Now your messaging consumer is responsible for managing a complete workflow.

Separating execution
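A minimal sketch of the separated version, assuming a hypothetical `JobQueue` interface in place of a real BullMQ queue; the in-memory implementation exists only to make the example runnable. The consumer's only responsibility is to turn the event into a job and acknowledge the message.

```typescript
// Hypothetical types; a real system would use a BullMQ queue backed by Redis.
interface OrderCreated {
  orderId: string;
  customerEmail: string;
}

interface JobQueue<T> {
  // Returns false when a job with the same id already exists (duplicate delivery).
  add(jobId: string, data: T): boolean;
}

class InMemoryQueue<T> implements JobQueue<T> {
  jobs: Record<string, T> = {};

  add(jobId: string, data: T): boolean {
    if (Object.prototype.hasOwnProperty.call(this.jobs, jobId)) return false;
    this.jobs[jobId] = data;
    return true;
  }
}

function onOrderCreated(
  event: OrderCreated,
  queue: JobQueue<OrderCreated>,
  ack: () => void,
): void {
  // Deterministic job id: a redelivered message maps to the same job.
  queue.add("invoice-" + event.orderId, event);
  // Acknowledge only after the job has been stored.
  ack();
}
```

The deterministic job id is a design choice, not a requirement: it keeps a redelivered event from creating a second invoice job. My understanding is that BullMQ supports a similar pattern through a custom `jobId` in the job options, but check the documentation for the version you run.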

The consumer finishes in milliseconds.

The business workflow has its own lifecycle.
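One way to picture that lifecycle is a worker that checkpoints each completed step, so a retry resumes where the previous attempt stopped instead of repeating the upload. This is a sketch under assumptions: `Step`, `InvoiceState`, and `runInvoiceJob` are hypothetical names, the state is held in memory, and a real worker would persist it with the job.

```typescript
type Step = "render" | "upload" | "email";

// Hypothetical per-job state; in a real system this would be stored with the job.
interface InvoiceState {
  done: Step[];
}

// Runs each step that has not been checkpointed yet.
// A throw leaves the checkpoint list intact, so the next attempt resumes from it.
function runInvoiceJob(state: InvoiceState, steps: Record<Step, () => void>): void {
  const order: Step[] = ["render", "upload", "email"];
  for (const step of order) {
    if (state.done.indexOf(step) !== -1) continue; // finished in an earlier attempt
    steps[step](); // may throw; the job runner would then schedule a retry
    state.done.push(step); // checkpoint after the step succeeds
  }
}
```

A crash between running a step and recording its checkpoint can still repeat that step, so the steps themselves should be safe to run twice.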

Decoupling Infrastructure Health From Business Outcome Correctness

In distributed systems, we often assume that if a message has been processed successfully, then the business operation has also succeeded.

This assumption is dangerously misleading.

At the infrastructure level (RabbitMQ, Redis, and queue metrics), everything can appear perfectly healthy:

  • Queue length: stable

  • Consumer ACK rate: 100%

  • No errors in logs

  • Throughput: normal

  • No visible downtime

From the system’s perspective, nothing is wrong.

But when we shift to the business layer, a very different picture often emerges:

  • Orders not confirmed

  • Emails not delivered

  • Payments not finalized

  • Notifications missing

Why This Happens in Production Systems

Message brokers are designed to guarantee delivery and acknowledgment, not business correctness.

They answer questions like:

  • Was the message delivered to a consumer?

  • Did the consumer acknowledge it?

They do NOT answer:

  • Did the payment succeed?

  • Did the email actually send?

  • Did the full workflow complete successfully?

  • Did all downstream dependencies succeed?

This creates a gap between "message processed successfully" and "business operation completed successfully."

The most dangerous failures in production are not system crashes or outages.

They are silent partial failures where:

  • The infrastructure reports success and looks healthy.

  • But the business state is incomplete or inconsistent.
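One way to surface such silent partial failures is to record the business outcome separately from the message acknowledgment and periodically look for outcomes that never completed. The record shape, field names, and time threshold below are hypothetical.

```typescript
// Hypothetical record kept per order, alongside the usual broker metrics.
interface OutcomeRecord {
  orderId: string;
  messageAcked: boolean; // what the infrastructure metrics see
  outcomeCompleted: boolean; // what the business cares about
  ackedAtMs: number; // when the message was acknowledged
}

// Returns the orders whose message was acknowledged but whose business
// outcome has not completed within maxAgeMs.
function findSilentFailures(
  records: OutcomeRecord[],
  nowMs: number,
  maxAgeMs: number,
): string[] {
  return records
    .filter((r) => r.messageAcked && !r.outcomeCompleted && nowMs - r.ackedAtMs > maxAgeMs)
    .map((r) => r.orderId);
}
```

A check like this can back an alert that fires on business state, which stays quiet while the ACK rate is still 100%.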

When Should You Execute Directly with RabbitMQ Consumer?

Not every message in a production system needs to go through BullMQ.

Direct consumer execution is completely valid for:

  • Cache invalidation.

  • Updating read models.

  • Small database updates.

  • Fast operations.

For these operations, usually:

  • Execution time is milliseconds.

  • Failures are simple.

  • Retries are straightforward.
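This rule of thumb can be written down as an explicit routing decision. The profile fields are my interpretation of the three criteria above, and the 100 ms default threshold is a hypothetical value; use whatever your own measurements justify.

```typescript
// Hypothetical description of a message handler.
interface HandlerProfile {
  expectedDurationMs: number; // "execution time is milliseconds"
  hasExternalDependencies: boolean; // a stand-in for "failures are simple"
  needsCustomRetry: boolean; // "retries are straightforward" when false
}

type Execution = "inline-consumer" | "job-queue";

// Runs inline only when the handler is fast, self-contained, and easy to retry.
function chooseExecution(p: HandlerProfile, fastThresholdMs: number = 100): Execution {
  const isFast = p.expectedDurationMs <= fastThresholdMs;
  const isSimple = !p.hasExternalDependencies && !p.needsCustomRetry;
  return isFast && isSimple ? "inline-consumer" : "job-queue";
}
```

For example, a cache invalidation with a 5 ms expected duration and no external calls stays in the consumer, while invoice generation goes to the job queue.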

RabbitMQ was our communication contract between services. Once we separated event delivery from business execution, we gained independent control over retries, concurrency, failures, and operational visibility.

The lesson I learned the hard way over the last 3 years: a production system is not reliable just because messages are delivered or consumed.
It is reliable only when the business outcome tied to each message is eventually and consistently completed.
This is why infrastructure health metrics alone are insufficient to measure system correctness.

Real observability must evolve from "Did the message get processed?" to "Did the business outcome complete successfully?"


If you found this insightful, follow me for more production lessons from real distributed systems. I’ll be sharing deeper architectural insights, failures we only learn after hitting production scale, and the engineering decisions that don’t usually make it into tutorials.

More production insights and distributed systems technical articles coming soon.

Production Insights

Part 1 of 2

Production Insights is a collection of engineering lessons learned from real-world challenges I encountered while designing, building, and operating production systems. The goal is to share the reasoning behind architectural decisions, the trade-offs involved, and the practical approaches used to address reliability, scalability, observability, and operational challenges in production environments.
