Engineering · Distributed systems · Messaging
Redis Streams vs Kafka vs RabbitMQ: three brokers I've shipped, when I reach for each
I've run all three in production. Kafka powers SafeBoda's earnings and SOS pipelines. Redis Streams is the eventing backbone of InstaEscrow. RabbitMQ has been the work-queue layer in two contracts in between. They are not interchangeable. Here's how I decide.
14 October 2025 · 10 min read
Every senior engineer eventually has the “which message broker should we use” conversation. The answer in most teams is one of:
- What do we already run? (Often correct.)
- What does the loudest engineer like? (Often Kafka.)
- What does the cloud provider give us as a managed service? (Often the right answer for under-resourced teams.)
None of those engage with the actual question, which is: what shape is your messaging workload, and which broker fits that shape? After running all three in production for years, I have strong opinions about the shapes, and almost no opinion at all about which broker is “best.” Best is workload-shaped.
The three workload shapes
Shape A: high-throughput event streaming with replay
You have a firehose of events that many independent consumers want to read, possibly at different speeds, possibly historically. SafeBoda's ride events are this shape. A single trip emits twenty events; downstream consumers (driver earnings, loyalty, fraud, BI, ops dashboards, audit log) all want their own independent slice, at their own pace, with the option to rewind to last Tuesday and recompute.
Properties that matter:
- Long retention. Days to weeks of events kept on disk.
- Multiple consumer groups. Each group keeps its own offset.
- Partitioned ordering. Order matters within a partition (per trip), not across them.
- Throughput. Tens or hundreds of thousands of events per second sustained.
- Backwards replay. Bug shipped Tuesday, reprocess from Monday.
Use Kafka. This is the shape Kafka was built for. The partitioned-log model gives you ordered replay, the consumer group abstraction handles independent fan-out elegantly, and the operational story for retention is mature. The cost is real. You run Kafka brokers (or pay for managed Confluent / MSK / Redpanda), you maintain Schema Registry, you accept that your messaging subsystem is a meaningful piece of infrastructure.
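To make the shape concrete, here's a toy model of Kafka's two core abstractions in plain Python (not real client code, and the names are mine): a topic is a set of ordered partitions, events with the same key always land in the same partition, and each consumer group tracks its own offsets, so replay is just rewinding one group without touching the others.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy model of a Kafka topic: ordered partitions plus
    per-consumer-group offsets. Not a real client; illustration only."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[group][partition] -> next index that group will read
        self.offsets = defaultdict(lambda: defaultdict(int))

    def partition_for(self, key):
        # Same key -> same partition, so per-key (per-trip) order holds.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def produce(self, key, event):
        self.partitions[self.partition_for(key)].append(event)

    def poll(self, group, partition):
        off = self.offsets[group][partition]
        batch = self.partitions[partition][off:]
        self.offsets[group][partition] = len(self.partitions[partition])
        return batch

    def seek(self, group, partition, offset):
        # "Bug shipped Tuesday, reprocess from Monday": rewind one
        # group's offset; every other group is unaffected.
        self.offsets[group][partition] = offset
```

The point of the sketch is the independence: "ledger" and "analytics" read the same partition at their own pace, and seeking "ledger" back to offset 0 replays history for it alone.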
Shape B: per-aggregate eventing with replay for clients
You have entities (users, orders, escrows) that emit events belonging to that entity. Real-time UIs subscribe to events for the entities they care about. Reconnects need replay. This is InstaEscrow's shape: every escrow, every user has a timeline; clients connect via SSE and need to catch up after a backgrounded phone reconnects.
Properties that matter:
- Stream-per-aggregate. Possibly thousands of streams, each small.
- Per-stream replay. Give me everything since event ID X.
- Modest retention. Minutes to days, not weeks.
- Operational simplicity. Solo team, no Kafka cluster appetite.
- Cohabitation with cache + rate-limit infra. Redis is already there.
Use Redis Streams. The thousand-streams pattern kills Kafka. Kafka is built around tens to hundreds of partitions, not thousands of independent topics. Redis Streams happily holds millions of small streams as long as you keep total memory in check. XREAD BLOCK + per-event monotonic IDs gives you the SSE replay primitive for free. And you almost certainly already run Redis. (I wrote a full essay on why I picked Streams over Phoenix PubSub for InstaEscrow specifically; read that for the architectural details.)
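The replay primitive is simple enough to sketch. Here's a toy in-memory version of the pattern (real code would use redis-py's XADD and XREAD against an actual stream): entries get monotonically increasing IDs, and a reconnecting SSE client sends its Last-Event-ID so the server can hand back exactly what it missed.

```python
import itertools


class TinyStream:
    """Toy model of a Redis Stream: append-only entries with
    monotonically increasing IDs, readable from any past ID.
    Illustration only -- real code uses XADD / XREAD."""

    def __init__(self):
        self._seq = itertools.count(1)
        self.entries = []  # list of (entry_id, payload)

    def xadd(self, payload):
        entry_id = next(self._seq)
        self.entries.append((entry_id, payload))
        return entry_id

    def xread(self, last_id=0):
        # Everything strictly after last_id: the SSE reconnect
        # primitive. A backgrounded phone wakes up, sends its
        # Last-Event-ID, and catches up from exactly there.
        return [(i, p) for i, p in self.entries if i > last_id]
```

One of these per aggregate (per user, per escrow) is the thousands-of-tiny-streams shape; each stream stays small because retention is minutes to days, not weeks.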
Shape C: work distribution with rich routing
You have units of work (send this email, render this PDF, charge this card, transcode this video) that need to be dispatched to a pool of workers, retried on failure, deduplicated, prioritized, delayed, dead-lettered. The system is a job queue, not an event log.
Properties that matter:
- Work distribution semantics. Competing consumers, ack-or-requeue, prefetch.
- Rich routing. Topic exchanges, headers, fan-out, direct.
- Per-message TTL, delays, priorities, DLQ.
- Acknowledgement-based reliability, not log-based replay.
- RPC-over-message-bus sometimes (less common now, but historically common).
Use RabbitMQ. This is what AMQP was designed for. Work queues, dead-letter exchanges, delayed messages, fanout routing, all first-class. RabbitMQ also handles smaller volumes (tens of thousands per second comfortably, hundreds of thousands with care) more gracefully than Kafka, which has a real per-partition cost.
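"Rich routing" is worth seeing concretely. A topic exchange matches a message's routing key against binding patterns where `*` matches exactly one dot-separated word and `#` matches zero or more. A minimal sketch of that matching rule, in Python rather than inside a broker:

```python
def topic_match(pattern, routing_key):
    """Match an AMQP topic-exchange binding pattern against a routing
    key. '*' matches exactly one dot-separated word; '#' matches zero
    or more words. Sketch of the rule, not broker code."""

    def match(pw, kw):
        if not pw:
            return not kw  # pattern exhausted: key must be too
        if pw[0] == "#":
            # '#' may absorb zero or more words; try every split.
            return any(match(pw[1:], kw[i:]) for i in range(len(kw) + 1))
        if not kw:
            return False
        if pw[0] == "*" or pw[0] == kw[0]:
            return match(pw[1:], kw[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))
```

A worker pool bound to `order.#` sees every order event; one bound to `order.*.created` sees only creations, one region deep. Neither Kafka topics nor Redis stream keys give you this kind of declarative fan-out.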
Caveat: in 2026 a lot of work-queue use cases are better served by queue libraries backed by a database you already run (Oban for Elixir on Postgres, Sidekiq for Ruby on Redis, BullMQ for Node on Redis, db-queue for everyone). If you don't need cross-language consumers and you already run Postgres or Redis, skip the broker entirely. I've replaced two RabbitMQ deployments with Oban-on-Postgres and not missed it.
Side-by-side, the way I actually decide
| Dimension | Kafka | Redis Streams | RabbitMQ |
|---|---|---|---|
| Model | Partitioned log | Per-key log | AMQP queues + exchanges |
| Throughput sweet spot | 10K – 1M+ msg/s | 1K – 100K msg/s | 1K – 100K msg/s |
| Retention | Days–weeks easily | Minutes–days (memory-bound) | Until acked |
| Replay | First-class (offset seek) | First-class (XREAD from ID) | Not really (acked = gone) |
| Ordering | Per partition | Per stream key | Per queue (lost on requeue) |
| Routing | Topic-only | Stream-key-only | Rich (direct, topic, headers, fanout) |
| Operational cost | High (broker cluster) | Low (already running Redis) | Medium (broker, quorum/mirrored queues) |
| Best for | Event streaming + analytics | Per-entity timelines + SSE | Work distribution + complex routing |
Three real workloads I'd map to each
SafeBoda earnings ledger → Kafka
Driver earnings come from a stream of events: trip completed, bonus applied, rating awarded, surge multiplier earned. We need to recompute earnings from history when a bug ships, retain events for thirty days for dispute resolution, and let three independent consumers (the ledger, the analytics warehouse, the driver app notification service) read at their own pace. Kafka. Specifically Kafka with three consumer groups against the same topic.
InstaEscrow per-user event timeline → Redis Streams
Each user has a stream user:{id}:events with their notifications, escrow updates, dispute timeline events. Mobile and web clients hold long-lived SSE connections to a Phoenix endpoint that pulls from the user's stream via XREAD BLOCK with replay-from-last-event-id on reconnect. Streams handles the thousands-of-tiny-streams shape Kafka would struggle with.
InstaEscrow notification dispatch → Postgres + Oban (not RabbitMQ)
“Send WhatsApp + SMS + push for this escrow event” is classic work-queue territory. RabbitMQ would work fine. But Postgres + Oban gives me transactional enqueue (the notification is enqueued in the same transaction as the state change, no possibility of phantom or missed sends), no extra broker to run, and excellent debugging from inside the database. The broker-replacement story for in-process queues backed by Postgres keeps getting stronger.
The decision tree, distilled
- Do you need a job queue (work distribution + retries + DLQ + delays)? → If you already run Postgres or Redis, use Oban / Sidekiq / BullMQ. If you need cross-language consumers and rich routing, RabbitMQ.
- Do you need an event log with replay across multiple consumer groups, sustained high throughput, and long retention? → Kafka.
- Do you need per-entity timelines with replay-on-reconnect for live clients, and you already run Redis? → Redis Streams.
- Are you a small team that doesn't want to run any of these, and your throughput is < 10K msg/s? → Skip the broker entirely. Postgres + a queue library handles your problem and gives you transactional semantics for free.
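The tree above is mechanical enough to write down as code. A sketch (the property names are mine, invented for the illustration):

```python
def pick_broker(workload):
    """Encode the decision tree above. `workload` is a dict of the
    properties the essay argues actually matter; key names are
    illustrative, not any real API."""
    if workload.get("job_queue"):
        if workload.get("cross_language") and workload.get("rich_routing"):
            return "RabbitMQ"
        return "Postgres/Redis queue library (Oban, Sidekiq, BullMQ)"
    if (
        workload.get("replay")
        and workload.get("multi_consumer_groups")
        and workload.get("msgs_per_sec", 0) >= 10_000
    ):
        return "Kafka"
    if workload.get("per_entity_timelines") and workload.get("runs_redis"):
        return "Redis Streams"
    # Small team, modest throughput: boring infrastructure wins.
    return "No broker yet: Postgres + a queue library"
```

Note the ordering: the job-queue question comes first because misclassifying a work queue as an event stream is the expensive mistake, not the other way around.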
The mistake I see most often
Adopting Kafka because it's “the right tool” when the actual workload is a work queue with maybe 5K msg/s. You'll spend three months on the broker, three months on Schema Registry, three months on partition strategy, and ship a third of the features you would have shipped if you'd started with Postgres + Oban and replaced it later when you actually had the throughput.
The corollary: the right answer to “which broker” is often none yet. Boring infrastructure carries you further than people give it credit for.