erick.africa

Engineering · Payments

Idempotent webhook reconciliation for M-Pesa

Why exactly-once is cheaper than retries-with-prayers when the counterparty is M-Pesa, plus the three primitives that get you there.

22 May 2025 · 7 min read

Building payments in Kenya means M-Pesa. M-Pesa means STK Push. STK Push means webhooks. Webhooks mean retries. Lots of them. If your callback is slow, M-Pesa retries. If your callback is fast but you returned a non-2xx, M-Pesa retries. If a network blip between Safaricom and your edge drops the ACK, M-Pesa retries anyway. The naive implementation double-credits buyer wallets. Once. Then it does it again.

Paystack, which sits in front of M-Pesa for InstaEscrow's card and mobile money flows, has the same retry semantics. After enough live traffic you stop trusting “the webhook fires once” and start designing for “the webhook fires N times, where N is unknown and sometimes large.”

The contract you actually need is exactly-once business effect: the user's wallet is credited exactly once, the order moves into the held state exactly once, the seller is notified exactly once. Not at-most-once (silent drops) or at-least-once (duplicate balance entries). Exactly-once on the effect, even though the transport is at-least-once.

Three primitives

1. The provider's reference is your idempotency key

Every webhook from Paystack and M-Pesa carries a unique reference the provider issued. For Paystack it's the transaction reference (reference). For raw M-Pesa it's the MerchantRequestID / CheckoutRequestID pair. Don't make up your own. The provider's ID is the only thing both you and they agree on.
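Pulling that reference out of a decoded payload might look like the sketch below. The module name, the provider atoms, and the payload shapes are assumptions for illustration (Paystack nesting the reference under "data", M-Pesa nesting both IDs under "Body" and "stkCallback"); check them against real payloads before relying on them.

```elixir
defmodule WebhookRef do
  @moduledoc """
  Hypothetical helper: derives the idempotency key from a decoded webhook body.
  The payload shapes matched below are assumptions, not taken from provider docs.
  """

  # Paystack: assumed to nest the transaction reference under "data".
  def provider_ref(:paystack, %{"data" => %{"reference" => ref}}) when is_binary(ref) do
    {:ok, ref}
  end

  # M-Pesa STK Push: assumed to nest both IDs under "Body" -> "stkCallback".
  # The pair is joined into one string so it fits a single provider_ref column.
  def provider_ref(:mpesa, %{"Body" => %{"stkCallback" => callback}}) do
    case callback do
      %{"MerchantRequestID" => merchant_id, "CheckoutRequestID" => checkout_id}
      when is_binary(merchant_id) and is_binary(checkout_id) ->
        {:ok, merchant_id <> ":" <> checkout_id}

      _ ->
        {:error, :missing_reference}
    end
  end

  # Anything else is rejected rather than given a made-up key.
  def provider_ref(_provider, _payload), do: {:error, :missing_reference}
end
```

Returning an error for a payload without a reference is deliberate: inventing a key at this point would defeat the purpose of using the provider's ID.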

Store it in a uniqueness-constrained table:

CREATE TABLE processed_webhooks (
  provider TEXT NOT NULL,
  provider_ref TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  payload JSONB NOT NULL,
  PRIMARY KEY (provider, provider_ref)
);

2. The unique-insert and the wallet mutation share a transaction

This is the move people miss. The webhook handler does two things: record the webhook, and apply its business effect (credit a wallet, move an escrow into funded, etc.). If you do these in two separate transactions you've created a window where the webhook is recorded but the effect didn't happen, or vice versa.

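Wrap both in one Postgres transaction. The INSERT INTO processed_webhooks runs against the primary key constraint. If it inserts a row, you proceed to mutate state. If the row already exists, on_conflict: :nothing turns the insert into a no-op, the handler skips the business effect, and you respond 200 OK without doing anything, because some prior request already did the work. If the business effect raises, the whole transaction rolls back, insert included, so the provider's next retry gets a clean attempt.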

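def handle_webhook(provider, payload) do
  # Assumes the reference sits at the top level of the payload, as in the
  # original sketch; adjust the lookup per provider.
  ref = payload["reference"]

  Repo.transaction(fn ->
    # insert_all returns {rows_inserted, nil}. With on_conflict: :nothing a
    # duplicate (provider, provider_ref) inserts zero rows instead of raising.
    # The table has no autogenerated id, so the row count is the conflict signal.
    {inserted, _} =
      Repo.insert_all(
        ProcessedWebhook,
        [%{provider: provider, provider_ref: ref, payload: payload}],
        on_conflict: :nothing,
        conflict_target: [:provider, :provider_ref]
      )

    case inserted do
      0 ->
        # Conflict: this webhook was already processed
        {:already_processed, ref}

      1 ->
        # First delivery: if this raises, the insert above rolls back too
        apply_business_effect!(provider, payload)
        {:processed, ref}
    end
  end)
end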

The result is exactly-once on the business effect, regardless of how many times the webhook arrives.
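To see the property without a database, here is a toy in-memory model. It is a sketch, not the production handler: LedgerModel and its fields are hypothetical, with a MapSet standing in for processed_webhooks and an integer standing in for the wallet. The point is that the "seen" set and the balance change together or not at all.

```elixir
defmodule LedgerModel do
  @moduledoc """
  Toy in-memory model of the handler: `seen` stands in for processed_webhooks,
  `balance` for the wallet. Both change together or not at all.
  """

  def new, do: %{seen: MapSet.new(), balance: 0}

  # Returns {outcome, new_state}. A repeated {provider, ref} leaves state untouched.
  def deliver(state, provider, ref, amount) do
    key = {provider, ref}

    if MapSet.member?(state.seen, key) do
      {:already_processed, state}
    else
      new_state = %{state | seen: MapSet.put(state.seen, key), balance: state.balance + amount}
      {:processed, new_state}
    end
  end
end

# Five deliveries of the same webhook, as a retrying provider might send them.
{outcomes, final} =
  Enum.map_reduce(1..5, LedgerModel.new(), fn _attempt, state ->
    LedgerModel.deliver(state, "paystack", "ref_123", 1_500)
  end)

IO.inspect(outcomes)
IO.inspect(final.balance)
```

Only the first delivery reports :processed, and the balance reflects a single credit however many retries follow.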

3. Reconciliation closes the loop

Idempotency handles duplicates. It doesn't handle the symmetric problem: webhooks that never arrive. M-Pesa drops a webhook maybe one in ten thousand times. Paystack does it slightly less often. Either way, “we charged the buyer but never credited their wallet because the webhook was lost” is a much worse failure mode than a duplicate.

The fix is an Oban worker that runs every fifteen minutes:

  • Find every transaction in pending for more than fifteen minutes.
  • Hit the provider's /transaction/verify/:ref endpoint.
  • If the provider says success and we don't have a recorded webhook, replay one synthetically through the same handler.
  • The unique constraint protects us if the real webhook later arrives.
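The decision step in that list can be sketched as a pure function, kept separate from Oban and HTTP so it is easy to test. Everything here is hypothetical naming: pending is a list of maps with a ref and an inserted_at timestamp, verify wraps whatever call hits the provider's verify endpoint, and recorded checks processed_webhooks.

```elixir
defmodule ReconcilePlan do
  @moduledoc """
  Pure decision step for one reconciliation pass. All names are hypothetical.
  """

  @stale_after_seconds 15 * 60

  # Returns the refs whose webhook should be replayed through the normal handler:
  # pending for more than fifteen minutes, successful at the provider, not recorded locally.
  def refs_to_replay(pending, now, verify, recorded) do
    pending
    |> Enum.filter(fn tx -> DateTime.diff(now, tx.inserted_at, :second) > @stale_after_seconds end)
    |> Enum.filter(fn tx -> verify.(tx.ref) == :success and not recorded.(tx.ref) end)
    |> Enum.map(fn tx -> tx.ref end)
  end
end

now = ~U[2025-05-22 12:00:00Z]

pending = [
  # Paid at the provider, webhook never arrived
  %{ref: "lost", inserted_at: ~U[2025-05-22 11:30:00Z]},
  # Paid at the provider, webhook already handled
  %{ref: "recorded", inserted_at: ~U[2025-05-22 11:30:00Z]},
  # Provider says the payment failed
  %{ref: "abandoned", inserted_at: ~U[2025-05-22 11:30:00Z]},
  # Too recent to judge
  %{ref: "fresh", inserted_at: ~U[2025-05-22 11:55:00Z]}
]

# Stand-ins for the provider call and the processed_webhooks lookup.
verify = fn
  "abandoned" -> :failed
  _ref -> :success
end

recorded = fn ref -> ref == "recorded" end

replay = ReconcilePlan.refs_to_replay(pending, now, verify, recorded)
IO.inspect(replay)
```

Only "lost" comes back. Replaying it goes through the same handler as a real webhook, so a late real delivery lands on the unique constraint and does nothing.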

Reconciliation isn't a fallback. It's the source of truth on whether your system and the provider agree. Run it forever. Alert when it has to repair more than 0.1% of transactions; that probably means a webhook DNS issue or a TLS cert problem worth knowing about.
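The 0.1% alert is plain arithmetic. A minimal sketch, assuming you already count repaired and total transactions per window (RepairAlert and both counters are hypothetical names):

```elixir
defmodule RepairAlert do
  # 0.1% expressed as 1 per 1_000 so the comparison stays in integers.
  @threshold_per_thousand 1

  # True when repaired transactions exceed 0.1% of all transactions in the window.
  def alert?(repaired, total)
      when is_integer(repaired) and is_integer(total) and total > 0 do
    repaired * 1_000 > total * @threshold_per_thousand
  end

  # An empty window has nothing to alert on.
  def alert?(_repaired, 0), do: false
end
```

At the one-in-ten-thousand drop rate mentioned above, alert?(1, 10_000) stays false, while alert?(20, 10_000) fires.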

What you don't need

You don't need a message broker. You don't need an event store. You don't need a saga framework. The whole pattern is two columns and a PRIMARY KEY plus one Oban worker.

The harder lesson is cultural: trust the providers' references, not your own request-IDs. Trust the database, not the webhook timing. Verify against the provider periodically, because the provider is always right and you're sometimes wrong.

The Kenya-specific bit

Two M-Pesa-specific quirks worth knowing if you're building this:

  • STK Push timeouts are silent. If the user doesn't enter their PIN within ~60 seconds, the request times out, but M-Pesa doesn't always send a callback. Treat lack of callback after 90 seconds as failed and let reconciliation correct you if it actually succeeded.
  • The C2B simulation API doesn't match prod behavior. Daraja sandbox happily accepts any phone number and returns success. Prod will reject malformed numbers and return cryptic codes. Build a thin adapter that normalizes both into the same response shape so you don't ship sandbox-specific assumptions.
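Both quirks reduce to small pure functions. The sketch below is illustrative only: StkGuards is a hypothetical module, and the JSON field names matched in normalize_response/1 ("ResponseCode", "CheckoutRequestID", "errorCode", "errorMessage") are assumptions about Daraja's responses that you should confirm against what sandbox and prod actually return.

```elixir
defmodule StkGuards do
  @moduledoc """
  Hypothetical helpers for the two quirks above. Field names in
  normalize_response/1 are assumptions about Daraja's JSON.
  """

  @callback_deadline_seconds 90

  # No callback after 90 seconds: mark failed locally.
  # Reconciliation may still flip it later if the payment actually succeeded.
  def status_without_callback(requested_at, now) do
    if DateTime.diff(now, requested_at, :second) > @callback_deadline_seconds do
      :failed
    else
      :pending
    end
  end

  # Accepted request: assumed to carry ResponseCode "0" and a CheckoutRequestID.
  def normalize_response(%{"ResponseCode" => "0", "CheckoutRequestID" => id}) do
    {:ok, %{checkout_request_id: id}}
  end

  # Rejected request: assumed to carry errorCode and errorMessage.
  def normalize_response(%{"errorCode" => code} = body) do
    {:error, %{code: code, message: Map.get(body, "errorMessage", "")}}
  end

  # Anything unrecognised is surfaced as an error instead of being guessed at.
  def normalize_response(body) do
    {:error, %{code: "unknown", message: inspect(body)}}
  end
end
```

Callers then only ever see {:ok, ...} or {:error, ...}, so nothing downstream depends on which environment produced the response.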

Idempotent reconciliation isn't glamorous. It's also the difference between a payments product that scales and one that ships money twice.