Event-driven systems are usually introduced as a decoupling pattern. That framing is useful, but it hides the harder problem.
Once a business fact is published asynchronously, it stops being one observation and becomes many independent observations. One consumer may see it in 80 milliseconds. Another may see it in 3 seconds. Another may see a related later event first because it came through a different partition, a different retry path, or a different backlog profile.
That is the real source of correctness bugs in event-driven systems. The trouble is rarely that the broker violated its local guarantees. The trouble is that the application quietly depended on a stronger order than the system ever actually provided.
Most teams use the word “ordered” as if it means the business flow will be observed in the right sequence.
It usually does not mean that.
It might mean the broker preserves append order within a partition. It might mean one consumer fetches messages serially. It might mean a sink rejects stale versions. It might mean a replay job is deterministic. It might mean none of those, and people are just speaking loosely.
That looseness is where correctness debt starts.
A system does not become safe because events are ordered somewhere. It becomes safe when the consumer behavior is correct under the exact ordering the system actually provides.
Most systems are not missing order. They are missing an honest sentence about where order stops.
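To make one of those meanings concrete, here is a minimal sketch of "a sink rejects stale versions". It assumes every event carries a per-entity version number that only increases, which is an assumption of this example and not something the text guarantees. VersionedSink and apply are hypothetical names, and the in-memory dict stands in for a real store.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedSink:
    """In-memory stand-in for a store that keeps one row per entity."""
    rows: dict = field(default_factory=dict)  # entity_id -> (version, state)

    def apply(self, entity_id: str, version: int, state: str) -> bool:
        """Write the state only if it is newer than what is stored."""
        current = self.rows.get(entity_id)
        if current is not None and version <= current[0]:
            return False  # stale or duplicate delivery: ignore it
        self.rows[entity_id] = (version, state)
        return True


sink = VersionedSink()
# Deliveries arrive out of order: version 3 is seen before version 2.
results = [
    sink.apply("order-1", 1, "created"),
    sink.apply("order-1", 3, "shipped"),
    sink.apply("order-1", 2, "paid"),
]
print(results)               # [True, True, False]
print(sink.rows["order-1"])  # (3, 'shipped')
```

Note what this does and does not buy. The stored state converges on the newest version regardless of arrival order, but the "paid" step is silently dropped. That is the right behavior for a latest-state projection and the wrong behavior for anything that must show every step.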
Consider a commerce flow split across three services. The Order service owns the order write path. It commits an orders row and emits order_created.
The Payment service captures funds and emits payment_captured.
The Fulfillment service allocates stock and emits order_shipped.
Downstream consumers subscribe for different reasons:
customer timeline
notification service
support tooling
search indexing
analytics
fraud detection
revenue recognition
partner webhooks
Now add the parts the diagram usually omits.
The order-events topic has 24 partitions and is keyed by order_id.
The payment-events topic has 12 partitions and is keyed by customer_id because the payments team cared more about wallet-level analysis than order-level sequencing.
The fulfillment-events topic has 36 partitions and is keyed by warehouse_region + shipment_id to smooth hotspot load during peak hours.
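The three keying choices above can be sketched as follows. This is an illustration, not any broker's actual partitioner: it uses CRC32 as a stand-in for whatever stable hash the broker applies, and it joins the composite fulfillment key with a colon, which is an assumption. The identifiers in the sample orders are hypothetical.

```python
import zlib

# Partition counts and key fields are taken from the text above.
TOPICS = {
    "order-events": (24, lambda e: e["order_id"]),
    "payment-events": (12, lambda e: e["customer_id"]),
    "fulfillment-events": (
        36,
        lambda e: f'{e["warehouse_region"]}:{e["shipment_id"]}',
    ),
}


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def placements(event: dict) -> dict:
    """Return the partition each topic would choose for one business flow."""
    return {
        topic: partition_for(key_fn(event), count)
        for topic, (count, key_fn) in TOPICS.items()
    }


# Hypothetical identifiers for two orders placed by the same customer.
order_a = {"order_id": "o-1001", "customer_id": "c-77",
           "warehouse_region": "eu-west", "shipment_id": "s-501"}
order_b = {"order_id": "o-1002", "customer_id": "c-77",
           "warehouse_region": "us-east", "shipment_id": "s-502"}

for order in (order_a, order_b):
    print(order["order_id"], placements(order))
```

The partition numbers themselves are not the point. What matters is that each topic derives its partition from a different field, so no single partition ever holds the full lifecycle of one order. The only relationship the payment topic preserves is per customer: both orders above land on the same payment partition because they share a customer_id.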
The timeline service consumes all three topics and materializes a customer-visible history.
Under steady load, median publish-to-consume latency is around 120 milliseconds. Global p99 is 1.8 seconds. During a large sale event, one hot partition may accumulate 300,000 to 500,000 queued messages and drift several minutes behind while most partitions stay nearly current.
That system can be healthy enough to pass every infrastructure dashboard and, at the same time, unhealthy enough to produce a broken business narrative.
Nothing has “gone wrong” yet. This is just the baseline reality of a scaled event-driven system.
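A small simulation shows how that baseline alone reorders a business flow. The publish times and the per-topic lag values below are hypothetical, chosen to match the ranges described above: about 120 milliseconds on healthy partitions and several minutes on the one hot order-events partition. Event and observed_order are illustrative names, not part of any real consumer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    topic: str
    published_at: float  # seconds since the order was placed


# Hypothetical publish times for one order's lifecycle.
EVENTS = [
    Event("order_created", "order-events", 0.0),
    Event("payment_captured", "payment-events", 2.0),
    Event("order_shipped", "fulfillment-events", 120.0),
]

# Assumed publish-to-consume lag, in seconds, for the partition each event
# landed on. The order-events partition is the hot one, five minutes behind.
LAG = {
    "order-events": 300.0,
    "payment-events": 0.12,
    "fulfillment-events": 0.12,
}


def observed_order(events, lag):
    """Return event names in the order the timeline consumer sees them."""
    arrivals = [(e.published_at + lag[e.topic], e.name) for e in events]
    return [name for _, name in sorted(arrivals)]


timeline = observed_order(EVENTS, LAG)
print(timeline)  # ['payment_captured', 'order_shipped', 'order_created']
```

Every partition in this run delivered its own messages in append order. The timeline still shows a shipment for an order that, from its point of view, does not exist yet.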
Diagram placeholder: One Business Flow, Three Topics, No Shared Order
Show one order lifecycle spread across the order, payment, and fulfillment topics, each with its own partitioning strategy, and make it visually obvious that a customer timeline can observe order_shipped before order_created without any broker misbehavior.
Placement note: Place near Request Path Walkthrough after the first end-to-end example.