Systems Engineering
The Architecture of Speed: How Low-Latency Trading Systems Actually Work
Most performance advice optimizes the wrong thing. Here is what actually matters when you are building systems that need to respond in under a millisecond.
The first thing most developers get wrong about high-performance systems is thinking that speed and latency are the same problem. They are not. Speed is about throughput — how many operations per second your system can sustain. Latency is about response time — how long any individual operation takes from start to finish. In trading systems, the thing that kills you is not insufficient throughput. It is unpredictable latency.
The distinction matters because the solutions are different. Optimizing for throughput often involves batching, buffering, and parallelism — all of which can introduce latency variance. What a market maker needs is not a system that can process a million messages per second on average. It needs a system where the 99.9th percentile latency is as close as possible to the median. The spikes are what hurt you, because a spike means a quote that was stale when the market moved, which means adverse selection, which means money lost.
I spent eight years at IMC Financial Markets and a year at Flow Traders — together representing some of the top options and ETP market makers in the world — building systems where this distinction was the design constraint that drove everything else. Here is what I learned.
Exchange Protocols: The Foundation You Cannot Abstract Away
Every financial exchange has a native protocol. CME uses MDP 3.0 and iLink. Eurex uses the Enhanced Transaction Solution. ICE has its own market data and order protocols. The London Stock Exchange uses ITCH for market data. NYSE Arca uses Pillar. These are binary protocols — not HTTP, not JSON, not anything a web developer would recognize — designed for maximum parsing efficiency and minimum wire size.
The fundamental lesson about exchange protocols is that you cannot abstract them away behind a generic interface without paying a performance penalty. Every abstraction layer adds cost. If your order router treats all exchanges identically through a common interface, you are paying the abstraction cost on every message. At the volumes market makers operate at — tens of thousands of order updates per second across dozens of instruments — that abstraction cost accumulates into latency.
The pattern we used at IMC was to build exchange adapters that were as thin as possible, with the exchange-specific logic pushed as close to the wire as it could go. The adapter's job was to translate between our internal canonical message format and the exchange's native format, and to do nothing else. No logging on the hot path. No dynamic dispatch. No allocation. Parse the incoming bytes into a struct, call the handler, done.
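The zero-allocation parse step can be sketched as follows. This is a minimal illustration, not any real exchange's wire format: the field names, offsets, and the `QuoteHandler` interface are all assumptions. The point is the shape of the hot path: decode fixed offsets from a `ByteBuffer` into a single reusable holder and call the handler, with nothing allocated per message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-layout quote message. Field names and offsets are
// illustrative only; every exchange defines its own binary layout.
public final class QuoteParser {

    // Reusable mutable holder: the hot path fills it in place, never allocates.
    public static final class Quote {
        public long instrumentId;
        public long priceTicks;
        public int quantity;
    }

    public interface QuoteHandler {
        void onQuote(Quote q);
    }

    private final Quote scratch = new Quote();

    // Parse bytes already read from the socket into the reusable struct and
    // hand it straight to the handler. No logging, no framing objects,
    // no allocation.
    public void onMessage(ByteBuffer buf, QuoteHandler handler) {
        buf.order(ByteOrder.LITTLE_ENDIAN);     // many binary feeds are little-endian
        scratch.instrumentId = buf.getLong(0);  // absolute gets at fixed offsets
        scratch.priceTicks   = buf.getLong(8);
        scratch.quantity     = buf.getInt(16);
        handler.onQuote(scratch);
    }
}
```

Note that the handler must consume the `Quote` before returning, because the holder is overwritten by the next message; that contract is what buys the zero-allocation property.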
Exchange connectivity also means handling session management: FIX sessions have a specific heartbeat and recovery protocol. Binary sessions have their own rules. You need to handle session disconnects gracefully, replay missed sequence numbers, and rejoin the session without losing your position in the order book. Getting this wrong produces gaps in your market data — periods where you have an incomplete view of the order book and therefore cannot price accurately. Gaps in market data are expensive.
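The core of gap handling is sequence tracking. Here is a minimal sketch of the idea; real protocols (FIX, ITCH, MDP) each define their own recovery mechanics, and the class name and API below are illustrative, not taken from any of them.

```java
// Minimal sequence-number gap detector for a market data session. When a gap
// is observed, the session layer should request a replay of the missing range
// before trusting the order book again.
public final class SequenceTracker {
    private long expected = 1;              // next sequence number we expect
    private long gapStart = -1, gapEnd = -1;

    // Returns true if the message arrived in order; records the gap otherwise.
    public boolean onSequence(long seq) {
        if (seq == expected) {
            expected = seq + 1;
            return true;
        }
        if (seq > expected) {               // messages were lost: book view is incomplete
            gapStart = expected;
            gapEnd = seq - 1;
            expected = seq + 1;             // resume counting after the gap
            return false;
        }
        return false;                       // duplicate or replayed message: ignore
    }

    public long gapStart() { return gapStart; }
    public long gapEnd()   { return gapEnd; }
}
```

In practice the detector also has to cope with sequence resets on session restart and with replayed messages interleaving with live ones, which is where most of the real complexity lives.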
The Ring Buffer: Why It Is the Core Data Structure of Low-Latency Systems
If you read one paper before building a low-latency system, make it the LMAX Disruptor paper. The insight behind it is simple and profound: the major source of latency in multi-threaded systems is contention — threads waiting for each other. The solution is to eliminate contention by eliminating locks.
A ring buffer, implemented in the Disruptor pattern, is a pre-allocated circular array. Producers write to a position in the array by advancing a sequence number. Consumers read from positions the producer has already written. No locks are held during normal operation — the coordination is done through atomic sequence numbers and memory barriers. Because the buffer is pre-allocated, no memory allocation happens during message passing. This is critical because in a JVM-based system, allocation leads eventually to garbage collection, and garbage collection introduces pauses. Even a ten-millisecond GC pause is catastrophic in a system where your target latency is measured in microseconds.
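The mechanism can be sketched in a few dozen lines. This is a single-producer/single-consumer reduction of the idea, assuming a power-of-two capacity; a production Disruptor adds batching, wait strategies, multi-consumer gating, and cache-line padding, none of which appear here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC ring buffer in the Disruptor style: pre-allocated slots,
// coordination through atomic sequence counters, no locks, and no allocation
// after construction.
public final class SpscRingBuffer {
    private final long[] slots;       // pre-allocated primitives: invisible to the GC
    private final int mask;           // capacity must be a power of two
    private final AtomicLong producerSeq = new AtomicLong(-1); // last published slot
    private final AtomicLong consumerSeq = new AtomicLong(-1); // last consumed slot

    public SpscRingBuffer(int capacity) {
        if (Integer.bitCount(capacity) != 1)
            throw new IllegalArgumentException("capacity must be a power of two");
        slots = new long[capacity];
        mask = capacity - 1;
    }

    // Producer: claim the next slot, write, then publish by advancing the
    // sequence. The volatile store to producerSeq is the memory barrier that
    // makes the slot's contents visible to the consumer.
    public boolean offer(long value) {
        long next = producerSeq.get() + 1;
        if (next - consumerSeq.get() > slots.length)  // full: would overwrite unread data
            return false;
        slots[(int) (next & mask)] = value;
        producerSeq.set(next);                        // publish
        return true;
    }

    // Consumer: read only positions the producer has already published.
    public long poll(long missingSentinel) {
        long next = consumerSeq.get() + 1;
        if (next > producerSeq.get())                 // nothing published yet
            return missingSentinel;
        long value = slots[(int) (next & mask)];
        consumerSeq.set(next);
        return value;
    }
}
```

The power-of-two capacity lets `next & mask` replace a modulo, and the sentinel-returning `poll` avoids allocating an `Optional` or boxing on the empty case — small choices, but consistent with the no-allocation discipline the paragraph above describes.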
At IMC, the ring buffer was the backbone of our internal message bus. Market data came in from the exchange adapter, was written to the ring buffer, and was consumed by the pricing engine. Orders generated by the pricing engine were written to a second ring buffer and consumed by the order router. The entire hot path — from market data in to order out — touched no lock and allocated no memory. The performance characteristics were predictable because the implementation had no components that could behave unpredictably.
Anatomy of a Microsecond Trade
The Cache Hierarchy and Why It Is Not Optional to Understand
The Memory Hierarchy
Modern processors do not operate directly on main memory. They operate on caches: L1 cache (typically 32KB per core, access time roughly 1 nanosecond), L2 (typically 256KB per core, roughly 5 nanoseconds), L3 (shared across cores, several MB, roughly 20-30 nanoseconds), and then main memory at roughly 100 nanoseconds. These numbers are not abstract. At the operating frequencies of a trading system, the difference between an L1 cache hit and a main memory access is the difference between a 1-nanosecond operation and a 100-nanosecond operation — a factor of one hundred.
The practical implication is that data layout matters. If your hot path accesses data that is laid out in memory such that each access is likely to be a cache miss — because the accessed fields are far apart, or because you are following pointer chains through heap-allocated objects — you will pay the main memory penalty repeatedly. The solution is cache-friendly data layout: keeping the data accessed together in the same sequence actually adjacent in memory, preferring arrays of structs to linked lists of objects, avoiding unnecessary indirection.
In Java, this is harder than in C++. The JVM provides no direct control over object layout in memory. The pattern we used was to prefer primitive arrays and off-heap memory accessed through ByteBuffer for the structures on the critical path. Off-heap memory is not managed by the garbage collector, so it does not contribute to GC pressure. It also allows you to control layout explicitly, which improves cache behavior.
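The off-heap pattern looks roughly like this. The record layout, field names, and class name are illustrative assumptions; the technique is the fixed-offset "flyweight" view over a direct `ByteBuffer`, which keeps consecutive records adjacent in memory and out of the garbage collector's reach.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Fixed-layout record table over off-heap memory. A sequential scan walks
// memory in cache-line order instead of chasing pointers between heap objects,
// and the GC never sees (or pauses for) any of it.
public final class PriceTable {
    private static final int RECORD_SIZE  = 16;  // 8-byte price + 4-byte qty + 4 pad
    private static final int PRICE_OFFSET = 0;
    private static final int QTY_OFFSET   = 8;

    private final ByteBuffer buf;

    public PriceTable(int records) {
        // Direct buffer: allocated outside the Java heap, not GC-managed.
        buf = ByteBuffer.allocateDirect(records * RECORD_SIZE)
                        .order(ByteOrder.nativeOrder());
    }

    public void setPrice(int i, long ticks) { buf.putLong(i * RECORD_SIZE + PRICE_OFFSET, ticks); }
    public long price(int i)                { return buf.getLong(i * RECORD_SIZE + PRICE_OFFSET); }
    public void setQty(int i, int qty)      { buf.putInt(i * RECORD_SIZE + QTY_OFFSET, qty); }
    public int qty(int i)                   { return buf.getInt(i * RECORD_SIZE + QTY_OFFSET); }
}
```

The cost of the pattern is that the compiler no longer checks your field accesses — an offset typo reads the wrong field silently — which is why the offsets live in named constants rather than inline arithmetic.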
CPU Affinity and NUMA Awareness
Modern servers are not uniform. A dual-socket server has two CPUs, each with its own local memory. Accessing memory that is local to your CPU — NUMA-local memory — is faster than accessing memory attached to the other socket. If your critical thread is running on CPU 0 but its data was allocated by a thread that ran on CPU 1, you pay the cross-NUMA penalty on every access. This is not hypothetical overhead. It is measurable latency that compounds with every operation.
The solution is to be explicit. Pin your critical threads to specific cores using CPU affinity. Ensure that memory accessed by those threads is allocated locally. Use isolcpus in the kernel boot parameters to prevent the OS scheduler from assigning non-critical processes to your latency-critical cores. Prevent interrupts from being delivered to those cores. The goal is to give your critical threads an uninterrupted, predictable execution environment where the only variables are the ones in your code.
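Concretely, this kind of isolation is usually expressed as kernel boot parameters plus an affinity-aware launcher. The core numbers below are placeholders for whatever your topology dictates, and `trading-engine` is a hypothetical binary; the parameter names themselves (`isolcpus`, `nohz_full`, `irqaffinity`) are standard Linux kernel options.

```
# Kernel boot parameters (e.g. appended to the GRUB command line):
# reserve cores 2-3 for the trading process, keep the scheduler tick
# off them, and steer hardware interrupts to the housekeeping cores.
isolcpus=2,3 nohz_full=2,3 irqaffinity=0,1

# Launch the latency-critical process pinned to the isolated cores:
taskset -c 2,3 ./trading-engine
```

On a NUMA machine you would additionally bind memory allocation to the local node (for example with `numactl --membind`) so the pinned threads never pay the cross-socket penalty described above.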
At IMC, the most latency-sensitive processes ran on isolated cores with OS jitter minimized. The difference in p99 latency between a process running on an isolated core and the same process running on a shared core was significant — not because the hardware was different, but because the predictability was different.
What Most Developers Get Wrong
The most common mistake I see when engineers approach performance optimization for the first time is optimizing before measuring. They add caching, parallelize tasks, rewrite hot functions with more efficient algorithms — all reasonable intuitions — without first identifying which part of the system is actually the bottleneck. In most trading systems I have worked on, the bottleneck was not CPU. It was memory latency, or GC pauses, or lock contention, or network jitter. CPU optimization in the absence of those fixes does very little.
The second common mistake is confusing average performance with tail performance. A system whose average latency is 50 microseconds but whose p99.9 latency is 5 milliseconds is a bad system for trading. The outlier is the event that coincides with a market move. You need to understand your entire latency distribution, not just the mean.
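A tiny calculation makes the point. This sketch uses the nearest-rank percentile convention, which is one common definition among several; the class and method names are illustrative.

```java
import java.util.Arrays;

// Report the distribution, not the average: the mean of a latency series can
// look healthy while the tail is catastrophic.
public final class LatencyStats {
    private final long[] sorted;    // latencies in microseconds, sorted ascending
    private final double mean;

    public LatencyStats(long[] samples) {
        sorted = samples.clone();
        Arrays.sort(sorted);
        long sum = 0;
        for (long s : samples) sum += s;
        mean = (double) sum / samples.length;
    }

    public double mean() { return mean; }

    // Nearest-rank percentile: the smallest value such that at least
    // p percent of the samples are <= it.
    public long percentile(double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

With 998 samples at 50 microseconds and two at 5 milliseconds, the mean comes out near 60 microseconds and the median is exactly 50 — both look fine — while the p99.9 is 5,000 microseconds. Only the percentile view reveals the spikes that coincide with market moves.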
The third mistake is treating performance as a phase rather than a property. Performance requirements in low-latency systems cannot be retrofitted onto an architecture that was not designed for them. The decisions that determine your latency ceiling — the threading model, the data structures, the memory management approach, the network stack — are made at the beginning of the design process. If those decisions are wrong, no amount of optimization later will fix them. You rebuild from the design, or you accept the ceiling.
I have rebuilt from the design. It is significantly less pleasant than getting it right the first time. The lesson, which applies across every industry I have worked in, is to treat latency as a first-class requirement, not a stretch goal, from the first design conversation.
"Performance requirements in low-latency systems cannot be retrofitted onto an architecture that was not designed for them. You rebuild from the design, or you accept the ceiling."
Arindam Paul — Systems Engineer, Amsterdam