Digital exchanges achieving performance, scale, and resilience with Google Cloud


As digital-native exchanges explore running their markets in the cloud, the focus is often on specific requirements such as latency, determinism, and resilience. Generalized storage and compute paradigms are not enough; supporting trading in the cloud requires a purpose-built approach. This is why, for several years, Google Cloud has adopted workload optimization and intentional design as central principles for our infrastructure platform. We engineer golden paths from silicon to the customer workload, using a combination of purpose-built infrastructure, prescriptive architectures, and an open ecosystem to deliver workload-optimized infrastructure¹.

One example of how we offload work from the host is Titanium, a system of tailored, custom silicon and multiple tiers of scale-out offloads that improves the performance, reliability, and security of our customers’ workloads. We’ve seen this workload-optimized approach particularly benefit latency-sensitive industry workloads such as trading exchanges.

Ensuring that digital-native exchanges can operate in the cloud means providing capabilities that help ensure performant, scalable, and resilient markets for the future. In particular, many exchanges rely on Aeron, an open-source, low-latency, high-throughput, fault-tolerant messaging framework. As outlined in the recent Aeron Performance Testing on Google Cloud report by Adaptive, a capital markets consultancy, software provider, and Google Cloud partner, running Aeron on Google Cloud’s purpose-built infrastructure can directly address the latency, scale, and resilience required to operate a cloud-based exchange, market maker, or market data aggregator.
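
To make the framework concrete, the sketch below shows Aeron’s basic publish/subscribe flow in Java, roughly the shape of the node-to-node path exercised in the benchmarks. It is a minimal illustration, not Adaptive’s test harness: the channel endpoint, stream ID, and payload are placeholder values, and an embedded media driver is used to keep the example self-contained.

```java
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.driver.MediaDriver;
import io.aeron.logbuffer.FragmentHandler;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AeronPubSubSketch
{
    public static void main(final String[] args)
    {
        // Placeholder channel and stream; a benchmark run would target
        // another node's endpoint rather than localhost.
        final String channel = "aeron:udp?endpoint=localhost:20121";
        final int streamId = 1001;

        try (MediaDriver driver = MediaDriver.launchEmbedded();
            Aeron aeron = Aeron.connect(
                new Aeron.Context().aeronDirectoryName(driver.aeronDirectoryName()));
            Subscription subscription = aeron.addSubscription(channel, streamId);
            Publication publication = aeron.addPublication(channel, streamId))
        {
            // A 288-byte message, matching the packet size used in the tests.
            final UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(288));
            buffer.putBytes(0, "FIX order placeholder".getBytes(StandardCharsets.US_ASCII));

            // offer() is non-blocking; retry while back-pressured or not yet connected.
            while (publication.offer(buffer, 0, 288) < 0)
            {
                Thread.yield();
            }

            final FragmentHandler handler = (buf, offset, length, header) ->
                System.out.println("received " + length + " bytes");

            while (subscription.poll(handler, 10) <= 0)
            {
                Thread.yield();
            }
        }
    }
}
```

Because the transport sits below this API, application code of this shape is unchanged whether messages travel over the kernel network stack or a DPDK-accelerated path like the one benchmarked below.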

Performance to support cloud exchange needs

The Aeron messaging fabric, running on Google Cloud Compute Engine C3 virtual machine instances with the Data Plane Development Kit (DPDK), delivered node-to-node communication of 100,000 messages per second, using 288-byte data packets (e.g., a FIX order message), at a P99 latency of 18 microseconds (µs, one millionth of a second). These metrics (Figure 1) demonstrate the ability to run on low-latency infrastructure, enabling market participants to trade on cloud exchanges with confidence. More importantly, the configuration achieved deterministic repeatability, high throughput, low latency, and high availability via Aeron Cluster at an additional latency cost of just 18µs at P99.

Figure 1
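
Percentile figures like the 18 µs P99 in Figure 1 are typically derived from an HdrHistogram of per-message timings, the approach the Aeron benchmarking tools take. The sketch below shows the recording and reporting pattern; the recorded samples here are synthetic placeholders, not benchmark data.

```java
import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

public class LatencyReportSketch
{
    public static void main(final String[] args)
    {
        // Track values up to 10 seconds with 3 significant digits.
        final Histogram histogram = new Histogram(TimeUnit.SECONDS.toNanos(10), 3);

        // In a real run each sample is a round-trip time measured with
        // System.nanoTime() around an offer()/poll() pair; these values
        // are synthetic, for illustration only.
        for (int i = 0; i < 100_000; i++)
        {
            histogram.recordValue(15_000 + (i % 100) * 60);
        }

        System.out.printf("P50: %.1f us%n", histogram.getValueAtPercentile(50.0) / 1_000.0);
        System.out.printf("P99: %.1f us%n", histogram.getValueAtPercentile(99.0) / 1_000.0);
    }
}
```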

Addressing raw latency and jitter is important, but the ability to absorb growth and market volatility matters even more for rapidly growing digital exchanges and their participants. Google Cloud’s design and investments in on-host offloading and acceleration include the Infrastructure Processing Unit (IPU), which demonstrated processing 4.7 million 288-byte messages per second, distributing more than 10 Gb per second from a single thread (Figure 2). This message rate far exceeds most exchange requirements for match-engine communication and data distribution; the Options Price Reporting Authority (OPRA) is possibly the only data feed consistently operating beyond these rates.

Figure 2
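
As a back-of-envelope check on the Figure 2 numbers (our arithmetic, not part of the published benchmark), 4.7 million messages per second at 288 bytes works out to roughly 10.8 Gb per second of payload:

```java
public class ThroughputCheck
{
    public static void main(final String[] args)
    {
        final long messagesPerSecond = 4_700_000L; // IPU single-thread result
        final long messageSizeBytes = 288L;

        // 4.7M msg/s x 288 B x 8 = ~10.8 Gb/s of payload, consistent with
        // "over 10 Gb per second" before any framing overhead.
        final double gigabitsPerSecond =
            messagesPerSecond * messageSizeBytes * 8 / 1e9;

        System.out.printf("%.2f Gb/s%n", gigabitsPerSecond);
    }
}
```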

Exchanges (traditional and cloud-native) often need to perform expensive, disruptive, and risky re-architectures of their core messaging frameworks as market demand grows. Pairing Aeron messaging with C3 VMs powered by a custom IPU helps ensure a foundational messaging layer that can sustain an exchange’s constant match processing.

“The future of capital markets liquidity is in the cloud, and the foundations of latency, determinism and resilience are now in place. We’re committed to helping market participants navigate this shift, and we’re thrilled to publish these benchmarks with Google Cloud.” - Matt Barrett, CEO, Adaptive

Scale to meet requirements of today and tomorrow

To scale and differentiate traffic, many on-premises exchange infrastructures rely on a network built around 10 Gbps switching and multi-NIC deployments. With Titanium-based technologies, Google Cloud’s 200 Gbps network capacity far exceeds typical exchange networks. Our workload-optimized infrastructure design offloads the network stack from the host system, allowing the CPU to focus on maximizing performance for customer workloads. Titanium-powered C3 VMs with Hyperdisk Extreme now support 500K IOPS per compute instance to meet the needs of demanding workloads, 25% more IOPS per instance than the other two leading hyperscalers², courtesy of Titanium.

While exchanges cannot control market volatility, they must prepare for sudden bursts and spikes in message volumes. The C3 can help, without requiring significant investments in or changes to existing applications. For example, we’ve seen a 50% reduction in Aeron P50 and P99 latencies compared with the prior-generation C2 instance types. Planning and preparing for new volatility no longer means rewriting or optimizing code, but simply launching new instance types. Adaptive and Google Cloud are committed to keeping Aeron up to date and tuned for the latest Google Cloud solutions.

Resilience is more than a disaster recovery plan

Latency and scalability are only part of the challenge facing digital-native exchanges. You can run the fastest deterministic exchange, handling millions of transactions per second, but systems still need to be maintained: OS updates applied, security patches installed, and new code deployed. Traditional exchanges take downtime for system maintenance during overnight hours and weekends; digital exchanges operating 24/7 do not have that option.

Organizations running exchanges on Google Cloud maintain operational control and the resilience of the underlying infrastructure used to run matching and trading services. To help enable resilience at the software level, Aeron Cluster provides an implementation of Raft consensus upon which highly available services can be built. In the event a given node becomes unavailable, Aeron Cluster automatically elects a new leader node, and markets can continue operating gracefully with zero message loss and deterministic confidence. Furthermore, each exchange can set a specific high-availability objective based upon market dynamics (e.g., limit order book or RFQ platform), tolerance for order confirmation loss, and regulatory requirements. The chart below shows the latency implications of achieving deterministic repeatability via Aeron Cluster Raft consensus under various availability approaches (Figure 3):

  • One zone in compact placement for the primary cluster plus an additional zone ready in the event of a zone failure

  • One zone in spread placement to maximize node availability for the primary cluster, plus an additional zone in the event of a zone failure

  • Three zones each running a node, enabling zero order loss within a given region  

Figure 3

While exchanges running on-premises are in control of their infrastructure, that control comes with challenges: multiple single points of failure (network, hardware, hardware supply chain, software, and people) and manual failover operations, while traditional high-availability solutions remain complex and error-prone. The Aeron Cluster and Google Cloud approach provides automated failover, reducing infrastructure risk and the potential for operator error. Finally, firms requiring specific resilience patterns can automate zone separation using an organizational policy constraint that helps ensure node placement (at the organization, folder, or project level) in the resource hierarchy respects their resiliency requirements.
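
To illustrate the software side, the sketch below outlines the shape of an Aeron Cluster service in Java. A class implementing ClusteredService receives each client message only after it has been committed through Raft consensus, so every node applies the same sequence deterministically, and automated failover surfaces as a role-change callback. This is a minimal, hypothetical skeleton, not an actual matching engine.

```java
import io.aeron.ExclusivePublication;
import io.aeron.Image;
import io.aeron.cluster.codecs.CloseReason;
import io.aeron.cluster.service.ClientSession;
import io.aeron.cluster.service.Cluster;
import io.aeron.cluster.service.ClusteredService;
import io.aeron.logbuffer.Header;
import org.agrona.DirectBuffer;

public class SketchMatchingService implements ClusteredService
{
    private Cluster cluster;

    public void onStart(final Cluster cluster, final Image snapshotImage)
    {
        this.cluster = cluster; // snapshotImage would be replayed here to restore state
    }

    public void onSessionMessage(
        final ClientSession session,
        final long timestamp,
        final DirectBuffer buffer,
        final int offset,
        final int length,
        final Header header)
    {
        // Invoked only for messages committed via Raft. A real service would
        // apply the order to a deterministic book; here we simply acknowledge.
        while (session.offer(buffer, offset, length) < 0)
        {
            cluster.idleStrategy().idle();
        }
    }

    public void onRoleChange(final Cluster.Role newRole)
    {
        // LEADER/FOLLOWER transitions land here during automated failover.
    }

    public void onSessionOpen(final ClientSession session, final long timestamp) {}
    public void onSessionClose(final ClientSession session, final long timestamp, final CloseReason reason) {}
    public void onTimerEvent(final long correlationId, final long timestamp) {}
    public void onTakeSnapshot(final ExclusivePublication snapshotPublication) {}
    public void onTerminate(final Cluster cluster) {}
}
```

A real deployment runs one such service per node under a ClusteredServiceContainer, with the consensus module handling leader election across the zone layouts described above.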

To help enable resilience at the hardware level, Google Cloud’s C3 instance types support Advanced Maintenance, which allows Google Cloud to coordinate with customers on updating software and firmware without disrupting customer workloads. Additionally, C3 Advanced Maintenance makes it possible to postpone maintenance on instances running critical workloads for up to one month and, more importantly, notifies you a week in advance of maintenance required on a given instance. By combining Aeron software with Google Cloud’s C3 capabilities, exchanges can gracefully manage underlying infrastructure changes regardless of the source of change, including hot upgrades, enabling new features, and adding nodes, all without incurring service interruption.

Maintenance notifications are API-driven and provide details of every planned maintenance event, such as the window schedule and status. Aeron and exchange applications can also natively handle maintenance notifications, enabling exchanges to automatically establish operating models aligned with their specific operational requirements, such as low-volume hours, established change windows, and regulatory needs.
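
As a simple illustration of consuming such signals from inside an instance, the sketch below polls the Compute Engine metadata server’s maintenance-event entry, which blocks until the value changes. This is a minimal sketch: a production exchange would use the full notifications API described above, and how the event is handled (such as triggering an Aeron Cluster leadership handover) is an assumption for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MaintenanceWatcher
{
    // The metadata server exposes maintenance state to the instance itself;
    // wait_for_change=true makes the request block until the value transitions.
    private static final String URL =
        "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event?wait_for_change=true";

    public static void main(final String[] args) throws Exception
    {
        final HttpClient client = HttpClient.newHttpClient();

        while (true)
        {
            final HttpRequest request = HttpRequest.newBuilder(URI.create(URL))
                .header("Metadata-Flavor", "Google") // required by the metadata server
                .GET()
                .build();

            final HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

            // Body is e.g. "NONE" or "MIGRATE_ON_HOST_MAINTENANCE"; an exchange
            // could react here ahead of the maintenance window.
            System.out.println("maintenance-event: " + response.body());
        }
    }
}
```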

What’s next?

Exchanges pride themselves on being critical venues where global capital markets manage risk. They are especially important in times of volatility driven by macro events, institutional failures, and natural disasters. To be available when the global financial community needs them, digital-native exchanges need the right mix of performance, scale, and reliability.

Google Cloud’s intentional design philosophy helps ensure that exchanges are positioned to meet these requirements today and, more importantly, into the future. Exchanges working with Google Cloud can benefit from our leadership and investments, including work on next-generation network protocols such as Falcon, a reliable low-latency hardware transport we have contributed to the Association for Computing Machinery and the Internet Engineering Task Force. Exchanges will also continue to see industry-specific performance improvements through our work on the open-source low-latency Aeron messaging framework and our partnership with Adaptive.

If you want to run Aeron performance tests in your own environment, you can request the Aeron Performance Testing guide. The Aeron team also has a set of infrastructure provisioning modules that help with the setup and deployment of the benchmark tests. Please get in touch with the Aeron team for help with test setup on Google Cloud. And as always, don’t hesitate to reach out to your Google Cloud team with any questions you may have.


1. Titanium underpins Google’s workload-optimized infrastructure | Google Cloud Blog
2. Titanium: A robust foundation for workload-optimized cloud computing
