Running LoxiLB on AWS Graviton2-based EC2 instance
With the growing need for low-latency applications, network performance has become the most important chokepoint when deploying any kind of workload behind a load balancer. In this blog post, we will discuss network performance metrics such as throughput, minimum and average latency, connections per second (CPS), and requests per second (RPS) on AWS Graviton2. For our benchmark tests, we put LoxiLB under the knife, along with IPVS and haproxy, on an AWS Graviton2-based EC2 instance. AWS Graviton2 processors are based on 64-bit Arm Neoverse cores and custom silicon designed by AWS for optimized performance and cost. The reason for choosing Arm for this benchmarking report is that Arm is well known for its performance per watt and computational efficiency, making it well suited for edge use cases where power consumption is a real challenge.
To spawn a Graviton2 EC2 instance, we need to select the following in the AWS dashboard:
To install LoxiLB and run the tests, one can follow the scripts here.
We have used a very simple network topology: a service (netserver) with 3 endpoints runs behind the load balancer, which distributes connections in round-robin fashion. We used a single AWS Graviton2-based EC2 instance (m6g.metal, arm64, 64 vCPUs) to run the client, the load balancer, and the workloads, in order to evaluate software-only performance. The same environment was used to compare LoxiLB with IPVS and haproxy. We used the well-known performance tools iperf (for throughput) and netperf (for CPS, RPS, and latency).
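The round-robin distribution described above can be sketched in a few lines of Python; the endpoint addresses here are illustrative stand-ins, not the ones used in the benchmark:

```python
from itertools import cycle

# Hypothetical endpoint addresses, for illustration only.
endpoints = ["10.0.0.1:12865", "10.0.0.2:12865", "10.0.0.3:12865"]

rr = cycle(endpoints)

def pick_endpoint():
    """Return the next endpoint in round-robin order."""
    return next(rr)

# Six incoming connections land on the three endpoints twice each, in order.
assigned = [pick_endpoint() for _ in range(6)]
print(assigned)
```

In a real load balancer the scheduler state lives in the data plane, but the dispatch logic is exactly this: each new connection goes to the next endpoint in the rotation.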
Let's start with the most common performance metric - throughput, the fastball pitch of benchmarking tests. When it comes to a load balancer's capacity to handle traffic, throughput is a critical metric: if the load balancer cannot process all the incoming traffic, it becomes a performance bottleneck. In a throughput test, the test application sends a smaller number of large packets, since the goal is to measure raw data-transfer capacity rather than per-packet overhead.
We evaluated LoxiLB, IPVS, and haproxy with 100 parallel TCP streams using iperf.
The results show LoxiLB achieving a maximum throughput of around 533 Gbps, while IPVS gives it a tough fight at just over 500 Gbps. Haproxy does not even reach the halfway mark. The reason for LoxiLB's better throughput is simple: its eBPF-based core engine hooks in at the TC layer of the Linux kernel stack. A packet is processed and sent out at that same layer, bypassing all the layers above it. IPVS also runs in the kernel, but it cannot avoid traversing the whole kernel stack - hence more processing and less throughput. And since haproxy does everything in user space, it chokes at high throughput.
Connections per second
The paramount metric for an L4 stateful load balancer, if you will. This metric tells you how many L4 connections can be established (tracked and registered) per second. Tracking a new connection is usually costly, as the connection needs to be validated before being inserted into the CT (connection-tracking) table/cache. Hence, how fast a load balancer can track a new connection becomes one of its most important performance indices.
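The cost described above can be pictured with a toy connection-tracking table. A real CT table also handles TCP state machines, timeouts, and concurrency, all of which this sketch (with made-up field names) omits:

```python
# Toy connection-tracking (CT) table keyed by the 5-tuple.
ct_table = {}

def track(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Validate and register a new connection; return True if it was new."""
    key = (src_ip, src_port, dst_ip, dst_port, proto)
    if key in ct_table:
        return False  # already tracked - the cheap fast path
    # In a real load balancer, this step (validation + insertion) is the
    # expensive part that limits connections per second.
    ct_table[key] = {"state": "NEW"}
    return True

first = track("192.0.2.1", 40001, "203.0.113.5", 80)   # new connection
second = track("192.0.2.1", 40001, "203.0.113.5", 80)  # same 5-tuple again
print(first, second, len(ct_table))
```

The CPS benchmark effectively hammers the slow path of this logic: every transaction is a brand-new 5-tuple that must be registered.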
We used netperf with the TCP_CRR option. For each connection/transaction, netperf opens a TCP connection, exchanges a request and a response, and then closes the connection. That means it creates a new connection for every transaction.
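What TCP_CRR does per transaction can be mimicked over the loopback interface: open a fresh connection, exchange one request/response, close. The tiny echo server and message below are stand-ins for netserver, not its actual protocol:

```python
import socket
import threading

# Minimal echo server standing in for netserver.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
srv.listen()
port = srv.getsockname()[1]

def serve(n):
    for _ in range(n):
        conn, _ = srv.accept()
        with conn:
            conn.sendall(conn.recv(64))

transactions = 5
threading.Thread(target=serve, args=(transactions,), daemon=True).start()

completed = 0
for _ in range(transactions):
    # A brand-new TCP connection per transaction, exactly as TCP_CRR does.
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(b"request")
        assert s.recv(64) == b"request"
    completed += 1

print(f"{completed} connect/request/response/close transactions done")
```

Every iteration exercises the load balancer's connection-tracking slow path, which is why TCP_CRR is the standard proxy for CPS.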
In the previous test, IPVS lagged behind LoxiLB by only about 5%, but in this test the difference in performance is quite substantial. LoxiLB is the clear winner here, leaving IPVS way behind. The reason is again the same: LoxiLB's eBPF-based core engine tracks connections faster, and hence scales better than IPVS and haproxy.
Requests per second
This benchmark puts less stress on the load balancer than the previous one, because it evaluates the number of requests processed over a single connection. In other words, this test measures how quickly and efficiently a network packet is processed - i.e., latency. An application that processes packets faster will naturally provide lower latency. In this test, more and more requests are sent over the same connection to find out the request-processing power of the load balancer.
We used netperf with the TCP_RR option for this test. For consistency, we used 100 netperf streams here as well.
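TCP_RR differs from TCP_CRR in exactly one way: all request/response round trips reuse a single connection, so connection tracking happens once. A loopback sketch (server and message are illustrative, not netserver's real protocol):

```python
import socket
import threading

srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
srv.listen()
port = srv.getsockname()[1]

def serve(n):
    conn, _ = srv.accept()  # one connection for the whole run
    with conn:
        for _ in range(n):
            conn.sendall(conn.recv(64))

rounds = 100
threading.Thread(target=serve, args=(rounds,), daemon=True).start()

# One connection, many request/response round trips - this is TCP_RR.
done = 0
with socket.create_connection(("127.0.0.1", port)) as s:
    for _ in range(rounds):
        s.sendall(b"ping")
        assert s.recv(64) == b"ping"
        done += 1

print(f"{done} round trips over a single connection")
```

Because the connection is established once, the number this test produces is dominated by per-packet processing cost rather than connection setup.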
We can see that LoxiLB processes requests much more efficiently than IPVS and haproxy.
Minimum and Average Latency
In the previous section, we noted that more requests per second means lower latency. Let's find out the actual numbers with netperf.
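Given a list of per-transaction round-trip times, the minimum and average latency that such a test reports are just the following (the RTT samples below are made up for illustration):

```python
# Hypothetical RTT samples in microseconds, for illustration only.
rtts_us = [38.0, 41.5, 39.2, 120.7, 40.1]  # one outlier, as happens under load

min_latency = min(rtts_us)
avg_latency = sum(rtts_us) / len(rtts_us)

print(f"min: {min_latency:.1f} us, avg: {avg_latency:.1f} us")
# A single slow transaction barely moves the minimum but pulls the
# average up - which is why both numbers are worth reporting.
```

The minimum approximates the best-case processing path, while the average reflects what the load balancer delivers under sustained load.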
LoxiLB stands out in all the benchmarking tests. The simple reason for these results is LoxiLB's eBPF core engine, which acts like a one-stop shop for packet processing, and does so inside the kernel. Packets do not bounce around the kernel after processing. On the other hand, packets in IPVS have to traverse the whole kernel stack before and after processing, resulting in unnecessary lag. And haproxy has the highest latency because it runs in user space and can never match LoxiLB's or IPVS's performance.
Note: While conducting this benchmark, we observed that haproxy was not able to complete every run; we saw "Connection reset by peer" errors many times.
We hope you liked our blog. Soon, we will publish results for a multi-node scenario as well, in which the client, load balancer, and workloads run on separate servers.
To know more about:
Why eBPF (LoxiLB) performs better than IPVS, please read here.
Similar performance tests/results for x86 can be found here.