
Bringing Sub-Second Resilience to Kubernetes Clusters


A service in a Kubernetes cluster must be highly available because downtime can translate into significant business loss. Many organizations increasingly emphasize the need for highly available solutions to ensure uninterrupted operations. Traditional HA measures offer a certain degree of reliability, but sub-second resilience in your cluster changes the game: it goes beyond traditional measures by providing almost instantaneous failover, ensuring minimal to zero downtime. This matters most for services such as voice applications and financial transactions, where an interruption not only degrades the customer experience but can also cost business and reputation.


There are plenty of ways to achieve high availability. For example, Keepalived with HAProxy is a common choice, but it raises some important questions:

  • "Can Keepalived be run in any cloud environment?" - No.

  • "If I use HAProxy or any other LB, will it improve the cluster performance?" - No! ( Check LoxiLB vs Cilium vs MetalLB and LoxiLB vs IPVS vs HAProxy for more details)

  • And lastly, "Do they provide sub-second HA?" - No.


Earlier, LoxiLB released its "Hitless High Availability" feature and received an enthusiastic response from the community, along with feedback encouraging us to target sub-second failover detection. For this, we implemented the Bidirectional Forwarding Detection (BFD) protocol natively. BFD is a protocol used to rapidly detect failures in the forwarding path between two network devices. It is primarily employed in scenarios where fast failover detection is critical for maintaining high availability and minimizing service disruption. Its key aspects, such as rapid failure detection, protocol-agnostic design, and minimal overhead, make it an essential tool for ensuring high availability and reliability.


In this article, we will explore how LoxiLB can help achieve seamless sub-second resiliency without any disruption to the services in your cluster.


Cluster Diagram


The idea behind this demo is very simple: sub-second failover without service disruption. In our setup, we have prepared 4 Kubernetes nodes, 2 LoxiLB nodes, and an external client, and we run BFD between the 2 LoxiLB nodes. A fail-safe strategy has two main aspects: detection and action. BFD handles detection. Once a failure is detected, the action is to elect a new MASTER, which BFD also does, and then to advertise the service IP from the new MASTER. The latter part is done by LoxiLB and depends on the cluster architecture. To explain further, BFD elects one LoxiLB instance as MASTER and the other as BACKUP. A LoadBalancer service is created, and a service IP is allocated and advertised accordingly.
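In outline, the setup looks like this (a rough sketch; the addresses are the ones used in the kube-loxilb manifest later in this article):

                        External client
                               |
                     ---- 192.168.80.0/24 ----
                     |                       |
        llb1 (192.168.80.252) <--BFD--> llb2 (192.168.80.253)
                     \                       /
              service VIP 192.168.80.5 (held by MASTER)
                               |
                   Kubernetes cluster (4 nodes)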


To keep things simple, we have configured everything in a single subnet. In this example, the service IP is used as a virtual IP, or floating IP, which means that on failure or fresh election it is attached to the MASTER LoxiLB and announced with a gratuitous ARP (gARP) message. However, if you wish to use a different subnet for externalServiceIPCIDR, then it is advisable to set up BGP peering with the client and/or the cluster endpoints. In that case, the service IP is advertised with different BGP metrics from the MASTER and BACKUP LoxiLB instances. In some scenarios, for example in cloud environments where gARP does not work or BGP route forwarding is not supported from the native VM, LoxiLB can call cloud-native APIs as part of the action to do the necessary configuration in that environment.
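For the BGP case, kube-loxilb already exposes the relevant knobs; they appear commented out in the manifest applied below, along the lines of:

args:
- --setBGP=65111                    # local ASN for the LoxiLB BGP speaker
- --extBGPPeers=192.168.90.9:64512  # external peer as IP:ASN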


Install LoxiLB


Install kube-loxilb

LoxiLB provides kube-loxilb, a load balancer spec implementation for K8s, which is needed to bring LoxiLB into action. You can download kube-loxilb from GitHub, change it if needed, and apply it in one of the K8s nodes.

$ cd kube-loxilb/manifest/ext-cluster
$ vi kube-loxilb.yaml

You may need to make a few changes: find the loxiURL args and replace the IP addresses with those of your LoxiLB instances (facing the Kubernetes network):


containers:
  - name: kube-loxilb
    image: ghcr.io/loxilb-io/kube-loxilb:latest
    imagePullPolicy: Always
    command:
      - /bin/kube-loxilb
    args:
      #- --setRoles=0.0.0.0
      - --loxiURL=http://192.168.80.252:11111,http://192.168.80.253:11111
      - --externalCIDR=192.168.80.5/32
      - --setLBMode=2
      #- --setBGP=65111
      #- --extBGPPeers=192.168.90.9:64512
      #- --listenBGPPort=1791 # mandatory if running with Calico CNI

Now, simply apply it:

$ sudo kubectl apply -f kube-loxilb.yaml
serviceaccount/kube-loxilb created
clusterrole.rbac.authorization.k8s.io/kube-loxilb created
clusterrolebinding.rbac.authorization.k8s.io/kube-loxilb created
deployment.apps/kube-loxilb created

Note: The "setRoles" option enables mastership arbitration in kube-loxilb, which can detect and set active-standby roles for LoxiLB instances, usually within a few seconds. This option should not be enabled in the presence of another reliable external active-standby detection mechanism, such as BFD in our case.


Prepare LoxiLB instances

Now, we have to set up the LoxiLB instances in separate VMs.

In the first VM, run LoxiLB (llb1) as:

docker run -u root --cap-add SYS_ADMIN   --restart unless-stopped --privileged -dit -v /dev/log:/dev/log --net=host --name loxilb ghcr.io/loxilb-io/loxilb:latest --cluster=<llb2-ip> --self=0 --ka=<llb2-ip>:<llb1-ip>

In the second VM, run LoxiLB (llb2) as:

docker run -u root --cap-add SYS_ADMIN --restart unless-stopped --privileged -dit -v /dev/log:/dev/log --net=host --name loxilb ghcr.io/loxilb-io/loxilb:latest --cluster=<llb1-ip> --self=1 --ka=<llb1-ip>:<llb2-ip>
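A quick note on the HA-related flags used above (a summary of how they apply to this setup):

--self=0|1            identifies this instance within the LoxiLB pair
--cluster=<peer-ip>   peer LoxiLB address used for cluster state sync
--ka=<peer>:<self>    runs native BFD towards <peer>, sourced from <self>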

Create the service
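A minimal sketch of what tcp-fullnat.yml may look like is shown below (the loxilb.io annotations, selector labels, and nginx test image are assumptions modeled on LoxiLB's sample manifests; the service port matches the output further down):

apiVersion: v1
kind: Service
metadata:
  name: tcp-lb-fullnat
  annotations:
    loxilb.io/liveness: "yes"    # enable endpoint liveness checks
    loxilb.io/lbmode: "fullnat"  # match the fullnat mode set via --setLBMode=2
spec:
  loadBalancerClass: loxilb.io/loxilb
  selector:
    what: tcp-fullnat-test
  ports:
    - port: 56002
      targetPort: 80
  type: LoadBalancer
---
apiVersion: v1
kind: Pod
metadata:
  name: tcp-fullnat-test
  labels:
    what: tcp-fullnat-test
spec:
  containers:
    - name: tcp-fullnat-test
      image: ghcr.io/loxilb-io/nginx:stable
      ports:
        - containerPort: 80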

$ kubectl apply -f tcp-fullnat.yml
service/tcp-lb-fullnat created
pod/tcp-fullnat-test created

$ kubectl get svc
NAME              TYPE           CLUSTER-IP     EXTERNAL-IP        PORT(S)            AGE
kubernetes        ClusterIP      172.17.0.1     <none>             443/TCP            24m
tcp-lb-fullnat    LoadBalancer   172.17.58.84   llb-192.168.80.5   56002:32448/TCP    16m
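From the external client, you can check reachability through the service VIP (this assumes the test pod serves HTTP on its target port, as in the sketch above); a simple polling loop also makes the failover visible from the client side:

$ curl http://192.168.80.5:56002
$ while true; do curl -s -o /dev/null -w "%{http_code}\n" --connect-timeout 1 http://192.168.80.5:56002; sleep 0.2; done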

Verify HA Status

From llb1, run the following commands to check the status:

$ sudo docker exec -it loxilb loxicmd get bfd -o wide
| INSTANCE |  REMOTEIP  |  SOURCEIP  | PORT | INTERVAL  | RETRY COUNT | STATE |
|----------|------------|------------|------|-----------|-------------|-------|
| default  | 172.17.0.4 | 172.17.0.3 | 3784 | 200000 us |           3 | BFDUp |

$ sudo docker exec -it loxilb loxicmd get hastate
| INSTANCE | HASTATE |
|----------|---------|
| default  | BACKUP  |

$ sudo docker exec -it loxilb ip addr show dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever

From llb2, run the following commands to check the status:

$ sudo docker exec -it loxilb loxicmd get bfd -o wide
| INSTANCE |  REMOTEIP  |  SOURCEIP  | PORT | INTERVAL  | RETRY COUNT | STATE |
|----------|------------|------------|------|-----------|-------------|-------|
| default  | 172.17.0.3 | 172.17.0.4 | 3784 | 200000 us |           3 | BFDUp |

$ sudo docker exec -it loxilb loxicmd get hastate
| INSTANCE | HASTATE |
|----------|---------|
| default  | MASTER  |

$ sudo docker exec -it loxilb ip addr show dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 192.168.80.5/32 scope global lo
       valid_lft forever preferred_lft forever
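Notice that the service IP 192.168.80.5/32 appears only on the MASTER's loopback (llb2 here), while the BACKUP holds just 127.0.0.1. On failover, this address moves to the new MASTER and is re-announced via gARP.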

We can also fine-tune the BFD session parameters to detect failures even faster:

$ # loxicmd set bfd <remoteIP> --interval=<time in usec> --retryCount=<value>
$   loxicmd set bfd 172.17.0.4 --interval=100000 --retryCount=2
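As a rule of thumb, the failure detection time is roughly interval × retryCount: the default session above (200000 us × 3) detects a failure in about 600 ms, while the tuned values (100000 us × 2) bring this down to about 200 ms, both well within a second.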

A small demo video showing the seamless HA failover:
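If you prefer to trigger a failover by hand, one simple illustration (assuming the setup above) is to stop the MASTER instance and watch the BACKUP take over:

# On llb2 (the current MASTER), simulate a failure:
$ sudo docker stop loxilb

# On llb1, the role should flip to MASTER within the BFD detection window:
$ sudo docker exec -it loxilb loxicmd get hastate
| INSTANCE | HASTATE |
|----------|---------|
| default  | MASTER  |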




Conclusion

In this demo, we saw how to bring high resilience to a Kubernetes cluster by putting the right tools in place. As a next step, we will explore some VoIP applications and conduct similar tests with LoxiLB and BFD.
