Enable TCP keepalive on Redis cluster bus (gossip inside cluster)

Rajiv Sharma
Apr 8, 2020


Redis Cluster

I was checking whether anyone else had tried to enable TCP keepalive on the Redis cluster bus, and I found this question:

https://stackoverflow.com/questions/54802865/redis-cluster-bus-port-connections-does-not-have-keepalive-on-it

I ran into this problem myself and implemented a solution. If you are looking for one, read on.

Let’s find out whether Redis uses keepalive for its clients and for peer nodes in the cluster.

Starting with Redis clients:

Redis provides a TCP keepalive setting for client connections. It is configurable in the redis.conf file:

tcp-keepalive

A non-zero value of “tcp-keepalive” (for example, tcp-keepalive 60) uses SO_KEEPALIVE to send TCP ACKs to clients in the absence of communication. This is useful for two reasons:

  1. Detect dead peers.
  2. Keep the connection alive from the point of view of network equipment in the middle.
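For context, this is the relevant fragment of redis.conf (the value 60 is illustrative, not a recommendation):

```
# redis.conf — probe idle client connections after 60 seconds of silence.
# 0 disables keepalive; Redis 3.2+ defaults to 300.
tcp-keepalive 60
```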

Let’s see how cluster works internally:

Every Redis Cluster node has an additional TCP port for receiving incoming connections from other Redis Cluster nodes. This port is at a fixed offset from the normal TCP port used to receive incoming connections from clients. To obtain the Redis Cluster port, 10000 should be added to the normal commands port. For example, if a Redis node is listening for client connections on port 7000, the Cluster bus port 17000 will also be opened.

The cluster bus also handles tasks such as auto-discovering other nodes, detecting non-working nodes, and promoting a replica to master when a failure occurs. All the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions.
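Because the offset is fixed, the bus port can be derived directly from the client port. A quick sketch in shell:

```shell
# The cluster bus port is always the client port plus a fixed offset of 10000.
CLIENT_PORT=7000
BUS_PORT=$((CLIENT_PORT + 10000))
echo "Client port: $CLIENT_PORT, cluster bus port: $BUS_PORT"
# prints: Client port: 7000, cluster bus port: 17000
```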

The cluster bus connections on port 17000 are not established with SO_KEEPALIVE. We can see this in the open connections listed by the following command:

netstat -ton | grep 17000

Output:

tcp 0 0 10.143.4.86:17000 10.143.4.89:42636 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.143.4.86:17000 10.143.4.86:34100 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.143.4.86:17000 10.143.4.87:45852 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.143.4.86:37218 10.143.4.87:17000 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.143.4.86:40130 10.143.4.87:17000 ESTABLISHED off (0.00/0/0)

The timer column shows “off”: these TCP connections have no keepalive timer. There is no way for an administrator to know when such a connection has become invalid, and it will be kept open indefinitely.

Linux itself does not provide a default behaviour that puts a keepalive timer on every connection created to or from the machine; each application must enable SO_KEEPALIVE on its own sockets.
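What Linux does provide are system-wide keepalive tunables, but they only take effect on sockets that have opted in with SO_KEEPALIVE. The paths below are standard on Linux; the typical defaults are noted in comments:

```shell
# Kernel-wide keepalive parameters; inert for sockets without SO_KEEPALIVE.
cat /proc/sys/net/ipv4/tcp_keepalive_time    # idle seconds before the first probe (typically 7200)
cat /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between probes (typically 75)
cat /proc/sys/net/ipv4/tcp_keepalive_probes  # unanswered probes before the connection is dropped (typically 9)
```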

To add the timer on these connections, there are two ways:

  1. Modify the source code and send a pull request to the maintainers. If you are not a programmer, you can simply ask the maintainers to set SO_KEEPALIVE while creating connections.
  2. Use libkeepalive: library preloading.

I was looking for a solution and found this library. I read about it a little and tried it out.

It consists of a shared library that overrides the socket() system call in most binaries, without the need to recompile or modify them. The technique is based on the preloading feature of the ld.so(8) loader included in Linux, which allows you to force the loading of shared libraries with higher priority than normal.

Let’s see how libkeepalive can help us here.

Download and extract the shared library source:

wget https://excellmedia.dl.sourceforge.net/project/libkeepalive/libkeepalive/0.3/libkeepalive-0.3.tar.gz
tar -xzvf libkeepalive-0.3.tar.gz

Next, build it and place the shared library in /usr/lib:

cd libkeepalive-0.3
make
cp libkeepalive.so /usr/lib/

Let’s test whether it works for us:

$ cd test
$ ./test
SO_KEEPALIVE is OFF
$ LD_PRELOAD=/usr/lib/libkeepalive.so \
> KEEPCNT=20 \
> KEEPIDLE=180 \
> KEEPINTVL=60 \
> ./test
SO_KEEPALIVE is ON
TCP_KEEPCNT = 20
TCP_KEEPIDLE = 180
TCP_KEEPINTVL = 60

Kudos! It works. Let’s try this with the Redis cluster:

$ LD_PRELOAD=/usr/lib/libkeepalive.so KEEPCNT=20 KEEPIDLE=180 KEEPINTVL=60 /usr/bin/redis-server /etc/redis.conf

In another terminal, check the output of

netstat -ton | grep ESTABLISHED | grep 17000

tcp 0 0 10.143.4.86:41046 10.143.4.87:17000 ESTABLISHED keepalive (166.82/0/0)
tcp 0 0 10.143.4.86:17000 10.143.4.89:44282 ESTABLISHED keepalive (166.82/0/0)
tcp 0 0 10.143.4.86:17000 10.143.4.89:44284 ESTABLISHED keepalive (162.73/0/0)
tcp 0 0 10.143.4.86:35134 10.143.4.89:17000 ESTABLISHED keepalive (162.73/0/0)
tcp 0 0 10.143.4.86:37218 10.143.4.87:17000 ESTABLISHED keepalive (164.78/0/0)
tcp 0 0 10.143.4.86:17000 10.143.4.87:47532 ESTABLISHED keepalive (169.38/0/0)

The timers are on. We are ready to make the change permanent in the service file:

service file: /usr/lib/systemd/system/redis-master.service
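The change itself is to add the LD_PRELOAD and keepalive environment variables to the unit’s [Service] section. A sketch of the relevant part, assuming the paths and values used above (adjust for your setup):

```
[Service]
Environment="LD_PRELOAD=/usr/lib/libkeepalive.so"
Environment="KEEPCNT=20"
Environment="KEEPIDLE=180"
Environment="KEEPINTVL=60"
ExecStart=/usr/bin/redis-server /etc/redis.conf
```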

Reload the daemon and restart the service

systemctl daemon-reload
service redis-master stop
service redis-master start

This is how I enabled keepalive timers in my Redis cluster. If you want to do it in your production environment, please comment on this blog and I will help.

Moreover, you can use the same shared library with any other process to get SO_KEEPALIVE without modifying its code.

For more details on TCP keepalive and tuning it at the Linux level, refer to the following link:

https://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/

For more information about Redis Cluster, refer to the following link:

https://redis.io/topics/cluster-spec
