NFS Hangups

Scenario

Our players today: a parallel filesystem, an NFS server, and some network shenanigans. One of the ways you can build shared network storage is to take the above ingredients and form them into the following shape.

NB. Emphasis on 'one'. This cat comes in many forms, and you should be able to build this at home if you want to. Don't, but have fun.

The parallel filesystem, in this case GPFS, is spread across 12 nodes altogether, two per leaf switch. Each of these leaf switches has its own IPv4 subnet. These are then grouped together in groups of 4 to form a network pod, and then joined up until we have a nice fat tree network.

To route between leaves, each of our switches acts as a router and uses BGP to advertise routes to other pods; within a group of 4, however, we can simply switch. This only covers the physical IPs each node has. NFS is served from a Virtual IP (VIP) that can move to any node, so we also need BGP running on each node to tell the routers where that VIP can currently be reached. The VIP is a secondary IP added to the primary interface of a node. When it needs to move, the IP is deleted from that interface and added on another node, and BGP updates its neighbours that the VIP is now found via the new path.
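The address half of a VIP move boils down to two `ip addr` operations. A minimal Python sketch that only prints the commands rather than running them; CES does all of this for you, and the VIP, the /32 prefix length, and the interface names are made-up values for illustration, not taken from the real cluster:

```python
def vip_move_commands(vip, dev_old, dev_new, prefix=32):
    """Return the `ip addr` argv lists that move a secondary IP between nodes.

    The first command would run on the node losing the VIP, the second on
    the node gaining it. BGP on the gaining node then advertises the route.
    The /32 default prefix is an assumption, not something CES mandates.
    """
    cidr = f"{vip}/{prefix}"
    return [
        ["ip", "addr", "del", cidr, "dev", dev_old],  # on the old node
        ["ip", "addr", "add", cidr, "dev", dev_new],  # on the new node
    ]


# Hypothetical VIP and interface names.
for argv in vip_move_commands("10.0.100.10", "eth0", "eth0"):
    print(" ".join(argv))
```

Between the delete and the BGP update reaching every router, traffic for the VIP can still be delivered to the old node, which is where the rest of this story starts.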

Since we are using GPFS for this, our life choices have led us to use CES (Cluster Export Services) to manage these VIPs and our NFS servers. This service is IBM's way of managing NFS on top of GPFS. Each of our 12 nodes will have an NFS server running and a minimum of two VIPs.

In this environment, clients of NFS will use the kernel client almost all of the time. I have not seen any userspace clients - not to say there are none - but in this we are exclusively a kernel client consumer. The clients are running kernel version 5.15.125. They mount their share as NFSv3 with the default options. No new features like nconnect. In this setup they are given a random VIP from the DNS round robin for the cluster. One reason a VIP may move is to balance out resources between competing clients. An admin can tell CES to move a VIP, and CES can also move VIPs itself according to its own balancing parameters. A VIP also needs to move if the NFS server crashes, the node is removed, or the node crashes.

Whenever one of these occurs, the customer complains that their IO has stalled or is slow, that their application is seeing NFS timeouts, or worse, that their program crashed because the stall lasted too long for it.

Testing

In some cases it was thought this was just CES's fault, where even-coverage was broken and VIPs would move, flopping back and forth between nodes. Even if you think to pin VIPs to nodes, they would still need to be moved back if nodes go down, interrupting a client's work.

If any kind of VIP move causes an issue, then we can test by picking a VIP specifically, running a large amount of reads or writes from the client, and moving the VIP between two of our storage nodes.

fio --iodepth=1 \
  --name job --group_reporting=1 \
  --ioengine=libaio --direct=1 \
  --bs=1M --filename=/scratch/fio.dat \
  --size=500GB --rw=write \
  --numjobs=32 --rate_iops=1

FIO can be used to simulate some client work. Packet captures for both client and servers can be set up:

Client

tshark -f 'ip host 10.0.1.191' -s 256 -i eth0 -w - | tshark -NNn -r - -Y "not rpc"

Server

tshark -f 'ip host 10.0.2.16' -s 256 -i eth0 -w - | tshark -NNn -r -

We can also watch the state of TCP sockets on the client using BPF:

./tcpstates -4 -D 2049,32767
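Alongside fio and the captures, it helps to put a number on the stall as the client sees it. A rough Python sketch, assuming a default hard mount where a blocked write simply waits until the server is back; `probe_stalls` and the probe file name are made up for illustration:

```python
import os
import time


def probe_stalls(path, samples, interval=1.0, threshold=1.0):
    """Write and fsync one byte `samples` times and report slow iterations.

    Returns (worst_latency_seconds, stalls), where stalls is a list of
    (iteration, latency_seconds) for every iteration at or above `threshold`.
    """
    worst = 0.0
    stalls = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for i in range(samples):
            start = time.monotonic()
            os.pwrite(fd, b"x", 0)  # always overwrite the first byte
            os.fsync(fd)            # force the write out to the server
            latency = time.monotonic() - start
            worst = max(worst, latency)
            if latency >= threshold:
                stalls.append((i, latency))
            time.sleep(interval)
    finally:
        os.close(fd)
    return worst, stalls
```

Running something like `probe_stalls("/scratch/probe.dat", samples=600)` while moving the VIP gives the longest single stall in seconds, which can then be lined up against the packet capture timestamps.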

sequenceDiagram
    participant C as Client
    participant A as Node A
    participant B as Node B

    rect rgb(234,243,222)
    C->>A: GETATTR
    A-->>C: Reply
    end

    Note over A,B: CES moves VIP from A to B

    rect rgb(252,235,235)
    C->>A: GETATTR
    A-->>C: RST
    C->>A: Retransmissions (exponential backoff)
    Note right of C: NFS client holds socket open despite RST. Keepalive and retransmit are derived from mount options.
    end

    rect rgb(234,243,222)
    C->>B: SYN
    B-->>C: SYN-ACK
    C->>B: GETATTR (IO resumes)
    end

What is observed is a long recovery between the reset being sent and the session recovering on the new node. In some cases the keepalive would find the server dead while the socket was still being held. In others, the timeout would need to expire before the session resumed. To see what else is going on, we need to look at the NFS client itself.

When tested on 5.15.125, the timeout function looks like this:

static void xs_tcp_set_socket_timeouts(struct rpc_xprt *xprt,
        struct socket *sock)
{
    struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
    unsigned int keepidle;
    unsigned int keepcnt;
    unsigned int timeo;

    spin_lock(&xprt->transport_lock);
    keepidle = DIV_ROUND_UP(xprt->timeout->to_initval, HZ);
    keepcnt = xprt->timeout->to_retries + 1;
    timeo = jiffies_to_msecs(xprt->timeout->to_initval) *
        (xprt->timeout->to_retries + 1);
    clear_bit(XPRT_SOCK_UPD_TIMEOUT, &transport->sock_state);
    spin_unlock(&xprt->transport_lock);

    /* TCP Keepalive options */
    sock_set_keepalive(sock->sk);
    tcp_sock_set_keepidle(sock->sk, keepidle);
    tcp_sock_set_keepintvl(sock->sk, keepidle);
    tcp_sock_set_keepcnt(sock->sk, keepcnt);

    /* TCP user timeout (see RFC5482) */
    tcp_sock_set_user_timeout(sock->sk, timeo);
}

This function sets the keepalive parameters and the user timeout on the TCP socket. Changes in kernel versions 3.12 and 4.2 made the client derive these settings from the mount options instead of the system sysctl settings.

This means that with the defaults (timeo=600, retrans=2):

* keepidle will be 60s before a keepalive probe is sent
* keepintvl will be 60s between each probe
* keepcnt is 3 probes (to_retries + 1) before the connection is dead

With this, an NFS client on Linux only gives up on a connection that has stopped responding after 180s (3 min), which is the TCP user timeout of 60s x 3. Now even if the new server is ready and the VIP has moved, the old session may still have to wait that long before it can resume.
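The arithmetic in `xs_tcp_set_socket_timeouts()` is easy to replay for any pair of mount options. A small Python sketch, assuming `to_initval` is `timeo` converted from tenths of a second to jiffies, `to_retries` equals `retrans`, the NFS-over-TCP defaults are `timeo=600` and `retrans=2`, and `HZ` is 1000 (the function and dictionary keys are names made up here):

```python
HZ = 1000  # assumed kernel tick rate


def socket_timeouts(timeo=600, retrans=2, hz=HZ):
    """Replay the arithmetic of xs_tcp_set_socket_timeouts().

    timeo is in tenths of a second, as in nfs(5); retrans maps to to_retries.
    """
    to_initval = timeo * hz // 10                  # jiffies
    keepidle = -(-to_initval // hz)                # DIV_ROUND_UP(to_initval, HZ)
    keepcnt = retrans + 1
    user_timeout_ms = (to_initval * 1000 // hz) * (retrans + 1)
    return {
        "keepidle_s": keepidle,
        "keepintvl_s": keepidle,   # the kernel reuses keepidle for the interval
        "keepcnt": keepcnt,
        "user_timeout_ms": user_timeout_ms,
    }


print(socket_timeouts())                       # defaults
print(socket_timeouts(timeo=100, retrans=1))   # a tighter pair of options
```

With the defaults this gives a 60s idle time, a 60s probe interval, `keepcnt` of `to_retries + 1`, and a 180000 ms user timeout. Dropping to `timeo=100` with one retry shrinks the user timeout to 20000 ms.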

Along with this, if the client reuses the same port, it can be stuck in TIME_WAIT or in another state waiting for the port to be closed. This has to be cleaned up before resuming IO. I found from AWS EFS that you can set noresvport. This has the behaviour that in situations like this, the client can abandon the old port, let it close on its own time, and start a new session on some other port. It also has the effect of changing the path over an ECMP network, and any connection tracking systems in between would see a new session.

Some installs will have F-RTO enabled (the net.ipv4.tcp_frto sysctl). This will add one round trip time of delay before retransmitting. As far as I understand, this is meant for links like WiFi, where retransmission timeouts are often spurious. Since you are probably not going to be running a setup like this over WiFi, you can set net.ipv4.tcp_frto to 0.
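Before changing it, it is worth checking what a client is actually running with. A minimal Python sketch that reads the value from `/proc/sys`; the helper names are made up, and the `root` parameter only exists so the function can be pointed at a test directory:

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl such as 'net.ipv4.tcp_frto'."""
    return (Path(root) / name.replace(".", "/")).read_text().strip()


def frto_enabled(root="/proc/sys"):
    """True when net.ipv4.tcp_frto is anything other than 0."""
    return read_sysctl("net.ipv4.tcp_frto", root) != "0"
```

Turning it off is then `sysctl -w net.ipv4.tcp_frto=0` as root, plus a sysctl.d entry if it should survive a reboot.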

Another thing that can occur here is a run of TCP packet retransmissions. This can increase the time for the session to recover in some kernel versions. In this version, lowering net.ipv4.tcp_retries2 from its default of 15 can affect the recovery time, but this is where the work needs to be remeasured.
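To get a feel for what that number means in seconds, the backoff can be modelled directly. A hedged Python sketch, assuming the retransmission timer starts at Linux's 200 ms floor (`TCP_RTO_MIN`), doubles on every retry, and is capped at 120 s (`TCP_RTO_MAX`); real connections start from a measured RTO, so treat the results as a lower bound:

```python
RTO_MIN_MS = 200        # Linux TCP_RTO_MIN
RTO_MAX_MS = 120_000    # Linux TCP_RTO_MAX


def retries2_timeout_ms(retries, rto_ms=RTO_MIN_MS):
    """Hypothetical time before TCP gives up after `retries` retransmissions.

    Sums the wait before each retransmission plus the final wait after the
    last one, doubling the RTO each time up to the cap.
    """
    total = 0
    for _ in range(retries + 1):
        total += min(rto_ms, RTO_MAX_MS)
        rto_ms *= 2
    return total


for n in (15, 8, 5, 3):
    print(f"tcp_retries2={n}: {retries2_timeout_ms(n) / 1000:.1f}s")
```

For the default of 15 this model gives 924.6 s, the same figure the kernel's ip-sysctl documentation quotes. It models the sysctl alone: a socket with a TCP user timeout set, as the function above does, may give up sooner.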

Reducing the failover time

When put together, the last time this was tested, recovery time from a failover was reduced by mounting with:

noresvport
timeo=100
retrans=1

This is probably the minimum you can tune depending on your kernel version. It is worth looking at the logic that is implemented depending on your version.
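Put together as a mount command, that looks roughly like the sketch below. It is written in Python so the options can be templated per client; the server name and export path are hypothetical, `/scratch` is borrowed from the fio job above, and the retry count is spelled `retrans`, which is the option name nfs(5) documents:

```python
def nfs_mount_argv(server, export, mountpoint, timeo=100, retrans=1):
    """Build the argv for an NFSv3 mount with the failover-friendly options.

    timeo is in tenths of a second, so the default of 100 here means 10s.
    """
    options = ["vers=3", "noresvport", f"timeo={timeo}", f"retrans={retrans}"]
    return ["mount", "-t", "nfs", "-o", ",".join(options),
            f"{server}:{export}", mountpoint]


# Hypothetical DNS round-robin name and export path.
print(" ".join(nfs_mount_argv("nfs.example.com", "/gpfs/scratch", "/scratch")))
```

The list form can be handed straight to `subprocess.run()` as root; printing it first makes it easy to check what a client would actually be given.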

What's next?

Having looked at this again, both F-RTO and tcp_retries2 are worth experimenting with further. The list of experiments is:

measure the BGP convergence time
measure the ARP update time
measure NDP if using IPv6
further changes to this code in version 6 kernels
NFSv4.1 session trunking
testing server-side RST when moving a VIP
measuring tcp_retries2 and the impact of congestion
continue experimenting

What is presented here are notes and recollections from work. Experiment in your own test environment before changing knobs in production. For myself, it warrants redoing this and improving on the testing methods.

Let's go break NFS