Fix stale NFS mounts on Linux without rebooting
I have often noticed that some folks reboot systems to fix stale NFS mount problems, which can be disruptive.
Fortunately, that often isn’t necessary. In most cases it is enough to restart the nfs and autofs services. However, that sometimes fails because user processes have files open on the stale partition or users are cd’ed to the stale partition.
Both conditions are easy to fix. The steps below address each of them in turn.
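Before killing anything, it can be worth confirming that the mount really is unresponsive, since a plain ls on a stale partition can hang. Here is a minimal sketch of such a check. The function name mount_responds and the 5 second default are my own assumptions, and it relies on the timeout and stat commands from GNU coreutils being installed.

```shell
# Return 0 if the path answers a stat call within the time limit and
# non-zero otherwise. The name and the 5 second default are assumptions.
mount_responds() {
    path=$1
    limit=${2:-5}
    # timeout kills stat if it is still blocked when the limit expires.
    # SIGKILL is used on the assumption that a process blocked on NFS may
    # not react to gentler signals. stat also fails on its own if the
    # kernel reports a stale file handle.
    timeout -s KILL "$limit" stat "$path" >/dev/null 2>&1
}
```

Calling mount_responds /stale/fs before and after the steps below gives a yes or no answer without tying up the terminal.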
Step 1. Kill processes with open files on the partition
Use lsof to find the processes that have files open on the partition and then kill those processes using kill or pkill.
% # Find the jobs that are accessing the stale partition and kill them.
% kill -9 $(lsof |\
egrep '/stale/fs|/export/backup' |\
awk '{print $2;}' |\
sort -fu )
% # Restart the NFS and AUTOFS services
% service nfs stop
% service autofs stop
% service nfs start
% service autofs start
% # Check it
% ls /stale/fs
Typically this is sufficient, but if it fails, go on to step 2.
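The egrep in step 1 matches the path anywhere on the line, so it would also catch a path like /stale/fs2. Here is a minimal sketch of a stricter filter, assuming the usual lsof layout where the PID is the second column. The function name pids_using_path is hypothetical. It reads lsof output on stdin so that it can be tried on saved output first.

```shell
# Read lsof output on stdin and print the unique PIDs of processes that
# have the given directory, or anything beneath it, open.
# Assumes the PID is in column 2, as in default lsof output.
pids_using_path() {
    awk -v dir="$1" '
        # Skip the header and any line whose second field is not a PID.
        $2 ~ /^[0-9]+$/ {
            for (i = 3; i <= NF; i++) {
                # Match the directory itself or a path below it, so that
                # /stale/fs does not also match /stale/fs2.
                if ($i == dir || index($i, dir "/") == 1) {
                    print $2
                    break
                }
            }
        }
    ' | sort -un
}
```

The result of lsof piped through pids_using_path /stale/fs can be saved in a variable and passed to kill -9 only if it is not empty, which avoids the usage error kill prints when it gets no PIDs.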
Step 2. Kill processes that have cd’ed to the partition
Look at the current working directory of every user process. Any process whose working directory is on the partition has to be killed.
% # List the users that are cd’ed to the stale partition and kill their jobs.
% # NOTE: change /stale/fs to the path to your stale partition.
% kill -9 $( for u in $( who | awk '{print $1;}' | sort -fu ) ; do \
pwdx $(pgrep -u $u) |\
grep '/stale/fs' |\
awk -F: '{print $1;}' ; \
done)
% # umount the stale partition
% umount -f /stale/fs
% # Restart the NFS and AUTOFS services
% service nfs stop
% service autofs stop
% service nfs start
% service autofs start
% # Check it
% ls /stale/fs
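The who loop in step 2 only sees processes that belong to logged-in users. Here is a minimal sketch of an alternative that checks every process by reading the cwd links under /proc, assuming a Linux system with /proc mounted. The function name pids_with_cwd_under and its optional second argument are hypothetical. The second argument exists only so the function can be tried against a fake directory tree.

```shell
# Print the PIDs of processes whose current working directory is the given
# directory or lies beneath it. The optional second argument (default /proc)
# is a hypothetical hook for testing against a fake tree.
pids_with_cwd_under() {
    dir=$1
    proc_root=${2:-/proc}
    for link in "$proc_root"/[0-9]*/cwd; do
        # readlink reads the symlink itself rather than following it.
        # Links we are not allowed to read are skipped.
        target=$(readlink "$link" 2>/dev/null) || continue
        case $target in
            "$dir"|"$dir"/*)
                pid=${link%/cwd}      # strip the trailing /cwd
                echo "${pid##*/}"     # keep only the PID component
                ;;
        esac
    done
}
```

Run as root, pids_with_cwd_under /stale/fs prints one PID per line, and that list can be passed to kill -9 in place of the loop above.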
Step 3. Kill all of the user processes
If step 2 doesn’t work then there is something strange going on, but killing all of the user processes will usually fix it. That is done as follows.
% # Kill all user processes.
% # NOTE: this also kills your own session if you appear in the who output.
% for u in $( who | awk '{print $1;}' | sort -fu ) ; do \
kill -9 $(pgrep -u $u) ; \
done
% # umount the stale partition
% umount -f /stale/fs
% # Restart the NFS and AUTOFS services
% service nfs stop
% service autofs stop
% service nfs start
% service autofs start
% # Check it
% ls /stale/fs
As you can see, it is basically the same as step 2 except that all user processes are killed.
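Since the same four service commands appear in every step, they could be wrapped in a small function. Here is a minimal sketch, assuming the service names nfs and autofs used above are the right ones on your system. The function name restart_nfs_autofs and the DRY_RUN variable are hypothetical additions.

```shell
# Stop and then start the nfs and autofs services, in the same order as the
# steps above. With DRY_RUN=1 the commands are printed instead of executed.
# DRY_RUN is a hypothetical knob added for this sketch.
restart_nfs_autofs() {
    for step in "nfs stop" "autofs stop" "nfs start" "autofs start"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "service $step"
        else
            # $step is left unquoted on purpose so that it splits into the
            # service name and the action.
            service $step
        fi
    done
}
```

Running it once with DRY_RUN=1 shows exactly what would be executed before doing it for real.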
If that doesn’t work, you need to resort to the nuclear option: rebooting.
Step 4. Reboot
This is the option of last resort but it should always work.
If you know of any other tips for fixing stale NFS mounts, I would really like to hear about them.