This has been happening for a while, but now it happens often, and for long stretches, and it's severely crippling performance. Something is causing the NFS server to consume all the CPU cycles. Fetching a shared file while this nonsense is happening is crazy slow, and since some of our sites have shared data that needs to be accessed on each hit (such as Bugzilla), this introduces tremendous lag.

  PID   USER  PR  NI  VIRT  RES  SHR  S  %CPU  %MEM    TIME+  COMMAND
24042   root  15   0     0    0    0  R    48   0.0  1114:27  [nfsd]
24047   root  16   0     0    0    0  R    46   0.0  1099:57  [nfsd]
24041   root  15   0     0    0    0  R    46   0.0  1114:14  [nfsd]
24043   root  15   0     0    0    0  R    45   0.0  1114:22  [nfsd]
24040   root  15   0     0    0    0  R    44   0.0  1114:10  [nfsd]
24046   root  15   0     0    0    0  R    44   0.0  1113:23  [nfsd]
24045   root  15   0     0    0    0  R    42   0.0  1116:47  [nfsd]
24044   root  15   0     0    0    0  R    42   0.0  1115:45  [nfsd]
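(For anyone reproducing this: a minimal sketch of how to snapshot the nfsd kernel threads and see whether they are burning CPU (R) or stuck in uninterruptible I/O wait (D). These commands are generic, not taken from the original report.)

  # One-shot snapshot of the nfsd kernel threads; R = running on CPU, D = blocked in I/O wait
  top -b -n 1 | grep '\[nfsd\]'

  # Count how many nfsd threads are currently stuck in D state
  ps -eo stat,comm | awk '$2 == "nfsd" && $1 ~ /^D/' | wc -l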
viewvc (the actual Python app) was generating tremendous load, in that each page hit was loading libraries in a sloppy way:

nfsd_lookup(fcntl)
nfsd_lookup(fcntl.so)
nfsd_lookup(fcntlmodule.so)
nfsd_lookup(fcntl.py)
nfsd_lookup(fcntl.pyc)
nfsd_lookup(StringIO)
nfsd_lookup(StringIO.so)
nfsd_lookup(StringIOmodule.so)
nfsd_lookup(StringIO.py)
nfsd_lookup(StringIO.pyc)
nfsd_lookup(mimetypes)
nfsd_lookup(mimetypes.so)
nfsd_lookup(mimetypesmodule.so)
nfsd_lookup(mimetypes.py)
nfsd_lookup(mimetypes.pyc)
nfsd_lookup(struct)
nfsd_lookup(struct.so)
nfsd_lookup(structmodule.so)
nfsd_lookup(struct.py)
nfsd_lookup(struct.pyc)
nfsd_lookup(compat)
nfsd_lookup(compat.so)
nfsd_lookup(compatmodule.so)
nfsd_lookup(compat.py)
nfsd_lookup(compat.pyc)
nfsd_lookup(calendar)
nfsd_lookup(calendar.so)
nfsd_lookup(calendarmodule.so)
nfsd_lookup(calendar.py)
nfsd_lookup(calendar.pyc)
nfsd_lookup(datetime)
nfsd_lookup(datetime.so)
nfsd_lookup(datetimemodule.so)
nfsd_lookup(datetime.py)
nfsd_lookup(datetime.pyc)
nfsd_lookup(config)
nfsd_lookup(config.so)
nfsd_lookup(configmodule.so)
nfsd_lookup(config.py)
nfsd_lookup(config.pyc)

mod_python would likely take care of this, but I moved it to the nodes' local disks instead.

FWIW, this is a nifty debugging tool when you want to know "WTF is my NFS server doing?":

echo 32 > /proc/sys/sunrpc/nfsd_debug; tail -f /var/log/messages | grep lookup
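(Aside on why the lookups come in groups of five: for each "import foo", Python of that era probes foo.so, foomodule.so, foo.py and foo.pyc in every directory on sys.path, and when the app lives on an NFS mount each probe becomes an nfsd_lookup on the server. A hedged sketch of how one might confirm this and clean up afterwards; the module names are just examples:)

  # Show where a module actually resolves from after moving viewvc to local disk
  # (Python 2-era print syntax, matching viewvc of that vintage)
  python -c 'import mimetypes; print mimetypes.__file__'

  # Trace the file probes the interpreter makes for a single import; the *.py/*.pyc/*.so
  # misses on an NFS-mounted sys.path are what show up as nfsd_lookup server-side
  strace -f -e trace=open,stat python -c 'import fcntl' 2>&1 | grep -E '\.(py|pyc|so)'

  # Turn the sunrpc nfsd debug logging back off when done
  echo 0 > /proc/sys/sunrpc/nfsd_debug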
Nice sleuthing, Denis! The local copy fix is a good solution for lots of reasons. :)
I went on the hunt looking for clues, since nfsd CPU usage has been very high for about 2 days now, which is unusual. Using cssh to control all the dev nodes at the same time, I issued

killall -SIGSTOP rsyncd

and each nfsd process instantly fell below 10%, where it should be. Resuming the processes would then cause CPU usage to jump back to 40% per process. I then terminated a couple of rsync processes until CPU load returned to normal.

It seems we're just hitting a limit on how much traffic a single NFS server can handle _efficiently_. Next step is to investigate separating dev and download duties within the 5-node cluster, to better leverage file cache for both CVS and download, and to increase the RAM in the three download servers. We need to stop going to disk all the time.

node1: cvs
node2: cvs
node3: rsync+download
node4: rsync+download
node5: rsync+download
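(The pause-and-resume trick above generalizes into a quick way to identify which client workload is hammering an NFS server. A sketch, assuming the daemon processes match the name used above and that you run it on every node via cssh or a loop:)

  # Pause the rsync daemons on every node without killing them
  killall -SIGSTOP rsyncd

  # ...watch nfsd CPU usage on the server drop (or not) with top, then resume them
  killall -SIGCONT rsyncd

  # On the server, compare NFS operation counts before and after to see whose traffic dominates
  nfsstat -s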
I think I've solved this.
(In reply to comment #4)
> I think I've solved this.

How? Briefly, in terms I can understand? Sorry if it's stated in your earlier posts and I just didn't see it. I'm just curious.
FWIW, I didn't really solve the problem, but I a) now know why it happens, and b) did tweak NFS so that it has less impact on performance.

We were running the default server setting of 8 threads (nfsd processes). I had assumed that each process was multi-threaded, but it isn't. When all 8 were in D state (I/O wait), all of NFS was blocked waiting for I/O. As it turns out, if all 8 are waiting for I/O on the same disk array, then NFS cannot serve requests for files on other arrays. By increasing the thread count to 32, NFS is no longer totally blocked by one single disk array. NFS will still consume large amounts of CPU time when it's very busy and the disk arrays are heavily used -- but at least now it puts lots of bits on the wire, instead of just crawling like before.

Our plan now is to a) increase RAM on dev.eclipse.org to maximize file cache, avoiding NFS as much as possible, and b) remove MySQL from the NFS servers by purchasing separate servers, to allocate more RAM on the NFS server to file cache (and to liberate CPU cycles). I'll leave this bug open until both those events happen.
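(For reference, a sketch of how the nfsd thread count can be raised and checked on a Linux NFS server of that era; the persistent config location is distro-dependent, so treat the file path as an assumption:)

  # Raise the number of nfsd threads on the fly
  rpc.nfsd 32

  # To make it persist on Red Hat-style systems, set this in /etc/sysconfig/nfs (assumed path):
  # RPCNFSDCOUNT=32

  # The "th" line reports the thread count and how often the whole pool was busy;
  # if the pool is frequently saturated, more threads (or more servers) are needed
  grep ^th /proc/net/rpc/nfsd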
The plan from comment 6 was put into place, and we're golden now.