Bug 288293 - Reduce load on NFS server by removing MYSQL, increase RAM
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: Servers
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Eclipse Webmaster CLA
Reported: 2009-09-01 14:29 EDT by Denis Roy CLA
Modified: 2011-07-11 09:25 EDT
Description Denis Roy CLA 2009-09-01 14:29:30 EDT
This has been happening for a while, but it now happens often and for long stretches, and it's severely crippling performance.

Something is causing the NFS server to consume all the CPU cycles.  Fetching a shared file while this nonsense is happening is crazy slow, and since some of our sites (such as Bugzilla) have shared data that needs to be accessed on each hit, this introduces tremendous lag.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24042 root      15   0     0    0    0 R   48  0.0   1114:27 [nfsd]
24047 root      16   0     0    0    0 R   46  0.0   1099:57 [nfsd]
24041 root      15   0     0    0    0 R   46  0.0   1114:14 [nfsd]
24043 root      15   0     0    0    0 R   45  0.0   1114:22 [nfsd]
24040 root      15   0     0    0    0 R   44  0.0   1114:10 [nfsd]
24046 root      15   0     0    0    0 R   44  0.0   1113:23 [nfsd]
24045 root      15   0     0    0    0 R   42  0.0   1116:47 [nfsd]
24044 root      15   0     0    0    0 R   42  0.0   1115:45 [nfsd]
Comment 1 Denis Roy CLA 2009-09-01 15:43:06 EDT
viewvc (the actual Python app) was generating tremendous load, in that each page hit was loading libraries in a sloppy way:
nfsd_lookup(fcntl)
nfsd_lookup(fcntl.so)
nfsd_lookup(fcntlmodule.so)
nfsd_lookup(fcntl.py)
nfsd_lookup(fcntl.pyc)
nfsd_lookup(StringIO)
nfsd_lookup(StringIO.so)
nfsd_lookup(StringIOmodule.so)
nfsd_lookup(StringIO.py)
nfsd_lookup(StringIO.pyc)
nfsd_lookup(mimetypes)
nfsd_lookup(mimetypes.so)
nfsd_lookup(mimetypesmodule.so)
nfsd_lookup(mimetypes.py)
nfsd_lookup(mimetypes.pyc)
nfsd_lookup(struct)
nfsd_lookup(struct.so)
nfsd_lookup(structmodule.so)
nfsd_lookup(struct.py)
nfsd_lookup(struct.pyc)
nfsd_lookup(compat)
nfsd_lookup(compat.so)
nfsd_lookup(compatmodule.so)
nfsd_lookup(compat.py)
nfsd_lookup(compat.pyc)
nfsd_lookup(calendar)
nfsd_lookup(calendar.so)
nfsd_lookup(calendarmodule.so)
nfsd_lookup(calendar.py)
nfsd_lookup(calendar.pyc)
nfsd_lookup(datetime)
nfsd_lookup(datetime.so)
nfsd_lookup(datetimemodule.so)
nfsd_lookup(datetime.py)
nfsd_lookup(datetime.pyc)
nfsd_lookup(config)
nfsd_lookup(config.so)
nfsd_lookup(configmodule.so)
nfsd_lookup(config.py)
nfsd_lookup(config.pyc)

mod_python would likely take care of this, but I moved it to the nodes' local disks instead.
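
The trace above is consistent with how a Python 2 interpreter resolves an import: for each module it probes a package directory, two extension-module names, then the source and bytecode files. A minimal sketch that reproduces the probe names seen in the log; the suffix list is copied from the trace rather than from the interpreter source, and `probe_names` and `lookups_per_hit` are hypothetical helpers, not part of viewvc.

```python
# Suffixes observed in the nfsd_lookup trace above, in the order they appear.
# Assumption: one search-path directory per module, as in the quoted trace.
PROBE_SUFFIXES = ["", ".so", "module.so", ".py", ".pyc"]

# Module names taken from the trace.
MODULES = ["fcntl", "StringIO", "mimetypes", "struct",
           "compat", "calendar", "datetime", "config"]


def probe_names(module):
    """Return the file names probed for one module in one directory."""
    return [module + suffix for suffix in PROBE_SUFFIXES]


def lookups_per_hit(modules, path_entries=1):
    """Estimate NFS lookups per page hit if nothing is cached between hits."""
    return len(modules) * len(PROBE_SUFFIXES) * path_entries


example = probe_names("fcntl")
total = lookups_per_hit(MODULES)
```

With the eight modules in the trace this gives 40 lookups per hit, which matches the number of lines logged. Every extra directory searched before the module is found multiplies that figure.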

FWIW: this is a nifty debugging tool when you want to know WTF is my NFS server doing???
echo 32 > /proc/sys/sunrpc/nfsd_debug; tail -f /var/log/messages | grep lookup
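
When the debug output is too noisy to read by eye, it can be summarized. A rough sketch, assuming each relevant log line contains `nfsd_lookup(<name>)` as in the trace quoted above; real kernel messages may carry extra fields inside the parentheses, so only the text after the last comma is used. `count_lookups` and `stem` are hypothetical helpers.

```python
import re
from collections import Counter

# Assumed line shape: "... nfsd_lookup(<optional fields,> name)".
LOOKUP_RE = re.compile(r"nfsd_lookup\(([^)]*)\)")

# Longer suffixes first so "fcntlmodule.so" maps to "fcntl", not "fcntlmodule".
SUFFIXES = ("module.so", ".so", ".pyc", ".py")


def stem(name):
    """Strip a known Python file suffix so all probes for one module group together."""
    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name


def count_lookups(lines):
    """Count lookups per module stem across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOOKUP_RE.search(line)
        if match:
            name = match.group(1).split(",")[-1].strip()
            counts[stem(name)] += 1
    return counts


sample = [
    "nfsd_lookup(fcntl)",
    "nfsd_lookup(fcntl.so)",
    "nfsd_lookup(fcntlmodule.so)",
    "nfsd_lookup(fcntl.py)",
    "nfsd_lookup(fcntl.pyc)",
    "nfsd_lookup(struct.py)",
    "some unrelated log line",
]
counts = count_lookups(sample)
```

Passing the lines of a saved log file to `count_lookups` and printing `counts.most_common(10)` shows which names are hammering the server.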
Comment 2 Karl Matthias CLA 2009-09-02 13:39:05 EDT
Nice sleuthing, Denis!  The local copy fix is a good solution for lots of reasons. :)
Comment 3 Denis Roy CLA 2009-11-26 22:23:29 EST
I went on the hunt looking for clues, since nfsd CPU usage has been very high for about 2 days now, which is unusual.

Using cssh to control all the dev nodes at the same time, I issued killall -SIGSTOP rsyncd and each nfsd process instantly fell below 10%, where it should be.  Resuming the processes would then cause CPU usage to jump back to 40% per process.

I then terminated a couple of rsync processes until CPU load returned to normal.  It seems we're just hitting a limit as to how much traffic a single NFS server can handle _efficiently_.
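
The pause-and-resume test can be tried safely on a single process before pointing it at anything important. A minimal POSIX-only sketch that uses a throwaway child process as a stand-in for an rsync process; the helper names are hypothetical, and this is not the cssh/killall procedure used on the nodes.

```python
import os
import signal
import subprocess
import sys


def pause(pid):
    """Freeze a process without killing it (same signal as killall -SIGSTOP)."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a frozen process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def wait_until_stopped(pid):
    """Block until a direct child changes state; True if it is now stopped."""
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


# Stand-in workload: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

pause(child.pid)
stopped = wait_until_stopped(child.pid)  # watch server load while it is frozen

resume(child.pid)
child.terminate()  # clean up the stand-in process
child.wait()
```

A stopped process keeps its memory and open files, so if load drops while it is frozen and returns on resume, that process is a likely contributor and nothing has been lost by testing.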

Next step is to investigate separating dev and download from the 5-node cluster to attempt to better leverage file cache for both CVS and download, and increase the RAM in the three download servers.  We need to stop going to disk all the time.

node1   cvs
node2   cvs
node3   rsync+download
node4   rsync+download
node5   rsync+download
Comment 4 Denis Roy CLA 2009-11-27 00:26:48 EST
I think I've solved this.
Comment 5 David Williams CLA 2009-12-05 23:32:20 EST
(In reply to comment #4)
> I think I've solved this.

How? Briefly, in terms I can understand? Sorry if it's stated in your earlier posts and I just don't know how to see it. 

I'm just curious.
Comment 6 Denis Roy CLA 2009-12-07 09:45:18 EST
FWIW, I didn't really solve the problem, but I a) now know why it happens, and b) did tweak NFS so that it has less impact on performance.

We were running the default server setting of 8 threads (nfsd processes).  I had assumed that each process was multi-threaded, but it isn't.  When all 8 were in D state (IO Wait), all of NFS was blocked waiting for I/O.  As it turns out, if all 8 are waiting for the same I/O (disk array), then NFS cannot serve requests for files on other arrays.  By increasing the threads to 32, NFS is no longer totally blocked by one single disk array.
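
The blocking effect can be illustrated with a toy queueing model. This is a sketch under simplifying assumptions, not a model of the real nfsd: all requests arrive at once, they are served first-come first-served by a fixed pool of workers, and the durations (10 time units for a request to the busy array, 1 unit for a request to an idle array) are invented for illustration. `finish_times` is a hypothetical helper.

```python
import heapq


def finish_times(num_workers, durations):
    """Return the finish time of each request, served FIFO by a fixed worker pool."""
    free_at = [0.0] * num_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker takes the request
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


# 16 slow requests to the busy array are queued ahead of one fast request
# to an otherwise idle array.
requests = [10.0] * 16 + [1.0]

fast_with_8 = finish_times(8, requests)[-1]    # 21.0: waits behind two rounds of slow I/O
fast_with_32 = finish_times(32, requests)[-1]  # 1.0: a free worker picks it up at once
```

The model ignores the fact that a saturated array would not serve 16 concurrent requests any faster than 8. It only shows why the request for the idle array stops queueing once there are more workers than blocked requests.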

NFS will still consume large amounts of CPU time when it's very busy and disk arrays are heavily used -- but at least now it puts lots of bits on the wire, instead of just crawling like before.

Our plan now is to a) increase RAM on dev.eclipse.org to maximize file cache, avoiding NFS as much as possible, and b) remove MySQL from the NFS servers by purchasing separate servers, to allocate more RAM on the NFS server to file cache (and to liberate CPU cycles).  I'll leave this bug open until both those events happen.
Comment 7 Denis Roy CLA 2011-03-10 15:33:05 EST
The plan from comment 6 was put into place, and we're golden now.