Wednesday, July 11, 2012

How to track down a high server load

SkyHi @ Wednesday, July 11, 2012
One common issue with a dedicated server (or a Virtual Private Server, for that matter) is that it can be quite difficult to track down the cause of a high server load.  Most people just write it off as inevitable, but something can be done about it.  If you have a busy site, you'll want to tweak your server to handle high loads as well as possible.  But how can you find out what is causing the high load?  It's simple (kind of)...  These steps are for a *nix-based server (Linux, Unix, and, I think, FreeBSD).

Find out what's causing the high load

        In *nix, there's a really handy command called top.  It displays information about the currently running processes.  With some of its options and a little output redirection, we can get a glimpse into what's causing our high load.  Here's the command...
top -b -i -n 20 >> ./top_procs
        That tells top to run in "batch" mode (no waiting for user input), skip idle processes so that only active ones are shown, loop 20 times, and append the output to the file top_procs in the current directory.  Run that command while you are experiencing a high server load.  Then you can look through the contents of that file, either in your favorite editor (vim?) or simply with "less ./top_procs".  That will give you a bunch of output like this:
top - 11:06:36 up 69 days,  2:53,  0 users,  load average: 0.02, 0.05, 0.07 
Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie 
Cpu(s):  2.3% us,  0.5% sy,  0.0% ni, 97.1% id,  0.2% wa,  0.0% hi,  0.0% si 
Mem:  12278340k total, 12230332k used,    48008k free,   363352k buffers 
Swap: 16386292k total,   157092k used, 16229200k free,  2699912k cached    
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
8066 root      15   0  1888 1032  776 R  0.1  0.0   0:00.02 top  

Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie 
Cpu(s):  2.8% us,  1.5% sy,  0.0% ni, 94.6% id,  1.1% wa,  0.0% hi,  0.0% si 
Mem:  12278340k total, 12230740k used,    47600k free,   361956k buffers 
Swap: 16386292k total,   157092k used, 16229200k free,  2696368k cached   
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
8066 root      15   0  1880  944  704 R  2.0  0.0   0:00.01 top  

top - 11:06:46 up 69 days,  2:53,  0 users,  load average: 0.09, 0.06, 0.07 
Tasks: 137 total,   3 running, 134 sleeping,   0 stopped,   0 zombie 
Cpu(s):  2.2% us,  0.3% sy,  0.0% ni, 97.2% id,  0.3% wa,  0.0% hi,  0.0% si 
Mem:  12278340k total, 12173908k used,   104432k free,   363416k buffers 
Swap: 16386292k total,   157092k used, 16229200k free,  2696988k cached   
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
8066 root      16   0  1888 1032  776 R  0.1  0.0   0:00.03 top  
6103 mailman   16   0 10536 7204 2744 R  0.0  0.1   0:33.08 python2.4  
6108 mailman   16   0 10172 6904 2648 R  0.0  0.1   0:37.92 python2.4
        What does all of that mean? It's really not as bad as it seems.  If you break it down, it's really just 3 repetitions of almost the same information.  Here's what it means, line by line.
  1. Line 1 - General server information: current time, uptime (since the last restart of the server), number of users logged on, and the load average for the last 1, 5, and 15 minutes
  2. Line 2 - Tasks: total number of processes, then how many are actively running, sleeping, stopped, and zombie
  3. Line 3 - CPU usage (User, System, Nice, Idle, I/O Wait, Hardware Interrupts, Software Interrupts).  Just worry about idle, user, system, and I/O wait.
  4. Line 4 - Memory usage (total, used, free, buffers)
  5. Line 5 - Swap usage (used should be at or near 0)
  6. Line 6 - Table header for the process list (Process ID, User, Priority, Nice, Virtual Memory, Resident Size, Shared Size, State, CPU %, Memory %, CPU Time used, Command)
  7. Line 7 onward - The processes themselves...
        Now, what to look for is a process that shows a high CPU % in multiple repetitions and also has a high CPU time.  Be aware that you'll more than likely have a few of them.  Write down the highest ones (most likely MySQL, Apache, etc.).  Now that you know what you need to tweak, let's look at how.
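        Scanning 20 repetitions by eye gets tedious, so here is a minimal sketch of how the file could be summarized.  The function name summarize_top is made up for this example, and it assumes the column layout shown in the sample output above (%CPU in the 9th column, a single-word COMMAND in the 12th).  It only ranks by CPU %, so you would still check the TIME+ column yourself.

```shell
#!/bin/sh
# Summarize a batch-mode top capture such as ./top_procs.
# For each command, print: name, number of process rows seen, peak %CPU.
# Assumption: columns match the sample output (field 9 = %CPU, field 12 = COMMAND).
summarize_top() {
    LC_ALL=C awk '
        # Process rows start with a numeric PID and have at least 12 fields;
        # the summary and header lines never start with a number.
        $1 ~ /^[0-9]+$/ && NF >= 12 {
            seen[$12]++
            if ($9 + 0 > peak[$12]) peak[$12] = $9 + 0
        }
        END {
            for (cmd in seen)
                printf "%s %d %.1f\n", cmd, seen[cmd], peak[cmd]
        }
    ' "$1" | LC_ALL=C sort -k3,3nr -k2,2nr
}
```

        Running "summarize_top ./top_procs" lists the commands with the highest peak CPU % first, and uses the number of appearances to break ties.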

Tweaking MySQL

        If one of the top processes is MySQL, you may need to tweak MySQL for the load.  There are a whole bunch of articles out there on tweaking MySQL, so I'm not going to go into too much detail here.  What you will want to do is raise key_buffer_size, query_cache_size, thread_cache_size, and table_cache to larger values (be careful not to go too big; they can easily eat up all available RAM).  If you want to read more, take a look at Performance Tuning MySQL For Load.
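        Before raising anything, it helps to know what the server is currently using.  The sketch below is one way to ask it; the function names build_query and show_mysql_buffers are made up for this example, and it assumes the mysql command-line client can log in without prompting (for instance through credentials in ~/.my.cnf).  Which variables exist depends on your MySQL version: newer releases call table_cache table_open_cache, and MySQL 8.0 no longer has a query cache, so names the server doesn't know simply won't be listed.

```shell
#!/bin/sh
# Build a SHOW VARIABLES statement for the variable names given as arguments.
build_query() {
    list=""
    for name in "$@"; do
        # Append each name as a quoted, comma-separated SQL string literal.
        list="${list:+$list, }'$name'"
    done
    printf "SHOW GLOBAL VARIABLES WHERE Variable_name IN (%s);\n" "$list"
}

# Print the current values of the settings discussed above.
# Assumption: the mysql client is installed and can authenticate on its own.
show_mysql_buffers() {
    mysql -e "$(build_query key_buffer_size query_cache_size \
        thread_cache_size table_cache table_open_cache)"
}
```

        Call show_mysql_buffers before and after editing your configuration to confirm that the new values actually took effect.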

Tweaking Apache

        Apache may appear in the list as apache or httpd.  Now, I'm not going to get into tweaking Apache for two reasons.  First, I don't use Apache, so I'm not familiar with tweaking it, and second, there is a whole host of articles on the internet devoted to tweaking Apache.  Here's a decent article on Tweaking Apache For Load.

What if it's something else?

        Now this is where things get interesting.  Are you noticing something else using your CPU time?  There are a few common culprits that like to cause high load.  The two biggest ones are SpamAssassin and Sendmail.  If you need to have SpamAssassin running, you should set things up so that all messages marked as spam are discarded to /dev/null (blackholed) rather than delivered.  If you don't need it, disable it...  It's a great program, but it uses a lot of CPU time to do what it does.  Also disable all "Catch All" e-mail accounts (as they add time to the spool).

Ok, so now what?

        So you've tweaked the server.  It's running faster and more efficiently.  For now.  As time goes on, you may need to tweak some more (as your load changes, or resources change, etc.).  That's what administering a server is all about.  Your job is never done.  However, you really should install some kind of server monitoring tool such as SICM or MRTG, and let it watch your server load.  That way you can identify patterns in load and determine whether the problem is too many users or something else.  I also suggest moving away from Apache to Lighttpd, as in my experience it uses less memory and less CPU time, and is significantly faster.  There you have it!
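        Until a proper monitoring tool is in place, a small script run from cron can at least catch the high-load moments for you.  This is only a rough sketch: the function names, the threshold, and the log path are made up for this example, and reading /proc/loadavg is Linux-specific.  It compares the 1-minute load average against a limit and, when the limit is exceeded, appends one top snapshot to a log file.

```shell
#!/bin/sh
# Succeed (exit status 0) when the first number is greater than the second.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# Print the 1-minute load average (Linux only: first field of /proc/loadavg).
current_load() {
    cut -d ' ' -f 1 /proc/loadavg
}

# If the load is above the threshold, append a timestamped top snapshot.
# Usage: capture_if_busy THRESHOLD LOGFILE
capture_if_busy() {
    threshold=$1
    logfile=$2
    load=$(current_load)
    if load_exceeds "$load" "$threshold"; then
        {
            echo "=== $(date) load=$load ==="
            top -b -i -n 1
        } >> "$logfile"
    fi
}
```

        For example, a script that ends with "capture_if_busy 4 /var/log/top_busy.log" could be run from cron every minute; pick a threshold that makes sense for the number of CPU cores you have.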





REFERENCES
http://www.joomlaperformance.com/articles/server_related/how_to_track_down_a_high_server_load_5_16.html