Archive for January, 2009

Munin

Sunday, January 4th, 2009

Because I use Splunk to track the logs on my home server, I have set up some reports that show me the level of errors relative to the total number of log lines, which lets me notice trends. One file that has cropped up a lot is /var/munin/munin-update.log, which is the file my Munin master logs to. The particular error that keeps cropping up is:

Jan 03 20:20:44 [3622] - Client reported timeout in fetching of cpu_tmp_sensors

In Munin this shows up as a broken graph.

What is happening is that the Munin node times out while waiting for the response from the plugin, and then passes the timeout on to the Munin master. I couldn’t find any documentation on this timeout, not even its default value, other than by looking in the source code itself.
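To see how often each plugin times out without going through Splunk, a few lines of Python are enough. This is only a minimal sketch: the regular expression is based on the single log line quoted above (other Munin versions may word it differently), and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: timeout lines look like the one quoted in this post.
TIMEOUT_RE = re.compile(r"Client reported timeout in fetching of (\S+)")


def summarize_timeouts(lines):
    """Return (total_non_empty_lines, Counter of timeouts per plugin)."""
    total = 0
    per_plugin = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        match = TIMEOUT_RE.search(line)
        if match:
            per_plugin[match.group(1)] += 1
    return total, per_plugin


def timeout_ratio(total, per_plugin):
    """Fraction of log lines that are timeout errors (0.0 for an empty log)."""
    if total == 0:
        return 0.0
    return sum(per_plugin.values()) / total


# Small demo: the first line is the one from the post, the second is invented.
sample = [
    "Jan 03 20:20:44 [3622] - Client reported timeout in fetching of cpu_tmp_sensors",
    "Jan 03 20:20:45 [3622] - some other log message",
]
total, per_plugin = summarize_timeouts(sample)
print(per_plugin.most_common(), timeout_ratio(total, per_plugin))
```

To run it against the real log, pass an open file handle for /var/munin/munin-update.log to `summarize_timeouts` instead of the sample list.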

Some Googling did reveal that I’m not the only person to have noticed this, though. It is reported that if you add the keyword “timeout 60” (or whatever value you want, in seconds) then Munin will use this as a global default timeout for the plugins. It is also reported that if you place this in the scope of your plugin configuration in /etc/munin/plugin-conf.d/<your plugin config file> like this:

[myplugin]
timeout 60
user root

then it will apply the timeout value only to that plugin. That makes sense. It didn’t help me solve the problem with my CPU temp sensor, but it’s still useful to know what is going on behind the scenes.
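Before picking a timeout value it helps to know how long the plugin actually takes. The sketch below is not how Munin enforces its timeout; it just runs an arbitrary command with a time limit and reports how long it took. The helper name `time_command` is hypothetical.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout_s):
    """Run cmd (a list of arguments) and return (elapsed_seconds, timed_out)."""
    start = time.monotonic()
    try:
        # Output is discarded; only the run time matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        timed_out = True
    return time.monotonic() - start, timed_out


# Demo with a harmless command: the current Python interpreter doing nothing.
elapsed, timed_out = time_command([sys.executable, "-c", "pass"], timeout_s=10)
print(f"took {elapsed:.2f}s, timed out: {timed_out}")
```

On a Munin node the command could be something like `["munin-run", "cpu_tmp_sensors"]`, assuming munin-run is installed and on the PATH, with `timeout_s` set to the value you are thinking of configuring.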

Splunk 3.4: Memory Optimizations?

Saturday, January 3rd, 2009

My Splunk version 3.3.1 seemed to be having some issues with my SSHFS mounts - actually it was an underlying file system problem - so I decided to update to 3.4. Version 3.4.3 is out now, so I figured that the bugs in the first 3.4 release should have been fixed.

The major new features touted in 3.4 are Windows compatibility and the addition of a lightweight forwarding application. If you’re using the free license version of Splunk, the forwarding application is rather meaningless anyway, because the free license does not permit ingress of Splunk data from other servers. So in theory it’s a relatively meaningless update feature-wise; I was simply hoping they had matured and optimized the code base.

The verdict? Well, it looks like they have. I’ve only been running the updated version for about 12 hours, but it’s sitting on 20% less virtual memory and about 10% more real memory. What does that mean? Well, the amount of memory that was committed to Splunk used to start off at about 600M and rise until plateauing at just over 1G. Committed memory is memory that the application has requested but is not necessarily using. If the application isn’t actually using it, then the memory system can allocate more memory elsewhere. The general idea (hope) is that if the committed memory is actually called upon, the kernel will be able to free up real memory elsewhere in order to respond to that real request.

In my opinion, such a large commitment of memory is silly. The Splunk application (3.3) was requesting 1G of memory, but only using 200M. Wha-? Committing 5x the amount of actual memory consumed?? It’s not the first time I’ve seen it, but it is the first time I’ve seen it in a (on my configuration) single-threaded server application.

So now I have 220M of real memory in use by Splunk, which is fine; I guess it needs it and is doing something useful with it. It has also requested 800M of memory, so it’s still requesting just under 4x what it is using, but hey, it’s better than before! I wonder if they tuned the memory usage or just tuned something else that incidentally resulted in better memory usage…
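For anyone wanting to check the same ratio on their own box, here is a rough sketch that pulls the virtual and resident sizes out of the text of /proc/<pid>/status on Linux. It assumes the “requested” figure corresponds to VmSize and the “real” figure to VmRSS, which may not be exactly what the graphs above measure. The function names are invented, and the sample text is built from the round numbers in this post rather than real output.

```python
def parse_vm_fields(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("VmSize", "VmRSS"):
            parts = rest.split()  # expected form: "<number> kB"
            if parts:
                fields[key] = int(parts[0])
    return fields


def virtual_to_resident_ratio(fields):
    """How many times larger the virtual size is than the resident size."""
    return fields["VmSize"] / fields["VmRSS"]


# Illustrative input only: 800M virtual and 220M resident, expressed in kB.
sample_status = "Name:\texample\nVmSize:\t  819200 kB\nVmRSS:\t  225280 kB\n"
fields = parse_vm_fields(sample_status)
print(fields, round(virtual_to_resident_ratio(fields), 2))
```

With real data, read the file for the Splunk process ID and pass its contents to `parse_vm_fields`.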

Note that I haven’t changed anything in my configuration; I’ve just upgraded the Debian package, restarted Splunk and left it running for a while. The memory usage isn’t rising after the initial start, so it seems to have stabilized.

Oh, by the way, the file system problem was a result of SSHFS failing the SSH connection and not reconnecting correctly. Actually I already knew about that and I had crontab remounting the file systems every half an hour. Of course, with Splunk reading off the file system, they weren’t unmounting properly, which was also causing the remount to fail (-o remount does not seem to work with SSHFS). The solution was just to do a lazy unmount, which allows the remount to work correctly (umount -l /mnt/xxxx).
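The lazy-unmount-then-mount sequence could be wrapped up for cron roughly like this. It is a sketch under assumptions, not the script I actually run: the function names are hypothetical, the remote and mount point are supplied by the caller, and it assumes `umount` and `sshfs` are on the PATH and that the job runs with enough privileges to use them.

```python
import subprocess


def remount_commands(remote, mountpoint):
    """Build the two commands: lazy unmount, then a fresh sshfs mount."""
    return [
        ["umount", "-l", mountpoint],   # lazy unmount, works even while files are open
        ["sshfs", remote, mountpoint],  # remote has the form [user@]host:[dir]
    ]


def remount(remote, mountpoint, run=subprocess.run):
    """Lazily unmount mountpoint and mount remote on it again."""
    unmount_cmd, mount_cmd = remount_commands(remote, mountpoint)
    # The unmount may fail if nothing is mounted; that is not an error here.
    run(unmount_cmd, check=False)
    # A failed mount should be noticed, so let it raise CalledProcessError.
    run(mount_cmd, check=True)
```

The `run` parameter exists only so the sequence can be exercised without touching real mounts; a cron job would call `remount` with the default.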