To collect data for applications that collectd's standard plug-ins do not handle, you'll need to write your own custom plug-ins.
There are three methods for adding plug-ins to collectd:
We recommend using the third approach because we've found it to be the most stable and easiest to test and troubleshoot.
Plug-ins that are written in C tend to have better performance, but unless you're collecting a lot of data the language doesn't seem to make a significant difference. We've had compatibility issues with plug-ins written for the Perl interpreter that can be embedded into the collectd process. We recommend using the "exec plug-ins" described on collectd.org.
The operation of exec plug-ins is not fully documented on the collectd site, but there are a few things that you need to know. In the collectd config you simply specify that collectd should launch an executable as a separate process, read the monitoring data appearing on the output of this process, and process this data like all other collected data. It is the responsibility of the plug-in executable to determine where to get the data from, to collect it at the appropriate interval, to name it correctly, and to write it to standard output. Should the plug-in executable terminate, collectd will restart it.
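To make the division of labor concrete, here is a minimal sketch of the parent's side of that contract: start a child process and read whatever it writes to standard output, line by line. This is only a toy illustration, not how collectd itself is implemented. The child script, the example-host/demo/gauge-sample identifier, and the read_plugin_output helper are all made up for this sketch.

```ruby
require 'rbconfig'

# Toy stand-in for an exec plug-in: prints two PUTVAL lines and exits.
# The identifier and values are invented for illustration only.
CHILD_SCRIPT = <<~'RUBY'
  $stdout.sync = true  # flush each line at once; output to a pipe is buffered otherwise
  2.times do |i|
    puts "PUTVAL example-host/demo/gauge-sample #{1207188959 + i * 20}:#{i}"
  end
RUBY

# Launch the command as a separate process and collect every line it
# writes to standard output (hypothetical helper, not part of collectd).
def read_plugin_output(command)
  lines = []
  IO.popen(command) do |pipe|
    pipe.each_line { |line| lines << line.chomp }
  end
  lines
end

collected = read_plugin_output([RbConfig.ruby, '-e', CHILD_SCRIPT])
collected.each { |line| puts "received: #{line}" }
```

The point to take away is that the parent only sees complete lines once the child flushes them, which is why a long-running plug-in should not leave its output sitting in a buffer.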
Let's try a simple example: a plug-in that collects CPU load by calling 'uptime' and parsing the output (no, this is not the best way to do this, but it's a simple example).
#!/usr/bin/env ruby
require 'getoptlong'

# The name of the collectd plugin, something like apache, memory, mysql, interface, ...
PLUGIN_NAME = 'mycpuload'

def usage
  puts("#{$0} -h <host_id> [-i <sampling_interval>]")
  exit
end

# Main
begin
  # Produce a log file; in the future, collectd will do the right thing with STDERR
  STDERR.reopen(File.new("/var/log/#{PLUGIN_NAME}_plugin", File::RDWR | File::CREAT))
  STDERR.sync = true
  # Flush every PUTVAL line immediately; output to a pipe is buffered otherwise
  STDOUT.sync = true

  # Parse command line options
  hostname = nil
  sampling_interval = 20 # sec, default value
  opts = GetoptLong.new(
    [ '--hostid', '-h', GetoptLong::REQUIRED_ARGUMENT ],
    [ '--sampling-interval', '-i', GetoptLong::OPTIONAL_ARGUMENT ]
  )
  opts.each do |opt, arg|
    case opt
    when '--hostid'
      hostname = arg
    when '--sampling-interval'
      sampling_interval = arg.to_i
    end
  end
  usage if !hostname

  # Collection loop
  while true do
    start_run = Time.now.to_i
    next_run = start_run + sampling_interval

    # Collect data and print the values.
    # This captures the first value after "load average:". uptime lists the
    # 1-minute average first, so the 5_minute label below is a misnomer.
    data = `uptime`[/load average: ([\d.]+)/, 1]
    puts("PUTVAL #{hostname}/#{PLUGIN_NAME}/gauge-5_minute_load #{start_run}:#{data}")

    # Sleep until the next sampling time
    while((time_left = (next_run - Time.now.to_i)) > 0) do
      sleep(time_left)
    end
  end
end
If we go ahead and simply run this from the command line, this is what it looks like:
[root@ip-10-251-70-47:/tmp] ./mycpuload -h i-123456
PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188959:0.01
PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188979:0.00
PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188999:0.08
./mycpuload:46:in `sleep': Interrupt
        from ./mycpuload:46
[root@ip-10-251-70-47:/tmp]
If you look carefully at the timestamps (the 10-digit numbers), you'll notice that the three lines are 20 seconds apart. The number at the end of each line is the load reported by uptime.
Now it's time to explain what the i-123456/mycpuload/gauge-5_minute_load string represents. The format of this string is <instance-id>/<plugin>-<plugin_instance>/<type>-<type_instance>. The meaning of each field is:
All this is pretty mysterious at first. Note how a '-' separates plugin and plugin_instance, or type and type_instance, while an '_' is sometimes used within any of these four items. The best way to understand how all this works is to look at where each of these identifiers shows up on the web pages.
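As a rough illustration of the separator rules just described, the sketch below splits an identifier into its parts. It assumes the layout given above, that the instance parts are optional, and that only the first '-' inside a field separates the name from its instance. The parse_identifier helper is hypothetical and not part of collectd.

```ruby
# Split an identifier of the assumed form
#   <host>/<plugin>[-<plugin_instance>]/<type>[-<type_instance>]
# into a hash. Hypothetical helper for illustration only.
def parse_identifier(identifier)
  host, plugin_part, type_part = identifier.split('/', 3)
  if type_part.nil?
    raise ArgumentError, "expected three '/'-separated fields: #{identifier}"
  end

  # Only the first '-' separates a name from its instance; any '_' stays
  # inside the item it belongs to. The host field is left untouched, so a
  # '-' in the host (as in i-123456) is not a separator.
  plugin, plugin_instance = plugin_part.split('-', 2)
  type, type_instance = type_part.split('-', 2)

  { host: host,
    plugin: plugin, plugin_instance: plugin_instance,
    type: type, type_instance: type_instance }
end

parsed = parse_identifier('i-123456/mycpuload/gauge-5_minute_load')
parsed.each { |name, value| puts "#{name}: #{value.inspect}" }
```

For the identifier used in this example there is no plugin_instance, so that field comes back as nil, while the type is gauge and the type_instance is 5_minute_load.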
... to be continued with examples and screenshots ...
Running the program "uptime" requires a fork. Just read the file /proc/loadavg; it is much faster.
The first value is the one you want: it is simply the first whitespace-separated field of the string returned.
$ cat /proc/loadavg
0.00 0.00 0.00 1/66 16862
NOTE: "cat" is also a program. This is just an example; your code will just open the file.
data = File.read("/proc/loadavg").split.first
edited 17:58, 7 May 2008
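Building on the /proc/loadavg suggestion above, here is a small sketch that pulls out all three averages instead of only the first field. On Linux the first three fields of that file are the 1-, 5- and 15-minute load averages. The parse_loadavg helper name is made up for this example, and the sample line is the one shown in the comment.

```ruby
# Parse one line in /proc/loadavg format into the three load averages.
# Hypothetical helper for illustration only.
def parse_loadavg(line)
  one, five, fifteen = line.split.first(3).map { |field| Float(field) }
  { one_minute: one, five_minute: five, fifteen_minute: fifteen }
end

# Sample line copied from the comment above, so this runs on any OS.
sample = '0.00 0.00 0.00 1/66 16862'
averages = parse_loadavg(sample)
puts averages.inspect

# On a Linux host the live values would come from the file itself:
#   averages = parse_loadavg(File.read('/proc/loadavg'))
```

Splitting on whitespace rather than taking a fixed number of characters keeps this working when the load reaches 10.00 or more.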
If so could you please post instructions?
My use case:
I have a cron job that is copying a directory from one server to another using rsync.
Since this is crucial data, I'd like to be notified when this transfer fails, whether because crond is down, the crontab has problems, or rsync reports some kind of error.
Any advice? The files available to Alert specs do not contain any data that could help me, so I guess I'll have to come up with something on my own.
Unfortunately the good old method of cron sending an email when a job fails (= produces output) doesn't work for me either - my email provider blocks all emails coming from the EC2 IP address range. Of course I could set up a proper mail relay but that seems to be a lot of work just to work around Rightscale's built-in monitoring capabilities...