
Writing custom collectd plug-ins

To collect data for applications that collectd's standard plug-ins do not handle, you'll need to write your own custom plug-ins.

Adding Plug-ins to collectd

There are three methods for adding plug-ins to collectd:

  1. Write native plug-ins in C
  2. Write plug-ins in Perl that are loaded into collectd
  3. Write external plug-ins that run as separate processes

We recommend using the third approach because we've found it to be the most stable and easiest to test and troubleshoot.

Plug-ins written in C tend to perform better, but unless you're collecting a lot of data the language doesn't seem to make a significant difference. We've had compatibility issues with plug-ins written for the Perl interpreter that can be embedded in the collectd process. External plug-ins are what collectd.org calls "exec plug-ins", and those are what we recommend.

The operation of exec plug-ins is not fully documented on the collectd site, so here are a few things that you need to know. In the collectd config you simply specify that collectd should launch an executable as a separate process, read the monitoring data appearing on the standard output of this process, and process this data like all the other collected data. It is the responsibility of the plug-in executable to determine where to get the data from, to collect it at the appropriate interval, to name it correctly, and to write it to standard output. Should the plug-in executable terminate, collectd will restart it.
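
As a rough illustration of the config side, the Ruby sketch below prints the kind of stanza that tells collectd to launch an executable through its exec plug-in. The user name, script path, and arguments are hypothetical placeholders, and exec_stanza is a helper made up for this page. Our understanding is that the exec plug-in refuses to run programs as root, so the first field names an unprivileged user; check the collectd documentation for your version before relying on this.

```ruby
# Sketch: render a collectd.conf stanza for an exec plug-in.
# 'nobody', the script path, and the arguments are hypothetical placeholders.
def exec_stanza(user, executable, *args)
  # Every field on the Exec line is written as a double-quoted string.
  quoted = [user, executable, *args].map { |s| %("#{s}") }.join(' ')
  <<~CONF
    LoadPlugin exec
    <Plugin exec>
      Exec #{quoted}
    </Plugin>
  CONF
end

puts exec_stanza('nobody', '/usr/local/bin/mycpuload', '-h', 'i-123456')
```

The printed text can be pasted into the collectd config; collectd then starts the executable itself, so there is nothing to launch by hand.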

Example 

Let's try a simple example: a plug-in that collects CPU load by calling 'uptime' and parsing the output (no, this is not the best way to do this, but it's a simple example).

#!/usr/bin/env ruby
require 'getoptlong'

# The name of the collectd plugin, something like apache, memory, mysql, interface, ...
PLUGIN_NAME = 'mycpuload'

def usage
  puts("#{$0} -h <host_id> [-i <sampling_interval>]")
  exit
end

# Main
begin
  # Log STDERR to a file (in the future, collectd will handle STDERR); unbuffer STDOUT for collectd's pipe
  STDERR.reopen("/var/log/#{PLUGIN_NAME}_plugin", 'a')
  STDERR.sync = STDOUT.sync = true

  # Parse command line options
  hostname = nil
  sampling_interval = 20  # sec, Default value
  opts = GetoptLong.new(
    [ '--hostid', '-h', GetoptLong::REQUIRED_ARGUMENT ],
    [ '--sampling-interval', '-i',  GetoptLong::REQUIRED_ARGUMENT ]  # the option may be omitted, but needs a value when given
  )
  opts.each do |opt, arg|
    case opt
      when '--hostid'
        hostname = arg
      when '--sampling-interval'
        sampling_interval = arg.to_i
    end
  end
  usage if !hostname

  # Collection loop
  while true do
    start_run = Time.now.to_i
    next_run = start_run + sampling_interval

    # collect data and print the values
    data = `uptime`[/load average: [\d.]+,\s*([\d.]+)/, 1] # second value is the 5-minute load average
    puts("PUTVAL #{hostname}/#{PLUGIN_NAME}/gauge-5_minute_load #{start_run}:#{data}")

    # sleep to make the interval
    while((time_left = (next_run - Time.now.to_i)) > 0) do
      sleep(time_left)
    end
  end
end

If we go ahead and simply run this from the command line, this is what it looks like:

[root@ip-10-251-70-47:/tmp] ./mycpuload -h i-123456
PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188959:0.01
PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188979:0.00
PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188999:0.08
./mycpuload:46:in `sleep': Interrupt
	from ./mycpuload:46
[root@ip-10-251-70-47:/tmp] 

If you look carefully at the timestamps (the 10-digit numbers), you'll notice that the three lines are 20 seconds apart, which is the default sampling interval. The number at the end of each line is the load reported by uptime.
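
To make the structure of these lines concrete, here is a minimal Ruby sketch that splits each PUTVAL line into its identifier, timestamp, and value, then computes the gap between consecutive samples. It covers only the simple "PUTVAL identifier time:value" shape shown above; parse_putval and PUTVAL_RE are names invented for this illustration, not part of collectd.

```ruby
# Sketch: parse the PUTVAL lines from the run above and check their spacing.
# Only the simple "PUTVAL <identifier> <time>:<value>" shape is handled.
PUTVAL_RE = /\APUTVAL (\S+) (\d+):(\S+)\z/

def parse_putval(line)
  m = PUTVAL_RE.match(line.strip)
  return nil unless m  # not a PUTVAL line of the expected shape
  { identifier: m[1], time: m[2].to_i, value: m[3].to_f }
end

lines = [
  'PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188959:0.01',
  'PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188979:0.00',
  'PUTVAL i-123456/mycpuload/gauge-5_minute_load 1207188999:0.08'
]

samples = lines.map { |l| parse_putval(l) }.compact
# Difference in seconds between each sample and the next one.
intervals = samples.each_cons(2).map { |a, b| b[:time] - a[:time] }
puts intervals.inspect  # prints [20, 20]
```

A check like this is handy when testing a plug-in from the command line, before handing it over to collectd.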

Now it's time to explain what the i-123456/mycpuload/gauge-5_minute_load string represents. The format of this string is <instance-id>/<plugin>-<plugin_instance>/<type>-<type_instance> (our example has no plugin_instance). The meaning of each field is:

  • instance-id: the AWS ID of the instance, so the data can be filed away correctly on the server
  • plugin: identifies the plug-in, which is typically associated with an application or a resource; examples are apache, mysql, squid, cpu, memory, etc.
  • plugin_instance: identifies the instance of an application/resource when there are multiple; examples are the 0 and 1 in cpu-0 and cpu-1 on dual-core servers, or the mnt and root in df-mnt and df-root for the two filesystems on small instances
  • type: identifies the type of data being collected, which determines how the values are interpreted and how the graphs are plotted; this is explained in more detail below
  • type_instance: the name of the variable being collected, or the instance of the variable of the given type being collected; examples are (for the cpu type) idle, wait, busy and (for the mysql_command type) selects, updates, executes
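
The fields above can be pulled apart mechanically. The Ruby sketch below is one way to do it, assuming a well-formed identifier with exactly three '/'-separated parts; parse_identifier is a hypothetical helper written for this page, and the second identifier is assembled from the cpu examples in the list rather than taken from real output.

```ruby
# Sketch: split an identifier into the five fields described above.
# Assumes a well-formed identifier; there is no error handling.
def parse_identifier(identifier)
  host, plugin_part, type_part = identifier.split('/', 3)
  # Only the first '-' in each part separates name from instance;
  # an '_' is just part of the name.
  plugin, plugin_instance = plugin_part.split('-', 2)
  type, type_instance = type_part.split('-', 2)
  { host: host, plugin: plugin, plugin_instance: plugin_instance,
    type: type, type_instance: type_instance }
end

ours = parse_identifier('i-123456/mycpuload/gauge-5_minute_load')
puts ours[:plugin]           # prints mycpuload
puts ours[:type]             # prints gauge
puts ours[:type_instance]    # prints 5_minute_load

cpu = parse_identifier('i-123456/cpu-0/cpu-idle')
puts cpu[:plugin_instance]   # prints 0
```

Note that the '-' inside the instance ID is left alone because the string is split on '/' first, and that plugin_instance comes back as nil when the plug-in part has no '-'.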

All this is pretty mysterious at first. Note how a '-' separates plugin from plugin_instance and type from type_instance, while an '_' is sometimes used within any of these four items. The best way to understand how all this works is to look at where each of these identifiers shows up on the web pages.

  • each <plugin>-<plugin_instance> combination results in a menu box at the top of the monitoring page
  • the <type> determines how the data is interpreted and what the graph looks like
  • the <type_instance> shows up in the title of each graph

... to be continued with examples and screenshots ... 

Please alter this code to read the file /proc/loadavg.
Running the program "uptime" requires a fork. Just read the file /proc/loadavg; it is a lot faster.
The first three values are the 1-, 5-, and 15-minute load averages, separated by spaces, so it is very simple to pick the one you want.

$ cat /proc/loadavg
0.00 0.00 0.00 1/66 16862

NOTE: "cat" is also a program; this is just an example. Your code will just open the file.

data = File.read("/proc/loadavg").split[1]  # second field is the 5-minute load average

edited 17:58, 7 May 2008
Posted 17:39, 7 May 2008
Is it possible to include the newly collected data into Alert specs?
If so could you please post instructions?

My use case:
I have a cron job that is copying a directory from one server to another using rsync.
Since this is crucial data, I'd like to be notified when this transfer fails, whether because crond is down, the crontab has problems, or rsync reports some kind of error.
Any advice? The files available to Alert specs do not contain any data that could help me, so I guess I'll have to come up with something on my own.

Unfortunately, the good old method of cron sending an email when a job fails (= produces output) doesn't work for me either: my email provider blocks all emails coming from the EC2 IP address range. Of course I could set up a proper mail relay, but that seems to be a lot of work just to work around RightScale's built-in monitoring capabilities...
Posted 16:54, 9 Oct 2008