The RightScale alert system is built on top of the Monitoring System. The overall flow of events through the alert system is as follows:
- The Monitoring System collects data and stores it in RRD database files on the RightScale monitoring server.
- The RightScale alert daemon runs every minute and evaluates alert conditions described in alert specifications using the monitoring data. If any alert condition is satisfied, the alert becomes active.
- Active alerts are tracked through an escalation process which determines what actions are initiated as time passes if the alert does not subside.
- A set of RightScale worker daemons process the actions to send email alerts, restart servers, run scripts on servers, or launch additional servers.
Alert specifications and escalations are also used to set up autoscaling for a deployment. See How do I set up Autoscaling?
NOTE: The Alert System is a feature that is only available for Premium accounts. If you have a Developer account, you will need to contact sales@rightscale.com to upgrade.
Alert Specifications
Alert specifications define the conditions under which an alert is raised, based on monitoring data. They specify when to raise an alert and which alert escalation to feed it into. What happens once the alert is raised (for example, whether an email is sent or a server is restarted) is defined in Alert Escalations, not in the alert specification.
Once alert specifications are defined, they can then be attached to Server Templates and to Servers. Each alert spec also names an escalation list, which must be defined either for the deployment in which an alerting server is running or for the account as a whole. In addition, the account has a 'default' escalation list which is a catch-all.
Each Alert Specification has the following parameters:
- Name - a unique name for the alert specification.
- Description - a more detailed description of the alert's purpose and how to use it.
- File - the metric/condition that will be monitored for the alert. To see a full list of metrics, see List of Monitored Metrics.
- Variable - the type of variable that is measured. The available variable options depend on which file (metric) is selected. (ex: value, count, write, read, tx, rx, shortterm, midterm, longterm, ping, syst, user, minflt, majflt)
- Condition - (>, >=, <, <=, ==, !=) The condition and threshold define the trigger for the alert. If the condition and threshold are met, the alert will be raised. (ex: Raise an alert to shrink the array if the server's cpu-0/cpu-idle value is greater than 85% for at least 3 minutes.)
- Threshold - (percent, count, etc.) The condition and threshold define the trigger for the alert. If the condition and threshold are met, the alert will be raised.
- Duration - (minutes) The amount of time that the condition must exist before an alert is raised. If a condition exists, but does not persist long enough to meet the specified duration, an alert will not be raised.
- Escalation - The name of the alert escalation that should be called if all conditions are met and an alert is raised. An alert escalation can be one action or a list of several actions. The alert escalation must be defined on the deployment of the server or on the account. The "default" escalation is used if there are no matches.
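The parameters above can be modelled in a few lines of Python. This is a minimal sketch and not RightScale's implementation: the `AlertSpec` class, its field names, and the `is_raised` helper are hypothetical, and the sketch assumes one monitoring sample per minute.

```python
import operator
from dataclasses import dataclass

# Map the condition symbols listed above to comparison functions.
CONDITIONS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


@dataclass
class AlertSpec:
    """Hypothetical in-memory model of an alert specification."""
    name: str
    file: str          # metric, e.g. "cpu-0/cpu-idle"
    variable: str      # e.g. "value"
    condition: str     # one of the keys of CONDITIONS
    threshold: float
    duration: int      # minutes the condition must persist
    escalation: str    # name of the escalation to call
    description: str = ""


def is_raised(spec, samples):
    """Return True if the newest `duration` samples all satisfy the condition.

    `samples` is assumed to hold one reading per minute, oldest first.
    """
    if len(samples) < spec.duration:
        return False  # condition cannot have persisted long enough yet
    compare = CONDITIONS[spec.condition]
    recent = samples[len(samples) - spec.duration:]
    return all(compare(value, spec.threshold) for value in recent)


# The example from the Condition parameter: cpu-idle above 85% for 3 minutes.
shrink = AlertSpec(
    name="cpu idle high",
    file="cpu-0/cpu-idle",
    variable="value",
    condition=">",
    threshold=85,
    duration=3,
    escalation="shrink",
)

print(is_raised(shrink, [90, 80, 91, 92]))  # dipped to 80 within the window
print(is_raised(shrink, [80, 90, 91, 92]))  # above 85 for the last 3 samples
```

In this sketch a single sample that fails the condition resets the window, which mirrors the Duration rule that a condition must persist for the whole period before an alert is raised.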
Alert Escalations
An alert escalation describes the steps that need to be taken in response to an actual alert, as defined in an Alert Specification.
An alert escalation is identified by name and can be attached either to a specific deployment or to the account as a whole. This means that if an alert is triggered and its specification names the "critical" escalation, the deployment of the server triggering the alert is searched first for an escalation by that name, and then the account. If neither has a "critical" escalation, the "default" escalation is used.
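The lookup order can be sketched as follows. This is an illustration only: the `resolve_escalation` function and the dictionaries that stand in for the deployment-level and account-level escalations are hypothetical, and the sketch assumes the account always carries a "default" entry.

```python
def resolve_escalation(name, deployment_escalations, account_escalations):
    """Look up an escalation by name: deployment first, then account, then default."""
    if name in deployment_escalations:
        return deployment_escalations[name]
    if name in account_escalations:
        return account_escalations[name]
    # Catch-all: assumed to exist on every account.
    return account_escalations["default"]


# Hypothetical data: escalation name -> description of what it does.
deployment = {"critical": "deployment-level critical"}
account = {
    "critical": "account-level critical",
    "warning": "account-level warning",
    "default": "account default",
}

print(resolve_escalation("critical", deployment, account))  # deployment wins
print(resolve_escalation("warning", deployment, account))   # falls back to account
print(resolve_escalation("unknown", deployment, account))   # falls back to default
```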
For each escalation, several actions can be defined. For example, the "critical" escalation might first email the primary sysadmin's cell phone every 10 minutes. If the alert is still active after 60 minutes, it is escalated to the next action, which might email all sysadmins every 60 minutes. When this escalation to the second action occurs, the first action can be set either to stop or to keep running; if it keeps running, the primary sysadmin still receives an email every 10 minutes.
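The timing in this example can be worked through with a small Python sketch. It is not RightScale's scheduler: the `Step` class, its field names, and the `actions_due` function are hypothetical, and the sketch assumes that each action fires once immediately when it starts and then repeats at its interval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One action in a hypothetical escalation list."""
    action: str
    every: int                            # repeat interval in minutes
    escalate_after: Optional[int] = None  # minutes before moving on; None for the last step
    keep_running: bool = False            # keep firing after escalation?


def actions_due(steps, minute):
    """Return the actions that fire `minute` minutes after the alert became active."""
    due = []
    start = 0  # minute at which the current step started
    for step in steps:
        if minute < start:
            break  # this step and all later ones have not started yet
        escalated = (
            step.escalate_after is not None
            and minute >= start + step.escalate_after
        )
        still_active = not escalated or step.keep_running
        if still_active and (minute - start) % step.every == 0:
            due.append(step.action)
        if step.escalate_after is None:
            break  # last step: nothing further to escalate to
        start += step.escalate_after
    return due


critical = [
    Step("email primary sysadmin", every=10, escalate_after=60, keep_running=True),
    Step("email all sysadmins", every=60),
]

for minute in (0, 50, 60, 70, 120):
    print(minute, actions_due(critical, minute))
```

With `keep_running=True` on the first step, both actions fire at minute 60 and the primary sysadmin keeps receiving mail every 10 minutes afterwards; setting it to `False` leaves only the second action active from minute 60 onward.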
The supported actions are:
- send email - Send email to one or multiple recipients
- reboot_server - Reboot the server generating the alert
- relaunch_server - Relaunch the server generating the alert (this terminates the current instance and launches a fresh instance)
- run_right_script - Run a RightScript on the server generating the alert
- vote_grow_array - Vote to grow a server array attached to the deployment in which the alerting server is running.
- vote_shrink_array - Vote to shrink a server array attached to the deployment in which the alerting server is running.
To learn how to set up alerts and escalations in order to scale-up or scale-down your deployment, please see How do I set up Autoscaling?