SysOrb Network Monitoring System User's Guide: For version 4.6.0
Chapter 1. System Overview
SysOrb has two alert strategies. The simplest is the immediate strategy: when a node or a check is configured to use it, SysOrb sends out an alert the first time the check or node fails. This is especially useful for checks such as uptime, process presence, RAID status and other checks that have no natural fluctuation (that is, a single bad reading means that something is really wrong).
The other alert strategy is ScoreKeeper. ScoreKeeper is useful when SysOrb is monitoring something that sporadically peaks above the set limits. For example, if the network response time to some remote node is monitored and the maximum delay is configured to be 10 ms, we do not usually want an alert if the delay occasionally exceeds this limit. It may be just a single packet that was delayed, so even though that delay may be far above the limit, it is still acceptable because it was a one-time incident.
To work with this kind of fuzzy limit, a score is kept for each node and each of its checks.
The ideal score is 0, and a score can never go below 0. If a check results in a warning, the score is incremented by a specified value until the Warn Ceiling is hit. If a check results in an alert, the score is increased further until the Alert Ceiling is hit. The score can never exceed the Alert Ceiling. If the check succeeds, the score is decremented by a specified value until it reaches 0 again.
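The scoring rules can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not SysOrb's actual implementation: the class name `CheckScore`, the field names and every numeric value are hypothetical. The text does not say what a warning does to a score that is already above the Warn Ceiling, so the sketch assumes the score is left unchanged in that case.

```python
from dataclasses import dataclass


@dataclass
class CheckScore:
    """Score for one check. All numeric defaults are illustrative, not SysOrb defaults."""
    warn_step: int = 10       # added per warning result (hypothetical value)
    alert_step: int = 20      # added per alert result (hypothetical value)
    ok_step: int = 5          # subtracted per good result (hypothetical value)
    warn_ceiling: int = 50    # hypothetical value
    alert_ceiling: int = 100  # hypothetical value
    score: int = 0

    def update(self, result: str) -> int:
        """Apply one check result ("ok", "warning" or "alert") and return the new score."""
        if result == "ok":
            # A good result lowers the score, but never below 0.
            self.score = max(0, self.score - self.ok_step)
        elif result == "warning":
            # A warning raises the score only up to the Warn Ceiling.
            # Assumption: a score already above that ceiling is left as it is.
            if self.score < self.warn_ceiling:
                self.score = min(self.warn_ceiling, self.score + self.warn_step)
        elif result == "alert":
            # An alert raises the score further, capped at the Alert Ceiling.
            self.score = min(self.alert_ceiling, self.score + self.alert_step)
        else:
            raise ValueError(f"unknown result: {result!r}")
        return self.score
```

With these illustrative values, a single alert result lifts the score from 0 to 20, and four good results bring it back to 0. This is how an isolated spike can pass without the score ever climbing far.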
All scores are updated every five seconds, based on the last known value from each check.
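One way to read this, and it is an assumption rather than something the guide states outright, is that the result of a check's most recent run is applied again on every five-second update until a new reading arrives. The sketch below shows what that would mean for a check that runs less often than every five seconds. The function name, the `step` parameter and the default ceiling are hypothetical.

```python
TICK_SECONDS = 5  # update period stated in the text


def score_after_ticks(score: int, step: int, elapsed_seconds: int,
                      floor: int = 0, ceiling: int = 100) -> int:
    """Re-apply the step implied by the last known result once per tick.

    `step` is positive for a failing result and negative for a good one.
    `ceiling` defaults to an illustrative value, not a SysOrb default.
    """
    ticks = elapsed_seconds // TICK_SECONDS
    for _ in range(ticks):
        # Clamp after every tick so the score stays within its bounds.
        score = min(ceiling, max(floor, score + step))
    return score
```

Under this reading, a check that runs once a minute and returns a failing result worth +2 per update would raise its score by 24 before the next reading (12 updates of +2), so the time a problem persists matters as much as the number of bad readings.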
When you configure a new node in the SysOrb Network Monitoring System, you are given the choice of specifying the limits and the increment / decrement values mentioned above. The node's limits are used as limits for all the checks on the node. The increment / decrement values, in contrast, are defined for each check on the node, because they are used to increment / decrement the score of the associated check only. All the check scores are compared against the node's limits, and if the limits are exceeded an alert can be issued. The default values are appropriate for most uses, so you do not need to understand the deeper relationship between the values and the way warnings and alerts work. If you do decide to change these values, however, be certain of the consequences: a misconfigured system is a lot worse than a system running with a sub-optimal configuration.
We recommend that you leave the score and warning / alert / ceiling values at their defaults, and change them only if you both understand what the change will mean and actually need to make it.
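The split between per-node limits and per-check increment / decrement values can be pictured as the following data layout. This is a sketch under assumptions: the class names, the `warn_limit` and `alert_limit` fields, the three state names and all numbers are hypothetical, and the real configuration is done through the SysOrb interface rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """Per-check settings: each check has its own step values and its own score."""
    name: str
    warn_step: int
    alert_step: int
    ok_step: int
    score: int = 0


@dataclass
class Node:
    """Per-node settings: one set of limits shared by every check on the node."""
    name: str
    warn_limit: int   # hypothetical threshold name
    alert_limit: int  # hypothetical threshold name
    checks: list[Check] = field(default_factory=list)

    def states(self) -> dict[str, str]:
        """Compare every check score against the node's limits."""
        result = {}
        for check in self.checks:
            if check.score > self.alert_limit:
                result[check.name] = "alert"
            elif check.score > self.warn_limit:
                result[check.name] = "warning"
            else:
                result[check.name] = "ok"
        return result
```

Giving a check larger steps makes it reach the shared limits after fewer bad readings, which is how checks of differing importance can coexist on one node without separate limits.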
Sometimes a check may enter alert state and leave it again so quickly that nobody notices. Of course, if SysOrb is configured to send out email alerts, it will send one no matter how brief the alert state was. But the email can get lost, or the user may have asked to receive no more than one SysOrb mail every half hour; if one was sent recently, no email may ever arrive for this particular alert.
To be absolutely sure that no alerts go unnoticed, SysOrb provides an option called Alert Acknowledgement, which can be enabled per check. When it is enabled, the check stays in alert state even if the original cause of the problem goes away.
This can be useful on a process presence check, for instance, if the operating system restarts a given process after a short period of time but you still want to know that it was missing. In that case, enabling Alert Acknowledgement results in a red icon showing up as soon as the process stops and staying red even when the process starts again. After the process has started, you must explicitly acknowledge the alert through the web interface for the icon to revert to green.
Some checks (currently only LogChecks) make no sense without Alert Acknowledgement enabled. This is because SysOrb treats a bad line appearing in the log as an instantaneous alert state that returns immediately to good state, awaiting the next line to be appended to the log. With Alert Acknowledgement enabled, the check stays in alert state until a user explicitly acknowledges it.
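The latching behaviour in the process presence example can be modelled as a small state holder. This is a sketch, not SysOrb's implementation: the class and method names are hypothetical. The guide does not say what happens if an alert is acknowledged while the problem is still present, so the sketch assumes the icon stays red until the check itself recovers.

```python
class AcknowledgedCheck:
    """Toy model of a check with optional Alert Acknowledgement."""

    def __init__(self, ack_enabled: bool = True) -> None:
        self.ack_enabled = ack_enabled
        self.failing = False          # result of the latest reading
        self.unacknowledged = False   # latched alert waiting for a user

    def observe(self, failing: bool) -> str:
        """Record a new reading and return the icon colour."""
        self.failing = failing
        if failing and self.ack_enabled:
            # Latch the alert so it survives a later recovery.
            self.unacknowledged = True
        return self.icon()

    def acknowledge(self) -> str:
        """Clear the latched alert, as a user would through the web interface."""
        self.unacknowledged = False
        return self.icon()

    def icon(self) -> str:
        # Assumption: a check that is still failing stays red even after
        # it has been acknowledged.
        return "red" if self.failing or self.unacknowledged else "green"
```

With acknowledgement enabled, the sequence stop, restart, acknowledge yields red, red, green. With it disabled, the icon is already green again at the restart, which is the case where a brief outage can go unnoticed.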