This time I’d like to share some insights we gained while helping our first users install our monitoring & alerting tool for ClearCase, Jenkins and ClearQuest.
A threshold is the point at which your system begins to experience difficulties. For example: high memory usage in your Jenkins master or a low number of available licenses in your ClearCase / ClearQuest environments. That is when you need to be alerted so you can take the necessary actions.
During installations at customer sites we noticed that most customers clicked “Next” when it was time to configure these parameters. They usually relied on the default configuration suggested by the vendor, which is good in many cases. However, by doing so they were potentially throwing away one of the most useful features a monitoring and alerting tool provides: Error and Warning notifications. Furthermore, using thresholds that are not optimized for your environment can actually diminish your ability to gain control over your system. If you have a 1TB disk that stores all your VOBs and the vendor chose default parameters suitable for a 2TB disk, you will not be alerted that you are about to run out of space until it is too late and your VOBs are already corrupted.
Setting critical thresholds
The first and most important threshold is the error threshold. When this threshold is reached you must deal with the issue or you might suffer performance degradation, downtime or even corruption of data.
Things to consider when defining a threshold for error notifications:
- What is the issue you wish to avoid? (high CPU usage on Jenkins host, disk space or licenses on ClearCase / ClearQuest hosts)
- How much time do you need to respond to the notification, taking into account things like:
  - Do I need to shut down the system for the procedure?
  - Which other systems rely on the server? (Jenkins masters, Jenkins slaves, ClearCase clients, etc.)
  - If other teams are involved, how long does it take to get service from them?
  - How much time do you need on the system?
- Make sure that your resources can continue performing properly and do not reach a critical state while you handle the issue.
- Take weekends into consideration.
What to consider when defining a threshold for warning notifications:
Warning thresholds are there to let you know that you are nearing a situation that will impact your system’s performance.
They keep you from being caught off guard and give you more time to handle the upcoming issue.
The use of 2 thresholds helps to increase awareness of the system status and improves the chances of handling issues beforehand and avoiding performance degradation and even downtime. This is especially important when monitoring tools like Jenkins, ClearCase and ClearQuest.
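As a minimal sketch of how two thresholds split a reading into three states, the function below classifies a single usage value. The function name `classify` and the example values (a warning threshold of 70 and an error threshold of 80, in percent of capacity) are hypothetical and chosen only for illustration; they are not taken from any particular tool.

```python
def classify(used: float, warning: float, error: float) -> str:
    """Return 'ERROR', 'WARNING' or 'OK' for one usage reading.

    All three arguments must use the same unit (GB, percent, licenses in use...).
    """
    if warning >= error:
        # A warning that fires at or after the error threshold gives no advance notice.
        raise ValueError("warning threshold must be lower than error threshold")
    if used >= error:
        return "ERROR"
    if used >= warning:
        return "WARNING"
    return "OK"


# Hypothetical thresholds: warn at 70% used, error at 80% used.
for reading in (55, 72, 80):
    print(reading, classify(reading, warning=70, error=80))
```

The check at the top reflects the point of having two levels: the warning only helps if it is reached before the error.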
Let’s see how we set thresholds
I have a ClearCase or Jenkins host that has 1TB of storage. When the host uses more than 80% of its storage capacity, a ClearCase host risks VOB corruption and a Jenkins host starts experiencing performance degradation.
I also know that, on average, 1GB of new data is stored and used by the system every day, and lastly I know that adding new storage to my system is a procedure that takes 24 hours.
We always start with the error threshold because it is the more important threshold.
Setting an error threshold
When I take all of the above into account, I see that:
- The system cannot reach 800GB.
- I need to reserve 1GB for the 24 hours it takes to add new storage to the host.
- I need to reserve 2GB to make sure that even if the alert is received during the weekend I still have enough time to address it.
That leaves 797GB as the latest safe point. Rounding down for a little extra margin, I will set the error threshold at either 795GB or 79%.
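The arithmetic behind this choice can be written down as a rough sketch. It assumes steady growth of 1GB per day, a 1000GB (1TB) disk, and whole-percent thresholds; the function names are hypothetical. The calculation gives the latest point at which the error alert can still fire safely, and the 795GB chosen above sits slightly below that point.

```python
import math


def latest_error_threshold_gb(critical_gb, daily_growth_gb, response_days, weekend_days):
    """Highest usage at which an error alert still leaves enough time to react."""
    # Space that will fill up while the fix is in progress, plus the weekend buffer.
    reserve_gb = daily_growth_gb * (response_days + weekend_days)
    return critical_gb - reserve_gb


def as_whole_percent(threshold_gb, capacity_gb):
    """Convert a GB threshold to a whole percentage, rounding down to stay on the safe side."""
    return math.floor(100 * threshold_gb / capacity_gb)


upper_bound_gb = latest_error_threshold_gb(
    critical_gb=800,     # 80% of the 1TB disk
    daily_growth_gb=1,   # average new data per day
    response_days=1,     # 24 hours to add storage
    weekend_days=2,      # alert may arrive at the start of a weekend
)
print(upper_bound_gb)                # 797
print(as_whole_percent(795, 1000))   # 79
```

Any value at or below the computed bound works; picking a round number such as 795GB simply trades a little disk space for extra reaction time.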
Setting a warning threshold:
The warning threshold exists primarily to let you know that you are nearing one of your critical thresholds.
Twenty-four hours of advance notice is a good amount of time to prepare and adjust your schedule (if needed) to handle the upcoming issue, perhaps by sending an email to the storage team asking them to allocate more space to the host.
So I will set the warning threshold at either 790GB or 78%.
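To check how much notice a warning threshold really buys, the gap between the two thresholds can be divided by the daily growth. This is a sketch under the same assumption of steady 1GB-per-day growth, with hypothetical function names. With the values above, 24 hours of notice would allow a warning as late as 794GB, so 790GB gives roughly five days.

```python
def latest_warning_threshold_gb(error_gb, daily_growth_gb, notice_days):
    """Highest warning threshold that still gives `notice_days` before the error threshold."""
    return error_gb - daily_growth_gb * notice_days


def days_of_notice(warning_gb, error_gb, daily_growth_gb):
    """Days between crossing the warning threshold and crossing the error threshold."""
    if daily_growth_gb <= 0:
        raise ValueError("daily growth must be positive")
    if warning_gb > error_gb:
        raise ValueError("warning threshold must not exceed error threshold")
    return (error_gb - warning_gb) / daily_growth_gb


print(latest_warning_threshold_gb(error_gb=795, daily_growth_gb=1, notice_days=1))  # 794
print(days_of_notice(warning_gb=790, error_gb=795, daily_growth_gb=1))              # 5.0
```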
It is important to remember that a threshold changes with the system. If any of the parameters you used to set the threshold changes, you need to change the threshold accordingly.
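One way to keep thresholds in step with the system is to store the inputs rather than the result and recompute whenever an input changes. The sketch below is illustrative only: the `HostProfile` class and its field names are hypothetical, and the "faster growth" and "bigger disk" scenarios are made-up examples, not measurements.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HostProfile:
    """Inputs that determine the error threshold for one host (hypothetical model)."""
    capacity_gb: int
    critical_percent: int   # usage level at which problems begin
    daily_growth_gb: int
    response_days: int      # time needed to add storage
    weekend_days: int = 2   # buffer for alerts that arrive before a weekend

    def error_threshold_gb(self) -> int:
        critical_gb = self.capacity_gb * self.critical_percent // 100
        reserve_gb = self.daily_growth_gb * (self.response_days + self.weekend_days)
        return critical_gb - reserve_gb


original = HostProfile(capacity_gb=1000, critical_percent=80,
                       daily_growth_gb=1, response_days=1)
faster_growth = replace(original, daily_growth_gb=5)   # data now grows 5GB per day
bigger_disk = replace(original, capacity_gb=2000)      # storage was extended to 2TB

print(original.error_threshold_gb())       # 797
print(faster_growth.error_threshold_gb())  # 785
print(bigger_disk.error_threshold_gb())    # 1597
```

Each result is the latest safe point for that scenario; as in the worked example, you may still want to round down for margin.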
With our performance monitoring and alerting tool, ALM Vitality, we encourage our users to take a moment to set thresholds that suit their specific environment. It lets them set at least one notification for the usage of IT components such as CPU, memory and storage, as well as application-based thresholds such as available licenses in a ClearQuest / ClearCase environment.
ALM Vitality alerts you every time your host reaches one of its thresholds, both via the dashboard and via an email notification that already contains all the relevant information.
About the author:
David-Or Cohen is the product manager for ALM Vitality.
Prior to this position, David was the worldwide interoperability manager for the IBM-XIV storage device and a strategic consultant for all IT infrastructure and DevOps methodologies.
Outside of work, David’s passion is kayaking, both sea and whitewater kayaking, and his motto is “it is better to die in your kayak than swim”.