When evaluating Server Monitoring, what aspect do you think is the most important to look for?

ITCS user
12 Answers

author avatar
Top 5 | Leaderboard | Real User

"Server" is a rather vague term in these days of virtualization, but it also gets to the point: context. If you are monitoring any entity, that entity's context in the environment is the most important thing to consider. That means you need to understand the role of the "server", who will consume the information, and how to put the data in context. In today's market, capturing data is typically not the issue; the issue is presenting the data in a context that makes it useful to your users. A server admin, who typically buys the monitor for the server, has to consider the developer's and business analyst's requirements, not just their own. For example, is a Windows server running at 90% meaningful? Is a jump in I/O latency from 10 -> 100 ms important? Is a process that consumed 2% CPU and jumps to 10% CPU going to cause you a problem? How about memory paging? I think the point is obvious: you have to understand the context of the information and the user consuming it.

I have seen each of the above be important at some times and not so much at others:

- 60% CPU on a multi-threaded host, but cpu0 is pegged, limiting the host's ability to delegate work to other CPUs, or the process is single-threaded.

- A 10 -> 100 ms jump in I/O latency: not important for a non-prod server writing to legacy disk frames with a lower SLA.

- A jump from 2% to 10% CPU for a process: if that process runs on 50 virtual instances on a single physical host, it now consumes multiple physical CPUs, which is very expensive. This view is typically more important to someone doing capacity analysis, who may or may not be the server admin.
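
The first and third cases come down to simple arithmetic, which can be sketched in Python. This is only an illustration: the function names, the 95% "pegged" threshold, and the sample numbers are assumptions, not taken from any monitoring product.

```python
from statistics import mean


def pegged_cores(per_core_pct, pegged_at=95.0):
    """Return indexes of cores at or above the (assumed) pegged threshold."""
    return [i for i, pct in enumerate(per_core_pct) if pct >= pegged_at]


def physical_cpu_demand(process_pct, instances):
    """CPU-equivalents consumed when every instance runs the same process.

    Simplification: each instance's percentage is treated as a share of
    one physical CPU.
    """
    return process_pct * instances / 100.0


# Host average looks healthy, yet cpu0 is saturated.
cores = [100.0, 50.0, 45.0, 45.0]
print(mean(cores))          # host-wide average: 60.0
print(pegged_cores(cores))  # [0]

# A 2% -> 10% jump across 50 instances on one physical host.
print(physical_cpu_demand(2, 50))   # 1.0
print(physical_cpu_demand(10, 50))  # 5.0
```

The host-wide average hides the saturated core, and the "small" per-process jump turns one CPU-equivalent into five, which is why the same number means different things to different consumers.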

I split context into two capabilities: (1) observability (manually putting things together) and (2) AI (a machine helping to give context).

(1) Observability: How easy is it to express your metrics in a graphical format? This may include the ability to find your data, multiple visualization options (bar, line, top N, hot spot, ...), personalizing the view, and finally sharing the view. You get a bonus if there are ways to simplify re-using a dashboard template, so that you simply update the context while the tiles stay meaningful. Think about creating a view for an app, then replicating it so that each team can easily change it to their own context. The interface needs to be usable by people who are not server admins. Don't underestimate how hard it is for someone to find their data, especially if they don't administer the platforms.

(2) AI: How does a machine help me understand the context of the information? This may have other applications, but alarming is the easiest to consider. If I can quickly correlate a CPU utilization alarm to a business transaction that is running slow, I have context. A typical server admin has no concept of the applications running, at least not at any decent-sized company. In the same way, a developer has no feel for the utilization of the underlying host (we won't even consider the physical host) when running on any kind of shared infra. To be meaningful, the alarms generated should be presented in the context of the critical business transactions running on the server. The AI needs to help both admins and developers (or a DevOps engineer) become aware of the event and quickly triage through impact and root cause analysis. Improving MTTR should be the goal for any event.

author avatar
Top 5 | Real User

Well, first, there are many different types of servers and domains they serve, so it depends on what the purpose of the server is.
The physical server components will stop it from providing any service if they fail, so it's easy to see that it's vital to monitor them all. But if that server is in a cluster, where other physical servers take over when it fails, then the importance of monitoring each component of each server diminishes.
If the server is in the cloud, private or public, then maybe you're not the owner and just pay a fee, in which case monitoring isn't of much interest to you. Maybe. If the availability and performance of the service you subscribe to can impact YOUR business, then you should consider monitoring whether the service delivers what you pay for.
If the "server" is NOT physical but virtual, then the objects you monitor will be similar, but others that impact the virtual hosting need to be added.
If it's a database, web, application, or other type of server, of which there are many, then the type and list of monitored data needed to detect adverse trends and resource starvation will be different each time.
Whatever the "server", we can also cite the old adage: "You can't manage what you don't measure".
Even if the server isn't viewed from a business or commercial perspective, everyone is judged sooner or later on the capacity to get the job done in a reasonable time. Monitoring should provide just enough visibility of anything that might stop that objective from being met.
Thus what's important to look for is whatever data monitoring can provide that will reduce the risk of 1. compromise from a security perspective, 2. business impact through resource starvation, and 3. changes that can impact the intended behaviour.

author avatar
Real User

There are multiple angles to look at when it comes to monitoring per se; let me list a few.

1. Having separate tools to monitor servers, networks and so on is the traditional method, and it is no longer a value proposition. Look for a tool that can do full-stack monitoring of the environment. This reduces unnecessary integration effort and the fragmented data that comes from multiple integration points. It also keeps the data flow seamless, which helps you manage the environment from a single console.

2. The product you select should be extensible to AI-based methods, as they are going to have a huge impact on infra operations. How complex that is to build is also a question, but it is always good to start, as you don't want to be left out of the AIOps race.

3. The product should be deployable as Docker/container images, which helps in scaling the monitoring solution horizontally.

4. Strong event management should be available to help with event correlation, deduplication and so on. It is expected to become obsolete in the future, but I believe it is going to be around for some time, until deep learning and machine learning algorithms mature.

5. Integration capabilities with third-party systems (API, SNMP, TCP, log).

6. Finally, cost plays a major role, so be clear about what you want in the environment. Product selection should be based on solving and proactively detecting the issues in your environment and on adding the values above.
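
To make point 4 a little more concrete, event deduplication can be sketched roughly like this in Python. The `Event` shape, the `(host, check)` key, and the 5-minute suppression window are assumptions for illustration; real event managers use richer correlation rules.

```python
from dataclasses import dataclass


@dataclass
class Event:
    host: str
    check: str
    timestamp: float  # seconds since epoch


def deduplicate(events, window_s=300.0):
    """Keep one event per (host, check) within each suppression window."""
    last_kept = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.host, ev.check)
        previous = last_kept.get(key)
        if previous is None or ev.timestamp - previous >= window_s:
            kept.append(ev)
            last_kept[key] = ev.timestamp
    return kept


events = [
    Event("web01", "cpu", 0.0),
    Event("web01", "cpu", 60.0),    # duplicate within 5 minutes, dropped
    Event("web01", "disk", 90.0),   # different check, kept
    Event("web01", "cpu", 400.0),   # outside the window, kept
]
print(len(deduplicate(events)))  # 3
```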

There are other pointers, like ease of use, support, user experience and so on, which are a must for any product.

Hope it helps!!

author avatar

There are 4 things you should have in mind when looking for a monitoring system.

1. Do not take the articles that review and compare multiple monitoring systems too seriously. These articles usually focus too much on how many sensors a system delivers and too little on what really matters.

2. Look more at the stuff that lives forever; how the monitoring system handles data.

- What capabilities does it have when it comes to dealing with dependencies?
- Does it store data in a way that makes it easy to implement AI?
- How well can it handle notifications?
- How scalable is it?
- How easy is it to implement custom sensors?
- Does it have any useful features that other monitoring systems do not have?

Bjørn Willy Stokkenes, the architect of Probeturion wrote an interesting article about these things on LinkedIn:

3. Does the vendor deliver proper support?
- Do they answer quickly?
- Do they understand your questions or do they make you send a lot of unrelated information about your settings and so on?
- Do they offer to support you in setting up your monitoring system?
- Do they offer to build custom sensors for you?

4. Do not get fooled by a low price. Remember, your time and your workers' time are worth a lot of money. Sometimes saving 90% on the purchase of an IT system can make you lose 100 times more in wasted man-hours.

author avatar

I think there are three things that should be considered along with the other comments here:

CONTEXT - what else connected to that server is being monitored? Diagnosing faults can be tricky, and it's made much more difficult if you have to go from one monitoring tool for the server to (many?) others for all the devices connected to that server. A tool that shows that server in context with all the things it's connected to can make diagnosing network issues simple.

SELF-HEALING - half the time the tried-and-true power cycling of the device in question solves the problem. If the admin understands the system and knows that the server will occasionally require rebooting, why wake him up at 2am? The monitoring solution should be able to automatically execute self-healing actions like this based on preset conditions. This makes the difference between a 2AM call and a note in the admin's inbox when he gets in the next morning.
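
A minimal sketch of that decision in Python, assuming a hypothetical alert dictionary with `host` and `condition` keys and a placeholder `restart_service` action; none of these names come from a real product:

```python
def handle_alert(alert, remediations, already_tried):
    """Decide between auto-remediation and paging.

    alert: dict with "host" and "condition" keys (hypothetical shape).
    remediations: maps a condition name to a callable that attempts a fix
    and returns True on success.
    already_tried: set of (host, condition) pairs remediated earlier, so a
    recurring fault escalates instead of looping.
    """
    key = (alert["host"], alert["condition"])
    action = remediations.get(alert["condition"])
    if action is None or key in already_tried:
        return "page"
    already_tried.add(key)
    return "email" if action(alert["host"]) else "page"


def restart_service(host):
    # Placeholder: a real action would call your orchestration tooling.
    print(f"restarting service on {host}")
    return True


tried = set()
rules = {"service_hung": restart_service}
alert = {"host": "app07", "condition": "service_hung"}
print(handle_alert(alert, rules, tried))  # email: a note for the morning
print(handle_alert(alert, rules, tried))  # page: the fix did not hold
```

Tracking what has already been tried is the design choice that matters here: a preset action that keeps firing without escalating just hides the fault.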

PROACTIVE ALERTS - if the user notices the network is down you're already losing money and gaining ill-will. A good monitoring tool will let you know when failures are about to happen and alert you before they start impacting your users.
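
One simple way to get "about to happen" is to extrapolate a trend. The sketch below fits a least-squares line to recent samples and estimates the time left before a threshold is crossed; the function name, the sample data, and the 90% threshold are assumptions, and real tools typically use more robust forecasting.

```python
def hours_until_threshold(samples, threshold):
    """Extrapolate a least-squares line through (hour, value) samples.

    Returns hours from the last sample until the line reaches threshold,
    0.0 if it is already reached, or None if the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    if sxx == 0:
        return None
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - samples[-1][0])


# Disk usage growing 2 points per hour; alert before it reaches 90%.
disk_pct = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
eta = hours_until_threshold(disk_pct, threshold=90.0)
print(eta)  # 7.0
```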

And finally, as one last thought, it's nice to have a monitoring tool that will alert the entire IT team via something like Slack in case the admin in question is unable to respond in a timely manner.

author avatar
Real User

Security: which protocols are supported, and which security-related requirements are or are not covered, e.g. FIPS.

Which OSes and databases are supported, for capacity planning and clustering support.

What technologies can be monitored.

author avatar
Top 5 | User

An up-to-date product (or one that continues to get regular updates), ease of use, and an aesthetically pleasing interface.

author avatar
Real User

IMO, I like to engage the app/system/service owners and ask them what they want to see monitored. The experts are usually the people who built the service you are monitoring. Since an engineer is going to get the call at 2 AM when the alarm you set up trips, it's important to work closely with them as well, so you can iron out good thresholds for the warning and then the alarm. Engage the NOC and see if there is first-level support they can do to avoid that 2 AM call. I stick with a default base template built from the OS vendor's recommendations, and then we tweak it to be more accurate for our environment. Server/OS monitoring is pretty standard across the board; I find it's the application/service monitoring that takes a lot more thought. In the end, a few questions usually wrap up the meeting: When do you want me to wake you up at 2 AM? What condition on the system warrants this call? When do you want me to send an automatic email for awareness? When do you want a ticket and email only? Every organization will have its own method for monitoring, and it should be an ever-growing and evolving process. Every outage should have an RCA, and the monitors should be reviewed. Did we know this was coming? Could we have alerted sooner and avoided user impact? How should we monitor going forward?
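
The answers to those wrap-up questions can be written down as a tiered policy. Here is a rough Python sketch; the metric name, the limits, and the action labels are hypothetical values of the kind you would agree with the service owner, not defaults from any tool.

```python
# Hypothetical policy agreed with the service owner: metric -> ascending tiers.
POLICY = {
    "disk_used_pct": [
        (80.0, "email"),         # awareness only
        (90.0, "ticket+email"),  # handle during business hours
        (97.0, "page"),          # worth a 2 AM call
    ],
}


def route(metric, value, policy=POLICY):
    """Return the action for the highest tier the value has reached, or None."""
    action = None
    for limit, tier_action in policy.get(metric, []):
        if value >= limit:
            action = tier_action
    return action


print(route("disk_used_pct", 75.0))  # None
print(route("disk_used_pct", 85.0))  # email
print(route("disk_used_pct", 92.0))  # ticket+email
print(route("disk_used_pct", 98.5))  # page
```

Keeping the policy as data rather than code makes it easy to revisit after each RCA.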

author avatar

Our servers are so different in terms of monitoring protocols! Some of them support SNMP, some SSH, and some neither, so you need to install some kind of agent. And for all of them we need to monitor CPU load, memory usage, disk usage, bandwidth, cloud services, web page/site responsiveness, VoIP, SQL, SSH, FTP, HTTP/HTTPS... We tried several tools, including the ones described here. But we finally found CloudView NMS http://www.cloudviewnms.com, which actually had the set of features we needed out of the box. It is universal and combines both network and server monitoring.

author avatar

1. Learning curve. If it is low, the various monitoring users can build and fine-tune the monitoring themselves, making you less of a bottleneck.
2. API integration capabilities, specifically with a ticketing tool, with Telegram or a similar tool, and for report generation.
3. Quality of support for the tool. If there are issues, how quickly can they be resolved? Since you are monitoring the environment, your downtime is supposed to be measured in minutes, not hours or days.
4. Automation capabilities. For DevOps: automated provisioning and decommissioning, and auto-correction (self-healing).

I wouldn't stress about proactive alerts, as those are very basic capabilities and should exist by default in any monitoring tool.

I would recommend Zabbix, Grafana and Ansible to get started with.

author avatar

I think the most important are:
Processor Utilization
Memory Utilization
Disk Utilization

author avatar
Real User

The most important thing is trending, plus support for multiple ways of alerting:

A. Paging
B. Case
C. Mail

Updated: January 2022.