'Hardware' monitoring in Zabbix is straightforward because that is what Zabbix was designed for. However, just because something can be monitored doesn't mean you care about it. This section discusses how to discover what you care about, how to monitor those items, and how to remove the rest of the noise.

Defining Hardware

Before we get into too much of this, let’s define hardware in the context of Zabbix monitoring. You would think something like this would be simple: a CPU is a physical chip, so it is hardware. However, when you start to get into the details of monitoring it, a lot of questions arise. For example, how do you collect the metrics? Perhaps more importantly, what metrics do you care about?

Obviously, you can only collect metrics that have values, and those values can only be collected by a collector that has a way to measure them. So saying “I want to know how much power my CPU has available” is silly. However, saying “I want to measure how much power my CPU is consuming” is possible, and may be desirable.

Many engineers have taken the time to work out how to measure these metrics, and they build the collectors right into the motherboard or other related hardware so the values can be exposed to monitoring software. Often that software is the operating system itself. It is not ideal to collect metrics about a system through a subsystem that depends on that same system. It is convenient, however, and in the case of computers it is often the only way available to us. Because of this convenience and the ubiquity of such collection points, software such as Zabbix often uses them as the primary means of collecting data about the underlying hardware. This means the metrics received are not always true indicators of what we think we are monitoring. However, decades of experience have given IT engineers the knowledge to identify and calculate values that are indicative of what we want to monitor. Say What?!?!

Let’s take a made-up example to explain what I mean. The core temperature of an electronic device directly affects its ability to perform efficiently. Generally speaking, the hotter an electronic circuit is, the less efficient it is. Inside the CPU of a computer, the electricity running through the circuits generates heat. The more processing the CPU is doing, the more heat it generates. Internal CPU temperature may be as high as 100 degrees Centigrade or more. The temperature probe that measures the CPU temperature, however, sits between the heat sink and the CPU, so the thermometer may show a temperature of only 95 degrees Centigrade. Even so, since we know we want the internal CPU temperature below 75 degrees Centigrade, the thermometer reading is still indicative of the high heat situation internally.

Note: In a real CPU, they measure temperature as a physical property of the chip as calculated using current resistance. This is far more accurate and effective. The above was just to illustrate how an indicative value can effectively replace a true value.
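To make the made-up probe example concrete, here is a minimal Python sketch of the arithmetic. It assumes the gap between the probe and the internal temperature is a constant 5 degrees (100 internal versus 95 at the probe), which is a simplification I am making for illustration only. The function names are hypothetical and are not part of Zabbix.

```python
def probe_threshold(internal_limit_c, probe_offset_c):
    """Translate an internal temperature limit into the reading the probe would show.

    Assumes the probe always reads a fixed number of degrees below the
    internal temperature (an illustrative simplification).
    """
    return internal_limit_c - probe_offset_c


def is_overheating(probe_reading_c, internal_limit_c=75.0, probe_offset_c=5.0):
    """Decide from the indicative probe reading whether the internal limit is exceeded."""
    estimated_internal_c = probe_reading_c + probe_offset_c
    return estimated_internal_c > internal_limit_c
```

With the numbers from the example, `probe_threshold(75.0, 5.0)` gives 70.0, so an alert on the probe value would be set at 70 degrees rather than 75. The probe never reports the true internal value, but it is still enough to answer the question we actually care about.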

Why do I care if the value is true or indicative? The reason I bring up this point is that what is actually measured, and how it is measured, directly relates to what you are monitoring. For example, if I want to know the average CPU load (a common concern), then it is important to define what counts as load. In its most literal sense, a load would be when a CPU is processing a calculation. However, in that context, average load does not make sense. It is either processing or it isn’t. So then you say, well, over the last 10 minutes, how much was it processing? That gives you context that can give you a result, but is it really what you care about? Cutting to the chase, the indicative value used for CPU load in most operating systems (and indeed most monitoring systems) is not a direct measure of CPU processing at all. Instead, they look at the average size of the CPU's ready queue over time. If the queue size is consistently growing over time, then your CPU will be busy longer, or perhaps it is stuck in a looping condition or some other problem is preventing the CPU from processing other requests. In the end, when you ask about average CPU load, you want to know two things: “Is the CPU functioning normally?” and “Do I have enough processing power to process what I am sending to the CPU?” The indicative value of average ready queue size provides that answer.
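As a rough illustration of the queue-based load value described above, the following Python sketch reads the operating system's load averages and divides them by the number of cores. The helper names and the threshold of 1.0 per core are my own assumptions, not Zabbix defaults. Note also that on Linux the load average counts tasks in uninterruptible sleep (typically waiting on I/O) as well as runnable ones, so it is indicative rather than a pure ready queue size.

```python
import os


def load_per_core(load_averages, cpu_count):
    """Normalize load averages by core count so hosts of different sizes compare fairly."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in load_averages)


def is_saturated(per_core_load, threshold=1.0):
    """Treat more than `threshold` queued tasks per core as a sign of saturation.

    The default of 1.0 is an illustrative assumption, not a recommended value.
    """
    return per_core_load > threshold


def current_load_per_core():
    """Read the 1, 5 and 15 minute load averages from the OS (Unix-like systems only)."""
    cores = os.cpu_count() or 1
    return load_per_core(os.getloadavg(), cores)
```

A host with four cores and a 1 minute load average of 4.0 comes out at 1.0 per core, meaning the queue is just keeping every core busy. The Zabbix agent exposes a comparable value through the item key `system.cpu.load[percpu,avg1]`.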

But what if you really want to know something that the true value provides and the indicative value does not? Perhaps there is a different indicative value you can use. Perhaps there is a way to get the true value that is outside of the typical metrics people use. The point is that knowing what you want AND how it is obtained does matter, and you should question any values you obtain to ensure they answer the question you are asking.

Getting back to Zabbix and defining the hardware: Zabbix, like most monitoring solutions, pulls the metrics from the OS. So when we say we are monitoring ‘hardware’, what we are really saying is that we are monitoring what the operating system says it sees from the hardware. This can get particularly complicated when you are dealing with virtualization, and especially nested virtualization. Unless you have access to the actual hypervisor OS that is running on the physical hardware, you may not even have access to many of the metrics that are needed to monitor the hardware effectively. Keeping this in mind will aid greatly in determining what hardware to monitor, and how.

Metrics can come into Zabbix using at least two different common approaches: SNMP and/or agent queries. (There are other methods, but they are less common and outside the scope of this document.)

Agents are just what they sound like. An agent is a small, efficient piece of software that is installed on each machine being monitored to provide detailed information about that server or device. It often allows for MUCH greater detail and, in the case of Zabbix at least, is extensible, so Zabbix can monitor just about anything you are able to measure. Agents, however, are often derided for the additional resources they consume, the additional attack vector they may introduce, and their high-touch requirements.

As such, many organizations prefer the “agentless” approach that SNMP offers. SNMP is a standardized protocol, supported by most operating systems and network devices, that exposes a structured set of metrics. The SNMP service on the device is effectively the agent for any monitoring solution that can poll it or receive its “traps” (notifications the device sends on its own). SNMP typically does not provide the level of granularity a dedicated agent provides, and it is harder to extend. However, in many cases it provides enough information for basic monitoring, and for most non-server devices it may be the only option for monitoring at all.
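To give a feel for the agent extensibility mentioned above, here is a hedged sketch of a small Python script that a Zabbix agent could run through a `UserParameter` entry. The item key `custom.cpu.temp` and the script location are hypothetical names I made up for the example. The sensor path is the usual Linux thermal zone file, which reports millidegrees Celsius, but whether it exists and which sensor it represents depends on the machine.

```python
#!/usr/bin/env python3
import sys
from pathlib import Path

# Hypothetical line for the agent configuration file (key and path are made up):
#   UserParameter=custom.cpu.temp,/usr/local/bin/cpu_temp.py

# Common Linux location; an assumption that may not hold on every host.
DEFAULT_SENSOR = "/sys/class/thermal/thermal_zone0/temp"


def read_temperature_c(sensor_path=DEFAULT_SENSOR):
    """Return the sensor value in degrees Celsius (the file holds millidegrees)."""
    raw = Path(sensor_path).read_text().strip()
    return int(raw) / 1000.0


def main():
    """Print a single number on stdout, which is what the agent passes back as the item value."""
    try:
        print(f"{read_temperature_c():.1f}")
    except (OSError, ValueError) as exc:
        # Report the problem on stderr and print no value.
        print(f"cannot read {DEFAULT_SENSOR}: {exc}", file=sys.stderr)


if __name__ == "__main__":
    main()
```

The design point is that the agent only needs a command that prints one value, so anything you can measure from a script can become an item. The cost is exactly the high-touch maintenance described above: the script and the configuration line have to be deployed to every monitored host.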


Tags: zabbix