This device has semi-active air cooling, which means active cooling is turned on only after the GPU temperature reaches a certain threshold — a nice feature which eliminates extra noise when the GPU is loaded only occasionally. For this particular device the threshold is around 60°C. Hardware, such as the design of the radiators, of course plays a big role in implementing this feature, but it is not a solely hardware feature: automatic cooling management is provided by software, namely the graphics driver.

Available APIs

First things first, it’s important to know there are 2 official APIs:

NVML;

NVAPI.

Let’s take a brief look at their key abilities and differences.

NVML

This API is provided specifically for monitoring and management purposes. It is used by the nvidia-smi command-line utility, which can provide reference metrics when the correctness of API usage needs to be validated, and this is one of its major advantages. An example of nvidia-smi output is shown below:

Figure 2 — Example output of nvidia-smi

The output is split into 2 parts: a header with a GPU list, and a process list. The header contains information about the versions of the drivers and APIs used. The GPU list enumerates detected GPU devices and reports several metrics per GPU:

intended fan speed (in % of max RPM) for all fans present on a device;

actual chip temperature (in °C);

actual power used and power cap (in W) for GPU, device’s memory and surrounding circuitry;

actual memory utilization (in %);

actual GPU utilization (in %).

These are the core metrics and they are enough for the example part of this article.
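In case these metrics need to be polled in a scripted way, nvidia-smi also has a query mode; the exact field names can be checked via nvidia-smi --help-query-gpu. For example:

nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw,utilization.gpu,utilization.memory --format=csv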

If a Linux-based system is the target platform, NVML is the only option. Availability on Linux-based systems is another major advantage of this API.

As for distribution, on Windows NVML is available as a .dll and comes preinstalled with the graphics drivers. Usually it’s available at:

C:\Program Files\NVIDIA Corporation\NVSMI\nvml.dll

The command-line utility nvidia-smi is located in the same directory.

As for Linux-based systems, nvidia-smi also comes installed with the graphics drivers. NVML is available as a .so, but it has to be installed separately. For example, on CentOS 7 this library is provided by the nvidia-x11-drv-libs package, which is available in the ELRepo repository.
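Installing it can then be as simple as the following (assuming the ELRepo repository has already been added to the system):

sudo yum install nvidia-x11-drv-libs

After the installation, the library on 64-bit systems is usually available at: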

/usr/lib64/libnvidia-ml.so

The actual name may vary, though, as the driver’s version is appended after the .so extension. E.g., with drivers of version 440.31 the full path will be:

/usr/lib64/libnvidia-ml.so.440.31

Hence, another advantage of NVML is that it always corresponds to the installed, most recent drivers.

Another plus is that all functions described in the documentation are exported symbols of the shared library. Hence, they can be listed directly by examining the library, and they can be loaded by name, which is pretty convenient.
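As a minimal illustration (assuming a Linux system and the common soname libnvidia-ml.so.1; link with -ldl), a documented function can be loaded simply by its name:

#include <dlfcn.h>
#include <cstdio>

// nvmlInit_v2 returns a status code; 0 means NVML_SUCCESS.
typedef int (*nvmlInit_t)();

int main() {
    // Open the NVML shared library by its soname.
    void* lib = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
    if (!lib) { std::fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

    // Load the documented function directly by its exported name.
    auto nvmlInit = reinterpret_cast<nvmlInit_t>(dlsym(lib, "nvmlInit_v2"));
    if (!nvmlInit) { std::fprintf(stderr, "nvmlInit_v2 not found\n"); return 1; }

    std::printf("nvmlInit_v2 returned %d\n", nvmlInit());
    dlclose(lib);
    return 0;
}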

Additionally, Nvidia provides official Perl and Python bindings for NVML, available at CPAN and PyPI as nvidia-ml-pl and nvidia-ml-py respectively. There are no docs for the Python package, though, so one will have to download its sources and examine them, at least to get the correct package name to use in import statements.

However, in spite of having many advantages, NVML has at least a couple of drawbacks.

One such drawback is the absence of header files with definitions of data types and functions. To use the library, one will have to write these definitions manually, which can become pretty annoying. Fortunately, all the info needed is described in the reference docs, so there’s no need to guess what to write.
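For example, the declarations below are written by hand from the reference docs (only the handful needed for reading the temperature; other status codes and sensor types are omitted):

// Hand-written NVML declarations reconstructed from the reference docs.
typedef enum { NVML_SUCCESS = 0 /* other status codes omitted */ } nvmlReturn_t;
typedef enum { NVML_TEMPERATURE_GPU = 0 } nvmlTemperatureSensors_t;
typedef struct nvmlDevice_st* nvmlDevice_t; // opaque device handle

// Function-pointer types matching the documented signatures,
// to be filled in via dlsym/GetProcAddress.
typedef nvmlReturn_t (*nvmlDeviceGetHandleByIndex_t)(unsigned int index,
                                                     nvmlDevice_t* device);
typedef nvmlReturn_t (*nvmlDeviceGetTemperature_t)(nvmlDevice_t device,
                                                   nvmlTemperatureSensors_t sensorType,
                                                   unsigned int* temp);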

Another drawback of NVML is that it does not allow manual fan control: one can look at fan speeds but cannot touch them. That’s quite awful, as NVML is the only API available on Linux-based systems.

A digression on Nvidia drivers and Linux-based systems

Just in case one is going to spin up a Linux-based server and utilize an Nvidia GPU for computation, there’s one little thing to keep in mind:

To make certain functionality work, e.g. automatic fan control, X server must be installed and running.

That’s a pretty big slip. Alas.

NVAPI

If one were asked to describe NVAPI in a couple of words, the answer would be the following:

it’s available both as a static and as a dynamic library;

its dynamic library provides hidden undocumented functionality which is not available in the static version;

certain measurement units are different from those used in NVML;

it’s more fine-grained and more featureful than NVML for certain use-cases;

it’s exclusively for Windows.

Firstly, a little note on the distribution of the libraries: Nvidia’s download center allows people registered as developers to download an SDK. It provides all the necessary C headers and static libraries precompiled for both 32- and 64-bit architectures. All data types and functions defined in the headers are documented and pretty self-explanatory. All of that makes for a pleasant development experience.

What one is unlikely to learn from the official documents is the existence of NVAPI’s dynamic library — nvapi.dll. This DLL is installed with the graphics driver and is typically available at:

C:\Windows\System32\nvapi.dll

Obviously, no docs and no headers are available, and there may be nothing wrong with calling the usage of nvapi.dll a mild form of hacking. Or masochism.

The only function exported by nvapi.dll is nvapi_QueryInterface :

Microsoft (R) COFF/PE Dumper Version 14.23.28106.4
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file nvapi.dll

File Type: DLL

  Section contains the following exports for nvapi.dll

    00000000 characteristics
    5DC35FBD time date stamp Thu Nov  7 02:05:17 2019
        0.00 version
           1 ordinal base
           1 number of functions
           1 number of names

    ordinal hint RVA      name

          1    0 0009FDE0 nvapi_QueryInterface

…

The function nvapi_QueryInterface is used to load the other, internal functions by a memory address or offset. For example, the function NvAPI_Initialize has to be loaded as follows before it can be used (NvAPI_QueryInterface here is a pointer to the exported nvapi_QueryInterface):

typedef int (*NvAPI_Initialize_t)();

NvAPI_Initialize_t NvAPI_Initialize =
    (NvAPI_Initialize_t)(*NvAPI_QueryInterface)(0x0150E828);

A “magic number” 0x0150E828 is used as a memory address. One will have to get a list of such memory addresses from… somewhere… to be able to load other functions. It’s pretty hard to tell who could guarantee the stability of code that relies on such magic numbers in a dynamic library. But people are doing this for real.
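For completeness, here is a minimal sketch of how NvAPI_QueryInterface itself can be obtained and used with the same magic number as above (depending on the bitness of the process, the DLL is named nvapi.dll or nvapi64.dll; error handling is trimmed):

#include <windows.h>
#include <cstdio>

// nvapi_QueryInterface takes a magic number and returns a function pointer.
typedef void* (*NvAPI_QueryInterface_t)(unsigned int id);
typedef int (*NvAPI_Initialize_t)();

int main() {
    // Use "nvapi64.dll" for a 64-bit process, "nvapi.dll" for a 32-bit one.
    HMODULE lib = LoadLibraryA("nvapi64.dll");
    if (!lib) { std::puts("failed to load NVAPI"); return 1; }

    auto NvAPI_QueryInterface = reinterpret_cast<NvAPI_QueryInterface_t>(
        GetProcAddress(lib, "nvapi_QueryInterface"));
    if (!NvAPI_QueryInterface) { std::puts("nvapi_QueryInterface not found"); return 1; }

    // 0x0150E828 is the magic number for NvAPI_Initialize mentioned above.
    auto NvAPI_Initialize =
        reinterpret_cast<NvAPI_Initialize_t>(NvAPI_QueryInterface(0x0150E828));
    std::printf("NvAPI_Initialize returned %d\n", NvAPI_Initialize()); // 0 == NVAPI_OK
    return 0;
}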

Despite the obvious risks of using nvapi.dll and magic numbers, such an approach can be pretty tempting, or even the only option, as it provides access to certain functions that are not described by the official documentation and are not available in NVML.

For example, the official GPU Cooler Interface describes only a single function, NvAPI_GPU_GetTachReading, which allows one to get a GPU’s fan speed. The docs are silent regarding functions that allow controlling the fan speed, but there’s a hidden function at address 0x891FA0AE for that.

Another detail to point out is measurement units: NVML and NVAPI can have different units of measure for similar variables. For example, nvmlDeviceGetFanSpeed returns relative fan speed in % of max RPM, e.g. 42%, while NvAPI_GPU_GetTachReading gives absolute speed in RPM, e.g. 1291.

Next, NVML and NVAPI can provide different levels of granularity. For example, nvmlDeviceGetTemperature is the only NVML function for reading GPU temperature, and it returns a single scalar per device. In contrast, NVAPI provides the function NvAPI_GPU_GetThermalSettings, which reads the temperature of a specific sensor of a specific device, and currently there can be up to 3 thermal sensors per device.
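The difference is visible right in the signatures (the NVML one is taken from the reference docs, the NVAPI one from the SDK’s nvapi.h):

// NVML: one temperature scalar per device.
nvmlReturn_t nvmlDeviceGetTemperature(nvmlDevice_t device,
                                      nvmlTemperatureSensors_t sensorType,
                                      unsigned int* temp);

// NVAPI: readings per thermal sensor, selected by index.
NVAPI_INTERFACE NvAPI_GPU_GetThermalSettings(NvPhysicalGpuHandle hPhysicalGpu,
                                             NvU32 sensorIndex,
                                             NV_GPU_THERMAL_SETTINGS* pThermalSettings);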

Finally, again, NVAPI is for Windows only. *Sigh*.

API usage example

The following section provides an example of a GPU monitor written in C++17, which polls the following core metrics:

temperature;

fan speed;

power consumption;

GPU utilization;

GPU’s memory utilization.

The GPU API used in the example is provided by the NVML library. This choice is driven by the need not to be limited to Windows only, so the example works both on Windows and Linux-based systems.

Only a brief overview of the monitor will be provided here to avoid throwing too much code on readers. For those curious, there’s a full version in nvidia-gpu-monitoring repository on GitHub.

Despite being written in C++, the logic is not limited to C++: one can use the aforementioned Python or Perl bindings to achieve the same results.

Monitor’s structure

The layout of language modules provided by the example reflects the following components:

monitor — main component which controls execution flow;

nvml — wrapper for low-level NVML functions and data structures;

dlib — wrapper for OS-dependent dynamic library management (see the sketch below);

utils — utilities reused by multiple modules.
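As an illustration of what the dlib wrapper boils down to (the names here are illustrative, not the actual ones from the repository), the OS-dependent part is just a pair of thin shims:

#ifdef _WIN32
#include <windows.h>
// Windows: dynamic libraries are managed via LoadLibrary/GetProcAddress.
using LibHandle = HMODULE;
inline LibHandle openLibrary(const char* path) { return LoadLibraryA(path); }
inline void* loadSymbol(LibHandle lib, const char* name) {
    return reinterpret_cast<void*>(GetProcAddress(lib, name));
}
#else
#include <dlfcn.h>
// Linux: the same is done via dlopen/dlsym.
using LibHandle = void*;
inline LibHandle openLibrary(const char* path) { return dlopen(path, RTLD_LAZY); }
inline void* loadSymbol(LibHandle lib, const char* name) { return dlsym(lib, name); }
#endif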

Associations between these components are depicted in the component diagram below:

Figure 3 — Monitor components diagram