Some of the content in this article is most likely out of date. For newer information, see our more recent articles.

The latest addition to the NVIDIA GeForce line, the GTX 980 Ti, is a significant improvement over the GTX 980 for single precision compute tasks using CUDA. Its performance is very close to the astounding performance of the Titan X. The most significant difference is that the GTX 980 Ti's 6GB of memory is half of the Titan X's 12GB (but 2GB more than the 980's 4GB). For computational work the extra memory of the Titan X may be important to you. It can make life easier when you are working on new code and haven't yet optimized the memory buffering to keep the GPU loaded. In general, more memory is a good thing. The Titan X is so fast for single precision calculations that the extra memory can help keep large workloads flowing. However, many CUDA accelerated programs have excellent memory buffering from host to card, and for those programs 6GB will be enough to keep the CUDA cores loaded.
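As a rough illustration of the buffering point, here is a minimal Python sketch that estimates how many host-to-card transfers it takes to stream a working set through each card. The function name `chunks_needed`, the 10% reserve, the two-buffer split, and the 40 GiB working set are all assumptions made up for this example, not measurements from this post.

```python
import math

GIB = 1024 ** 3

def chunks_needed(dataset_bytes, card_bytes, reserve_fraction=0.1, buffers=2):
    """Estimate the number of transfers needed to stream a dataset through a card.

    reserve_fraction and buffers are illustrative assumptions: part of the
    card memory is held back, and the rest is split into `buffers` equal
    staging areas so one can be filled while another is being processed.
    """
    usable = card_bytes * (1.0 - reserve_fraction)
    chunk_bytes = usable / buffers
    return math.ceil(dataset_bytes / chunk_bytes)

dataset = 40 * GIB  # hypothetical working set, larger than any of the cards
for name, mem_gib in [("GTX 980", 4), ("GTX 980 Ti", 6), ("Titan X", 12)]:
    count = chunks_needed(dataset, mem_gib * GIB)
    print(f"{name:11s} {mem_gib:2d} GiB -> {count} chunks")
```

Fewer, larger chunks mean fewer transfers to schedule, which is one way the extra memory on the Titan X can make unoptimized code easier to keep fed.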

I'll be adding the GTX 980 Ti to my GPU computing test results from here on. I should have a post up on some molecular modeling applications soon. I've been working with GROMACS lately and I suspect that the 980 Ti may be the card of choice for it. For now I just want to show a simple CUDA benchmark on Linux so you can see what to expect for serious compute performance from the 980 Ti.

Getting the GTX 980 Ti to work on Linux

For some reason the 980 Ti was more of a problem to get working under Linux than the Titan X when it was first released. The Titan X "just worked" under Linux even before there was a functioning Windows driver! The 980 Ti required the latest NVIDIA beta driver!

System setup:

Dual 10 core Xeon 2687W v3 @ 3.1 GHz

64GB DDR4-2133 Reg ECC

NVIDIA Titan X 12GB X16 1000MHz core clock 3072 cuda cores

NVIDIA GTX 980 Ti 6GB X16 1000MHz core clock 2816 cuda cores

NVIDIA GTX 980 4GB X16 1126MHz core clock 2048 cuda cores

Linux CentOS 7

NVIDIA kernel driver modules version 352.09 (beta)

Make careful note of the NVIDIA driver version! I had to use the most recent beta driver to get the GTX 980 Ti to work. That meant installing from the downloaded shell archive (the .run file), so the driver install is outside of the normal package management and all of the usual difficulties with that situation apply. I generally prefer to have the NVIDIA driver installed from the "package managed" cuda repo.

The driver I used was,

http://us.download.nvidia.com/XFree86/Linux-x86_64/352.09/NVIDIA-Linux-x86_64-352.09.run

I suggest that you look at my post Install NVIDIA CUDA on Fedora 22 with gcc 5.1 for an approach you might want to consider taking to get CUDA and the beta driver installed. [ I do a CUDA install from the NVIDIA repo that fails with the driver install (but sets up everything for it and CUDA) and then install the NVIDIA shell archive .run file -- crazy but simple and effective! ]

The NVIDIA display driver from the current cuda 7.0-28 repo does not support the GTX 980 Ti. I had cuda 7 installed on the system already when I added the 980 Ti to the mix. It is device 1 in the following nvidia-smi output.

[kinghorn@tower Downloads]$ nvidia-smi
Mon Jun 1 15:31:35 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     N/A |                  N/A |
| 26%   37C    P8    N/A /  N/A |    116MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  ERR!                Off  | 0000:04:00.0     N/A |                  N/A |
| 22%   36C    P8    N/A /  N/A |     20MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 0000:81:00.0     Off |                  N/A |
| 22%   31C    P8    15W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Note: Device 2 is the Titan X which shows up as "Graphics Device" with the 346.46 display driver. Device 1 with the "ERR!" is the GTX 980 Ti.

After the 352.09 beta driver install we have,

[kinghorn@tower ~]$ nvidia-smi
Mon Jun 1 15:44:00 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.09     Driver Version: 352.09         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     N/A |                  N/A |
| 27%   46C    P2    N/A /  N/A |     81MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 0000:04:00.0     N/A |                  N/A |
| 22%   45C    P8    N/A /  N/A |     20MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:81:00.0     Off |                  N/A |
| 22%   39C    P8    16W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Yea! The GTX 980 Ti, device 1, is working and now has the distinguished title "Graphics Device", and the Titan X gets its proper name.
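The difference between the two nvidia-smi runs comes down to the driver version in the header. Here is a small Python sketch that pulls that version out of the header text and compares it against 352.09. Treating 352.09 as the minimum for the 980 Ti is an assumption based only on what worked in this post, and the helper names `driver_version` and `supports_980_ti` are made up for the example.

```python
import re

# Assumption: 352.09 (the beta driver used in this post) is the oldest
# driver that recognizes the GTX 980 Ti.
MIN_DRIVER = (352, 9)

def driver_version(smi_text):
    """Extract the 'Driver Version' field from nvidia-smi header text as a tuple of ints."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", smi_text)
    if match is None:
        raise ValueError("no 'Driver Version:' field found")
    return tuple(int(part) for part in match.group(1).split("."))

def supports_980_ti(smi_text, minimum=MIN_DRIVER):
    """True if the reported driver is at least the assumed minimum version."""
    return driver_version(smi_text) >= minimum

before = "| NVIDIA-SMI 346.46 Driver Version: 346.46 |"
after = "| NVIDIA-SMI 352.09 Driver Version: 352.09 |"
print(driver_version(before), supports_980_ti(before))
print(driver_version(after), supports_980_ti(after))
```

Comparing tuples of integers avoids the trap of comparing version strings as floating point numbers or as plain text.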

For a quick test of CUDA single precision floating point performance, here are the results from running the nbody simulation from the CUDA samples in benchmark mode with 256000 bodies:

GTX 980: 2555 GFLOP/s

GTX 980 Ti: 3355 GFLOP/s

Titan X: 3642 GFLOP/s

The GTX 980 Ti is going to be a great card for single precision compute loads!

Following is some more detailed output from the nbody simulation and deviceQuery output for your enjoyment :-)

[kinghorn@tower release]$ nvidia-smi -L
GPU 0: GeForce GTX 980 (UUID: GPU-477f0fd5-9db5-a015-e8b3-15ac96a06920)
GPU 1: Graphics Device (UUID: GPU-42f87dff-242a-17b2-7faf-2b4e18aec0d8)
GPU 2: GeForce GTX TITAN X (UUID: GPU-f195b1fa-16ec-ea58-8a17-146c0f93930e)
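Note that nvidia-smi and the CUDA samples number the cards differently in this post: the GTX 980 is GPU 0 for nvidia-smi but device 2 for nbody and deviceQuery. The PCI bus ID is the common key. The Python sketch below matches the two numberings using bus IDs copied from the nvidia-smi and deviceQuery output in this post. The helper names `bus_number` and `smi_to_cuda` are made up for the example.

```python
# PCI bus IDs from the nvidia-smi tables in this post
# (format is domain:bus:device.function, with the bus in hexadecimal).
SMI_BUS_IDS = {0: "0000:03:00.0", 1: "0000:04:00.0", 2: "0000:81:00.0"}

# "Bus ID" values from the deviceQuery output in this post (decimal).
CUDA_BUS_NUMBERS = {0: 129, 1: 4, 2: 3}

def bus_number(pci_id):
    """Return the bus field of a 'domain:bus:device.function' string as an int."""
    return int(pci_id.split(":")[1], 16)

def smi_to_cuda(smi_ids, cuda_buses):
    """Map each nvidia-smi index to the CUDA device index on the same PCI bus."""
    by_bus = {bus: cuda_idx for cuda_idx, bus in cuda_buses.items()}
    return {smi_idx: by_bus[bus_number(pci)] for smi_idx, pci in smi_ids.items()}

mapping = smi_to_cuda(SMI_BUS_IDS, CUDA_BUS_NUMBERS)
for smi_idx, cuda_idx in sorted(mapping.items()):
    print(f"nvidia-smi GPU {smi_idx} -> CUDA device {cuda_idx}")
```

CUDA also documents a `CUDA_DEVICE_ORDER=PCI_BUS_ID` environment variable that makes the runtime enumerate devices in bus order. It was not set for the runs shown here.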

[kinghorn@tower release]$ ./nbody -benchmark -numbodies=256000 -device=0
...
gpuDeviceInit() CUDA Device [0]: "GeForce GTX TITAN X
> Compute 5.2 CUDA device: [GeForce GTX TITAN X]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 3598.562 ms
= 182.117 billion interactions per second
= 3642.344 single-precision GFLOP/s at 20 flops per interaction

[kinghorn@tower release]$ ./nbody -benchmark -numbodies=256000 -device=1
...
gpuDeviceInit() CUDA Device [1]: "Graphics Device
> Compute 5.2 CUDA device: [Graphics Device]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 3906.858 ms
= 167.746 billion interactions per second
= 3354.921 single-precision GFLOP/s at 20 flops per interaction

[kinghorn@tower release]$ ./nbody -benchmark -numbodies=256000 -device=2
...
gpuDeviceInit() CUDA Device [2]: "GeForce GTX 980
> Compute 5.2 CUDA device: [GeForce GTX 980]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 5129.597 ms
= 127.761 billion interactions per second
= 3642.344 single-precision GFLOP/s at 20 flops per interaction
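The GFLOP/s figures follow directly from the timings: every body interacts with every other body on each iteration, and the sample counts 20 flops per interaction. A short Python sketch reproduces the arithmetic from the times printed by the three nbody runs in this post. The function name `nbody_gflops` is made up for the example.

```python
FLOPS_PER_INTERACTION = 20  # the constant reported in the nbody output

def nbody_gflops(bodies, iterations, total_ms):
    """Single-precision GFLOP/s for an all-pairs n-body benchmark run."""
    interactions = bodies * bodies * iterations
    interactions_per_second = interactions / (total_ms / 1000.0)
    return interactions_per_second * FLOPS_PER_INTERACTION / 1e9

# Total time in ms for 10 iterations, copied from the nbody runs in this post.
runs = {"Titan X": 3598.562, "GTX 980 Ti": 3906.858, "GTX 980": 5129.597}
results = {name: nbody_gflops(256000, 10, ms) for name, ms in runs.items()}

for name, gflops in results.items():
    print(f"{name:11s} {gflops:8.1f} GFLOP/s")

print(f"980 Ti vs Titan X: {results['GTX 980 Ti'] / results['Titan X']:.2f}")
print(f"980 Ti vs GTX 980: {results['GTX 980 Ti'] / results['GTX 980']:.2f}")
```

On these numbers the 980 Ti reaches about 92% of the Titan X result and comes out about 31% ahead of the GTX 980.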

[kinghorn@tower release]$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 3 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version 7.5 / 7.0
  CUDA Capability Major/Minor version number: 5.2
  Total amount of global memory: 12288 MBytes (12884705280 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
  GPU Max Clock rate: 1076 MHz (1.08 GHz)
  Memory Clock rate: 3505 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 3145728 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 129 / 0
  Compute Mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

Device 1: "Graphics Device"
  CUDA Driver Version / Runtime Version 7.5 / 7.0
  CUDA Capability Major/Minor version number: 5.2
  Total amount of global memory: 6144 MBytes (6442254336 bytes)
  (22) Multiprocessors, (128) CUDA Cores/MP: 2816 CUDA Cores
  GPU Max Clock rate: 1076 MHz (1.08 GHz)
  Memory Clock rate: 3505 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 3145728 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
  Compute Mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

Device 2: "GeForce GTX 980"
  CUDA Driver Version / Runtime Version 7.5 / 7.0
  CUDA Capability Major/Minor version number: 5.2
  Total amount of global memory: 4095 MBytes (4294246400 bytes)
  (16) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
  GPU Max Clock rate: 1216 MHz (1.22 GHz)
  Memory Clock rate: 3505 Mhz
  Memory Bus Width: 256-bit
  L2 Cache Size: 2097152 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
  Compute Mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

> Peer access from GeForce GTX TITAN X (GPU0) -> Graphics Device (GPU1) : No
> Peer access from GeForce GTX TITAN X (GPU0) -> GeForce GTX 980 (GPU2) : No
> Peer access from Graphics Device (GPU1) -> Graphics Device (GPU1) : No
> Peer access from Graphics Device (GPU1) -> GeForce GTX 980 (GPU2) : No
> Peer access from Graphics Device (GPU1) -> GeForce GTX TITAN X (GPU0) : No
> Peer access from Graphics Device (GPU1) -> Graphics Device (GPU1) : No
> Peer access from GeForce GTX 980 (GPU2) -> GeForce GTX TITAN X (GPU0) : No
> Peer access from GeForce GTX 980 (GPU2) -> Graphics Device (GPU1) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.0, NumDevs = 3, Device0 = GeForce GTX TITAN X, Device1 = Graphics Device, Device2 = GeForce GTX 980
Result = PASS
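The multiprocessor counts and max clock rates reported by deviceQuery are enough for a back-of-the-envelope peak single precision estimate: cores times clock times 2, since each CUDA core can retire one fused multiply-add (2 flops) per cycle. The Python sketch below uses the values from the deviceQuery output in this post. The function name `peak_sp_gflops` is made up for the example, and it assumes the card sustains its reported max clock, so treat the result as an upper bound.

```python
def peak_sp_gflops(multiprocessors, cores_per_mp, clock_mhz):
    """Theoretical peak single-precision GFLOP/s at the given clock.

    Assumes 2 flops per core per cycle (one fused multiply-add) and that
    the card actually runs at clock_mhz, which is an upper-bound assumption.
    """
    cores = multiprocessors * cores_per_mp
    return 2 * cores * clock_mhz / 1000.0

# (multiprocessors, cores per multiprocessor, max clock in MHz) from deviceQuery.
devices = {
    "Titan X": (24, 128, 1076),
    "GTX 980 Ti": (22, 128, 1076),
    "GTX 980": (16, 128, 1216),
}

# Measured single-precision GFLOP/s from the nbody runs in this post.
measured = {"Titan X": 3642.344, "GTX 980 Ti": 3354.921, "GTX 980": 2555.210}

for name, spec in devices.items():
    peak = peak_sp_gflops(*spec)
    share = measured[name] / peak
    print(f"{name:11s} peak {peak:7.1f} GFLOP/s, nbody reaches {share:.0%}")
```

By this estimate the nbody benchmark reaches a bit over half of the theoretical peak on all three cards, which is why it is better read as a relative comparison than as an absolute measure of what the hardware can do.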

Happy computing! --dbk