Nvidia-Settings
Nvidia Driver and Configuration Sites
The sites below are excellent references for Nvidia cards, drivers, and configuration settings:
http://us.download.nvidia.com/XFree86/Linux-x86/270.26/README/index.html
Really, read the driver notes above - they are packed with information. Just substitute your version as needed.
http://en.gentoo-wiki.com/wiki/Nvidia
https://wiki.archlinux.org/index.php/NVIDIA
Bug reporting
nvidia-bug-report.sh collects information from dmesg, /proc, xorg.conf, the X log, /var/log/messages, and similar sources, and bundles it all into a single report for posting on the NVIDIA forums or elsewhere.
Xid Errors
From the 270.41.19 Driver README
My kernel log contains messages that are prefixed with "Xid"; what do these messages mean?
"Xid" messages indicate that a general GPU error occurred, most often due to the driver misprogramming the GPU or to corruption of the commands sent to the GPU. These messages provide diagnostic information that can be used by NVIDIA to aid in debugging reported problems.
I have been unable to find any reference which defines the Xid error classes.
In one instance, I had a GPU that showed up fine in nvidia-smi, but any attempt to actually use it, including any call to cudaMalloc, would segfault and log:
NVRM: Xid (0000:02:00):48, An uncorrectable double bit error (DBE) has been detected on GPU0 (00 00 00).
It appears that one of the 8-pin power harnesses, which was actually a 6+2-pin harness, did not have its 2-pin part connected, so the GPU was slightly underpowered. At least, that seems to have been the cause.
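Since I have no reference for the Xid classes, the most I can do with these messages is pull out the PCI address and the Xid number so they can be searched for or counted. The sketch below does that for the one log line quoted above. The regular expression is my guess at the layout based on that single line, and the names `XID_PATTERN` and `parse_xid` are made up for illustration; other driver versions may format the message differently.

```python
import re

# Assumed layout, based only on the single log line quoted above:
#   NVRM: Xid (<pci address>):<number>, <message>
# Optional whitespace before the number is tolerated as a guess.
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]+)\):\s*(\d+),\s*(.*)")


def parse_xid(line):
    """Return (pci_address, xid_number, message), or None if no Xid is found."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    pci_address, number, message = match.groups()
    return pci_address, int(number), message.strip()


sample = ("NVRM: Xid (0000:02:00):48, An uncorrectable double bit error "
          "(DBE) has been detected on GPU0 (00 00 00).")
print(parse_xid(sample))
```

Feeding it lines from dmesg or /var/log/messages and grouping by PCI address would show whether one card is responsible for all of the errors.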
Kernel Interfaces
Also from the driver readme:
How can I see the source code to the kernel interface layer?
The source files to the kernel interface layer are in the kernel directory of the extracted .run file. To get to these sources, run:
# sh NVIDIA-Linux-x86-270.41.19.run --extract-only
# cd NVIDIA-Linux-x86-270.41.19/kernel/
Persistence Mode
When running CUDA on a compute node where X is not running, the Nvidia driver is loaded when needed and then unloaded again. This adds several seconds of sys time overhead to each run of a CUDA program, which can be eliminated by setting the GPUs to persistence mode as root (this requires CUDA 4.0, I believe).
With persistence mode set, the driver stays loaded between runs. (If a kernel crashes, you may need to manually unset and reset the mode.) It is unset by default.
$ time ./arraycp-gpu0
real 0m5.650s
user 0m0.002s
sys 0m5.500s
# nvidia-smi -pm 1
Enabled persistence mode for GPU 0:2:0.
Enabled persistence mode for GPU 0:3:0.
$ time ./arraycp-gpu0
real 0m0.374s
user 0m0.006s
sys 0m0.367s
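To put a number on the difference, here is a small sketch that converts the `time` values from the two runs above into seconds and subtracts them. The helper name `to_seconds` is made up for illustration, and the values are simply copied from the transcript; nothing here talks to the GPU.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` value such as '0m5.650s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the two runs shown above.
before = {"real": "0m5.650s", "sys": "0m5.500s"}
after = {"real": "0m0.374s", "sys": "0m0.367s"}

saved = to_seconds(before["real"]) - to_seconds(after["real"])
print(f"wall-clock time saved per run: {saved:.3f}s")
```

Almost all of the saving is sys time, which fits the explanation that the cost was the driver being loaded and unloaded around each run.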
Manual Fan Control for nVIDIA Settings
NOTE: Unless you have good cause, leave thermal management alone; otherwise you risk damaging your GPU.
(This entire section is duplicated from the Gentoo Wiki. Once again, I've found the Gentoo wiki to be without peer.)
Some combinations of nvidia cards and driver versions report that fan-speed is "variable", but do not actually ever change the fan speed regardless of temperature. If you experience an unreasonably hot GPU and nvidia-settings reports your fan speed as "Variable" but never leaves its assigned value, try the below.
It's probably a good idea to read about the CoolBits option before we begin. Take a look at the nvidia-settings manual (man nvidia-settings), and the nvidia-drivers manual, available at http://us.download.nvidia.com/XFree86/Linux-x86/195.36.24/README/xconfigoptions.html
(adjust the version in the URL as appropriate - be careful about looking at out-of-date documentation about the CoolBits option!)
The CoolBits setting is also used for overclocking. Make sure that you have checked this guide against the docs linked above (and update this article if things have changed).
File: /etc/X11/xorg.conf
Section "Device"
...
Option "Coolbits" "5" # Check Coolbits documentation for your driver version.
...
EndSection
If your card is described in multiple "Device" sections, put the above in each of them.
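The Coolbits value is a bitmask, so "5" is really 1 + 4. The sketch below splits a value into its set bits. The two labels in `ASSUMED_COOLBITS` are an assumption on my part, taken from README text for other driver versions rather than from this page, so confirm them against the README for your own driver. The function name `decode_coolbits` is made up for illustration.

```python
# Bit meanings are an assumption taken from NVIDIA README text for other
# driver versions; confirm them against the README for your own driver.
ASSUMED_COOLBITS = {
    1: "clock frequency controls (overclocking)",
    4: "manual GPU fan control",
}


def decode_coolbits(value):
    """Split a Coolbits value into its set bits, with a label when known."""
    result = []
    bit = 1
    while bit <= value:
        if value & bit:
            result.append((bit, ASSUMED_COOLBITS.get(bit, "not listed here")))
        bit <<= 1
    return result


for bit, meaning in decode_coolbits(5):
    print(bit, meaning)
```

If you only want fan control and no overclocking controls, this suggests a value of "4" may be enough, but check the documentation before relying on that.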
Be careful setting fan speed manually - it is possible to break your card by letting it get too hot!
Inside X, run nvidia-settings. You should now find "GPU Fan Settings" controls in the "Thermal Settings" section. My suggestion is to crank this up to 100.
You may also modify your fan speed from the command line:
Enable GPU fan control:
nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
Find out the fan's resource id using:
nvidia-settings -q fans
Then set the speed using:
nvidia-settings -a "[fan:0]/GPUCurrentFanSpeed=<n>"
where <n> is a percentage of full speed. Keep the quotes so the shell does not interpret the brackets.
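If you want to drive these commands from a script, as the Python fan-control script linked further down does, a minimal sketch might look like the following. It only builds the argument list from the two attributes shown above and rejects values outside 0 to 100; the function name `fan_speed_command` and its `gpu` and `fan` parameters are made up for illustration. Passing a list to `subprocess.run` avoids the shell quoting issue with the square brackets.

```python
def fan_speed_command(percent, gpu=0, fan=0):
    """Build the nvidia-settings argument list for a fixed fan speed."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
        "-a", f"[fan:{fan}]/GPUCurrentFanSpeed={percent}",
    ]


command = fan_speed_command(80)
print(" ".join(command))
# To apply it from inside a running X session (not executed here):
#   import subprocess
#   subprocess.run(command, check=True)
```

The same warning applies here as everywhere else in this section: a script that sets a low fixed speed and then exits leaves the card with no thermal response at all.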
These settings are not permanent. To have them take effect every time X is launched, add the following to your ~/.xinitrc:
nvidia-settings \
  -a "[gpu:0]/GPUFanControlState=1" \
  -a "[fan:0]/GPUCurrentFanSpeed=100" &
KDE 4 users will need to add a symlink to ~/.xinitrc in the Autostart directory since ~/.xinitrc isn't sourced by KDM:
cd ~/.kde4/Autostart
ln -s ~/.xinitrc xinitrc
chmod +x ~/.xinitrc
If ~/.xinitrc is not being autostarted, then make sure your ~/.xinitrc has a shebang (#!/bin/sh) at the top.
Reference Sites for Thermal Settings:
http://us.download.nvidia.com/XFree86/Linux-x86/195.36.24/README/xconfigoptions.html
http://forums.nvidia.com/index.php?showtopic=165327
Python Fan Control Script
A python script, which uses the above nvidia-settings options, can be found at:
http://www.evga.com/forums/tm.aspx?m=903040&mpage=1
Headless Fan Control
For multiple GPUs that are not attached to a monitor, see the following:
http://blog.cryptohaze.com/2011/02/nvidia-fan-speed-control-for-headless.html
http://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness
Custom EDID
http://analogbit.com/fix_nvidia_edid
Interesting bit on VGA vs. HDMI and HDCP, etc.
http://www.avsforum.com/avs-vb/showthread.php?t=997923