[gmx-developers] 4.6 beta1 not detecting GPU or CUDA

Szilárd Páll szilard.pall at cbr.su.se
Mon Dec 3 18:30:14 CET 2012


As far as I can tell, the errors are not related to GROMACS, but to an
incorrect system configuration or an incorrectly built/linked mdrun (see
specific comments inline).

On Mon, Dec 3, 2012 at 3:02 PM, Justin Lemkul <jalemkul at vt.edu> wrote:

>
> Hi All,
>
> I'm having a strange problem and I'm hoping I can get some help diagnosing
> it. I compiled the beta release on our AMD cluster that has Tesla S2050
> GPUs, but I haven't been able to successfully run anything yet.  Our
> admins don't know much specifically about Gromacs, and it seems everything
> should be working, but somehow mdrun is not detecting CUDA or finding the
> GPU card on the compute nodes.
>
> We have CUDA 3.1, 3.2, and 4.0 available on the cluster, and I can
> replicate this problem with both 3.2 and 4.0.  It appears that mdrun is
> linked properly to the CUDA libraries:
>
> $ ldd mdrun
>         libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaaacc8000)
>         libgmxpreprocess.so.6 => /home/jalemkul/ATHENA/software/gromacs-46beta1/bin/../lib/libgmxpreprocess.so.6 (0x00002aaaaaecc000)
>         libmd.so.6 => /home/jalemkul/ATHENA/software/gromacs-46beta1/bin/../lib/libmd.so.6 (0x00002aaaab1b2000)
>         libgmx.so.6 => /home/jalemkul/ATHENA/software/gromacs-46beta1/bin/../lib/libgmx.so.6 (0x00002aaaab7a2000)
>         libm.so.6 => /lib64/libm.so.6 (0x00002aaaabfd0000)
>         libcudart.so.4 => /cm/shared/apps/cuda40/toolkit/4.0.17/lib64/libcudart.so.4 (0x00002aaaac253000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaac4c1000)
>         libfftw3f.so.3 => /home/jalemkul/ATHENA/software/fftw-3.3.3/lib/libfftw3f.so.3 (0x00002aaaac6dc000)
>         libstdc++.so.6 => /cm/shared/apps/gcc/4.3.4/lib64/libstdc++.so.6 (0x00002aaaaca58000)
>         libgomp.so.1 => /cm/shared/apps/gcc/4.3.4/lib64/libgomp.so.1 (0x00002aaaacd5f000)
>         libgcc_s.so.1 => /cm/shared/apps/gcc/4.3.4/lib64/libgcc_s.so.1 (0x00002aaaacf67000)
>         libc.so.6 => /lib64/libc.so.6 (0x00002aaaad17d000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
>         librt.so.1 => /lib64/librt.so.1 (0x00002aaaad4d5000)
>
> We have a module system, so by simply using 'module load cuda40' I get all
> of the CUDA stuff loaded properly, i.e.:
>
> $ echo $LD_LIBRARY_PATH
> /cm/shared/apps/gcc/4.3.4/lib:/cm/shared/apps/gcc/4.3.4/lib64:/cm/shared/apps/cuda40/toolkit/4.0.17/lib64:/cm/local/apps/cuda40/libs/270.41.19/usr/lib64:/cm/shared/apps/cuda40/sdk/4.0.17/C/lib:/cm/shared/apps/cuda40/sdk/4.0.17/OpenCL/common/lib/Linux64
>
> So it appears mdrun should be able to load the necessary libraries.  When
> I execute the run:
>
> $ mdrun -deffnm md -nb gpu -gpu_id 0
>
> I get the following in the .log file:
>
> Log file opened on Mon Dec  3 08:55:51 2012
> Host: athena002  pid: 21368  nodeid: 0  nnodes:  1
> Gromacs version:    VERSION 4.6-beta1
> Precision:          single
> MPI library:        thread_mpi
> OpenMP support:     enabled
> GPU support:        enabled
> invsqrt routine:    gmx_software_invsqrt(x)
> CPU acceleration:   SSE2
> FFT library:        fftw-3.3.3-sse2
> Large file support: enabled
> RDTSCP usage:       enabled
> Built on:           Thu Nov 29 21:38:43 EST 2012
> Built by:           jalemkul at athena1 [CMAKE]
> Build OS/arch:      Linux 2.6.18-194.11.4.el5 x86_64
> Build CPU vendor:   AuthenticAMD
> Build CPU brand:    AMD Opteron(tm) Processor 6134
> Build CPU family:   16   Model: 9   Stepping: 1
> Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm misalignsse mmx
> msr nonstop_tsc pdpe1gb popcnt pse rdtscp sse2 sse3 sse4a
> C compiler:         /cm/shared/apps/gcc/4.3.4/bin/gcc GNU gcc (GCC) 4.3.4
> C compiler flags:   -msse2  -Wextra -Wno-missing-field-initializers
> -Wno-sign-compare -Wall -Wno-unused -Wunused-value   -fomit-frame-pointer
> -funroll-all-loops  -O3 -DNDEBUG
> C++ compiler:       /cm/shared/apps/gcc/4.3.4/bin/g++ GNU g++ (GCC) 4.3.4
> C++ compiler flags: -msse2  -Wextra -Wno-missing-field-initializers
> -Wno-sign-compare -Wall -Wno-unused -Wunused-value   -fomit-frame-pointer
> -funroll-all-loops  -O3 -DNDEBUG
> CUDA compiler:      nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c)
> 2005-2011 NVIDIA Corporation;Built on Thu_May_12_11:09:45_PDT_2011;Cuda
> compilation tools, release 4.0, V0.2.1221
> CUDA driver:        0.0
> CUDA runtime:       0.0
>

This is very fishy: both the driver and the runtime version should be 4.0!


>
> From the last two lines, it seems the driver and runtime libraries are
> not found.  Later in the .log file, it seems that the GPU itself is not
> found:
>
> NOTE: Error occurred during GPU detection:
>       CUDA driver version is insufficient for CUDA runtime version
>

This is the exact error that the CUDA runtime returns, and it means that
libcudart.so (the CUDA runtime library) is not compatible with the installed
GPU driver. You can check the driver version by running nvidia-smi.

However, this sounds quite fishy, because based on the library path above
you are supposed to be using CUDA 4.0 with the 270.41 drivers, which should
be compatible.
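
To illustrate what the runtime is comparing: CUDA reports both versions as
integers encoded as 1000*major + 10*minor (so 4.0 is 4000), and the
documentation for cudaDriverGetVersion says it returns 0 when no driver is
installed. Below is a minimal Python sketch of that comparison. The helper
names are made up for illustration, and this is not GROMACS code.

```python
def decode_cuda_version(code):
    """Split a CUDA version integer (1000*major + 10*minor) into (major, minor)."""
    return code // 1000, (code % 1000) // 10


def format_cuda_version(code):
    """Render a CUDA version integer as 'major.minor', e.g. 4000 becomes '4.0'."""
    major, minor = decode_cuda_version(code)
    return f"{major}.{minor}"


def driver_is_sufficient(driver_code, runtime_code):
    """True if the driver supports at least the CUDA version of the runtime.

    A driver code of 0 is treated as 'no usable driver found'.
    """
    return driver_code != 0 and driver_code >= runtime_code


if __name__ == "__main__":
    # Hypothetical (driver, runtime) pairs: matching, driver too old, no driver.
    for driver, runtime in [(4000, 4000), (3020, 4000), (0, 4000)]:
        verdict = "ok" if driver_is_sufficient(driver, runtime) else "insufficient"
        print(format_cuda_version(driver), format_cuda_version(runtime), verdict)
```

A driver/runtime pair printed as 0.0 would be consistent with both queries
having come back as 0, i.e. the runtime never got a usable answer from the
driver.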



>       Can not use GPU acceleration, will fall back to CPU kernels.
>
>
> No GPUs detected
>
>
> -------------------------------------------------------
> Program mdrun, VERSION 4.6-beta1
> Source code file: /home/jalemkul/gromacs-4.6-beta1/src/gmxlib/gmx_detect_hardware.c, line: 567
>
> Fatal error:
> Some of the requested GPUs do not exist, behave strangely, or are not
> compatible:
>     GPU #0: inexistent
>

You get this error because you explicitly requested GPU #0, which is
reported as nonexistent because no GPU could be detected.


>
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Any ideas on troubleshooting, or things I can tell our admins to help move
> things along?  I've got 4.6beta1 working fine on a local workstation in our
> lab, but the installation on the university's cluster is nonfunctional from
> the standpoint of the GPU.
>

Try to compile and run a simple program; something from the CUDA SDK, e.g.
deviceQuery, is ideal. If that works with the same compiler setup and run
settings on the same hardware (and shows the CUDA driver/runtime versions
correctly!), then it could be a library pre-loading issue or, in the worst
case, a GROMACS bug.
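
If building the SDK samples is inconvenient, a rough stand-in for the version
part of deviceQuery is to call the runtime library directly. The sketch below
assumes Python with ctypes is available on the compute node and that
libcudart.so can be found through LD_LIBRARY_PATH (pass a versioned name such
as libcudart.so.4 if the unversioned symlink is missing). query_cuda_versions
is a hypothetical helper, not part of GROMACS or the CUDA SDK.

```python
import ctypes


def query_cuda_versions(libname="libcudart.so"):
    """Ask the CUDA runtime for driver version, runtime version and device count.

    Returns None if the library cannot be loaded, otherwise a dict holding
    each value together with the error code of the call (0 means success).
    """
    try:
        cudart = ctypes.CDLL(libname)
    except OSError:
        return None

    driver = ctypes.c_int(0)
    runtime = ctypes.c_int(0)
    count = ctypes.c_int(0)

    # Each call returns a cudaError_t and fills in the int passed by pointer.
    err_driver = cudart.cudaDriverGetVersion(ctypes.byref(driver))
    err_runtime = cudart.cudaRuntimeGetVersion(ctypes.byref(runtime))
    err_count = cudart.cudaGetDeviceCount(ctypes.byref(count))

    return {
        "driver": (driver.value, err_driver),
        "runtime": (runtime.value, err_runtime),
        "device_count": (count.value, err_count),
    }


if __name__ == "__main__":
    result = query_cuda_versions()
    if result is None:
        print("could not load the CUDA runtime library")
    else:
        for name, (value, err) in result.items():
            print(f"{name}: value={value} error_code={err}")
```

Run it in the same job environment as mdrun (same modules loaded, same node).
Nonzero error codes or a driver version of 0 would point at the system setup
rather than at GROMACS.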

Cheers,
--
Szilárd


>
> -Justin
>
> --
> ========================================
>
> Justin A. Lemkul, Ph.D.
> Research Scientist
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================